44 stories tagged with #benchmarks, in publish-time order across the WeSearch catalog. Tag pages update as new stories ingest.
Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive
Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…
AMD EPYC 8635P "Sorano" Benchmarks: Significant Upgrade Opportunity For EPYC 8004 Servers
What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents
Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained un…
Eqbench: Emotional Intelligence Benchmarks for LLMs
Research repository for the Americas – benchmarks, models, governance
Open research repository for federated, regionally-grounded AI development across the Western Hemisphere. Maintained by GENIA Americas / RaceFor.AI. - GENIA-Americas/multimodal-ai-…
Claude Opus 4.8 Is Here: Benchmarks, Dynamic Workflows, and Whether to Upgrade From 4.7
Anthropic shipped Claude Opus 4.8 yesterday. It catches 4x more of its own code mistakes, runs hundreds of parallel subagents through Dynamic Workflows, and keeps the same price as…
LLM Benchmarks, Agent Frameworks, and the Tools That Matter in 2026 [03:37:09]
An in-depth look at the AI agent revolution reshaping software development and business automation in 2026.…
Nvidia offers restricted access to Vera CPU in first round of Linux benchmarks - 88-core monster competes with or beats Epyc and Xeon in selected tests
It's running very close to AMD's EPYC, which is incredible for a first-generation custom server core from NVIDIA.…
AI 3D tools need product evals, not benchmark faith
If you’re building AI-assisted 3D or CAD-like workflows, benchmark scores only get you so far. The real work is designing evals around your product contract and catching geometry f…
Constraint acquisition needs better benchmarks
Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by …
Initial benchmarks show Nvidia's Vera CPU, which features 88 in-house-designed Olympus cores, packs a heavy-hitting punch, beating Intel's and AMD's x86_64 CPUs (Michael Larabel/Phoronix)
Nvidia Vera CPU impresses in early Nvidia-sanctioned benchmarks
NVIDIA Vera CPU Benchmarks: Olympus Cores Delivering The Best Performance Ever Seen On ARM
Why We Need Behavioral Benchmarks for LLMs — Not Just More Knowledge Tests
Would you hire an engineer based on their SAT score? Of course not. You look at how they solve...…
You don't need all the LLM benchmarks
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has bec…
Chinese GPU maker sells out over 30,000 gaming GPUs within 48 hours despite lukewarm benchmarks — LX 7G100 proves hype trumps performance
The paper tiger that's flying off the hardware shelves.…
Qwen 3.6 benchmarks on 2x RTX PRO 6000
Design and Report Benchmarks for Knowledge Work
The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and ben…
Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly re…
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of ta…
Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and …
A brief-ish (author-consulted) guide for when to use boost::hub over plf::hive/colony, with benchmarks
Text Degeneration: A Production Failure Mode That Most Benchmarks Do Not Track
A Blog post by Dharma-AI on Hugging Face…
Ex-Google DeepMind Researcher Warns Benchmarks Won’t Save Us
Mark this.…
The bloated CPP Investment Board is trounced by its own benchmarks – again
For two decades, managers have tried – and failed – to beat the markets…
DOA: Cyberpower Pre-Built Gaming PC Doesn't Even Turn On | Review, Thermals, & Benchmarks
Gemini 3.5 flash beating gpt 5.5, a bigger and pricier model, in agentic benchmarks (second image is from zapier automation benchmarks)
What AI coding benchmarks still miss about software quality
Passing tests don't tell the whole story — your AI codebase may be quietly rotting…
Initial Benchmarks Of The SpacemiT K3 RVA23 RISC-V CPU With The K3 Pico-ITX
Honest Perf Benchmarks for a Paid-API Compiler
Four PRs, three releases, and a benchmark suite that won't lie to you: seeded-RNG corpora, double-gated Claude scenarios, and skipped-but-recorded records.…
China leaves lending benchmarks unchanged for 12th month in May - Reuters
Thanks to feedback from here I refactored my string pipeline library to focus more on CodePoint operations. The allocation reduction ended up improving benchmarks way more than I expected. <3 Thanks again.
Fix LCP, INP & CLS in 2026: The Complete Core Web Vitals Guide (With Real Benchmarks)
TL;DR Core Web Vitals (LCP, INP, CLS) directly impact your SEO rankings, bounce rates, and...…
Your benchmarks are lying to you, and your judge is to blame!
Last week I published a benchmark comparing six models across eleven agent skills. The numbers in...…
How does an expected pension impact the standard "save x times your salary by y age" retirement benchmarks?
Benchmarks- Kubernetes MCP Servers Passed. That Was Not Enough.
Kubernetes MCP servers passed our live benchmark. That was not the interesting part. The interesting...…
Big new memory tool with local benchmarks
I built a coding agent that gets 87% on benchmarks with a 4B parameter model, here's how
GPU Hardware & Driver Update: RTX 5090 Benchmarks, llama.cpp MTP, Windows 11 Fix
Aggregate Benchmarks Lie. Here's What 700 AI Functions Look Like by Security Domain.
Part 3 ranked 5 AI models by overall vulnerability rate. But when we broke the data down by security domain — database, auth, file I/O, command execution — the rankings inverted. T…
Let's Talk about Benchmarks
SpacetimeDB is a real-time backend framework and database for apps and games. Write server logic in TypeScript, C#, C++, or Rust with automatic client synchronization.…
The Core Ultra 7 270K was too good, so Intel scrapped the flagship Core Ultra 9 290K Plus — benchmarks of the 290K prototype find slim 2% faster performance in gaming and applications
The Core Ultra 9 290K Plus just wouldn't have made sense.…
Xiaomi releases MiMo-v2.5 Family weights with strong coding and agent benchmarks
Peking University gives its computer science students a compiler project every semester. Build a complete SysY compiler in Rust including lexer, parser, abstract syntax tree, IR co…