60 stories tagged with #benchmark, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
BEAVER: Enterprise benchmark for LLM Text-to-SQL from private data warehouses
Xiaomi releases MiMo Code V0.1.0, an open-source AI coding assistant that it says outperforms Claude Code on agentic coding and software engineering benchmarks (Carl Franzen/VentureBeat)
Waymo says it built a better benchmark for comparing robotaxis to humans
Waymo created a new computer model to help it better understand how humans behave in crash scenarios that its robotaxis encounter.…
Benchmarks in Leipzig
Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers. Most of the work was done during the 3…
OpenAI research and product leads detail GPT-Rosalind capabilities and benchmarks - R&D World
Chrome for Mac breaks benchmark records on the latest MacBook Pro
Google has shared the results of the latest Chrome performance benchmarks, including record scores on tests running on an M5 MacBook Pro.…
Today Marks 22 Years Of Phoronix For Linux Hardware Testing & Benchmarking
Today marks 22 years since I started Phoronix.com to focus on Linux hardware reviews…
Benchmark raises its first-ever growth fund as part of $2B capital raise
The legendary firm abandons its more-than-20-year tradition of keeping its funds to about $425 million.…
Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive
Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…
AMD EPYC 8635P "Sorano" Benchmarks: Significant Upgrade Opportunity For EPYC 8004 Servers
Cross Cloud A2A Agent Benchmarking
Building a Benchmarking Agent with A2A and MCP. This tutorial aims to build and test…
Trump signs AI executive order seeking 30-day government access to frontier models before release — voluntary framework will include classified benchmark to determine which models qualify
The voluntary framework avoids mandatory licensing but gives the government a say in which firms get early access.…
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scala…
What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents
Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained un…
DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents …
GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly unde…
ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models
Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchma…
MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmark…
The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data member…
Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition
Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordin…
Why do we benchmark quants on perplexity and prose but never on tool call validity?
Someone benchmarked how accurate different AI models are on Excel documents
We Benchmarked Our Open Source Memory Tool Against a Microsoft Research Paper
by VEKTOR Memory — 10 min read…
Benchmarking time-series databases for ecommerce infrastructure monitoring
Time-series database performance under ecommerce load: real benchmark results. Your…
PROMPTPurify: 14 MB CPU-only prompt-injection guard (benchmarked vs. OSS guard)
Prompt-injection guardrail for LLM applications. Compact model that outperforms larger open-source guards. No regex, no signatures. Demo: anton.securelayer7.net - securelayer7/PROM…
From Benchmarketing to Benchmaxxing
AI benchmarks repeat 40 years of database benchmarketing mistakes. Learn why standard evals fall short and how to build your own.…
Eqbench: Emotional Intelligence Benchmarks for LLMs
CVE-Bench: testing LLM agents on real-world vulnerability patches
Benchmarking LLMs on real-world CVE patching…
Ask HN: How would you benchmark your engineering team's AI adoption?
Test yourself against local open-source LLMs benchmark questions
Arm Metis with GPT5.5 Cyber scores 98% on firmware vulnerability benchmark
Arm Metis is an open-source agentic AI security framework that helps detect software vulnerabilities earlier.…
Mahanadu has set new benchmark in political mobilisation, says TDP State president Palla Srinivasa Rao
TDP's Mahanadu-2026 sets a new political mobilisation benchmark, advocating 33% women’s representation and fostering democratic participation.…
Research repository for the Americas – benchmarks, models, governance
Open research repository for federated, regionally-grounded AI development across the Western Hemisphere. Maintained by GENIA Americas / RaceFor.AI. - GENIA-Americas/multimodal-ai-…
BenchBench
TL;DR: presenting the ultimate benchmark, getting models to create benchmarks for each other, and GPT 5.2 is the current (only) winner…
Claude Opus 4.8 Is Here: Benchmarks, Dynamic Workflows, and Whether to Upgrade From 4.7
Anthropic shipped Claude Opus 4.8 yesterday. It catches 4x more of its own code mistakes, runs hundreds of parallel subagents through Dynamic Workflows, and keeps the same price as…
Updated imagor 1.9.1 benchmark results for dynamic image processing
StepFun 3.7 Flash - Speed Benchmark in M5 Max
LLM Benchmarks, Agent Frameworks, and the Tools That Matter in 2026 [03:37:09]
An in-depth look at the AI agent revolution reshaping software development and business automation in 2026.…
I benchmarked 6 self-hosted book server apps up to 150K books (ingestion time + RAM/CPU)
ChatGPT-5.5 Beats Opus in Realistic Benchmark (DeepSWE)
DeepSWE finally a proper coding benchmark
Buyout Game Benchmark: 8 models play a social strategy game with public balances, private transfers, messaging, eliminations, deals, defections, and a final buyout phase. 804 games. GPT-5.5 is the champion. Opus 4.7 performs well.
BEAM 100K memory benchmark: CSM vs Hindsight local artifact comparison [R]
CostBench: an open benchmark for data warehouse cost-performance
Introducing CostBench, an open benchmark that turns cloud data warehouse runtime and billing models into comparable performance-per-dollar results.…
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
A Blog post by IBM Research on Hugging Face…
Benchmarking LLMs for Web Tasks
Comparisons of how LLMs perform for a bunch of web tasks…
MalShark: MCP-Powered Malware Traffic Analysis — Benchmarked Against Real Malware
Nvidia offers restricted access to Vera CPU in first round of Linux benchmarks - 88-core monster competes with or beats Epyc and Xeon in selected tests
It's running very close to AMD's EPYC, which is incredible for a first-generation custom server core from NVIDIA.…
noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]
Datacurve releases the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, and says GPT-5.5 is the leader at 70% (Michael Nuñez/VentureBeat)
New DeepSWE benchmark finds Claude Opus cheats
AI 3D tools need product evals, not benchmark faith
If you’re building AI-assisted 3D or CAD-like workflows, benchmark scores only get you so far. The real work is designing evals around your product contract and catching geometry f…
Chronos vs Toto: Zero-Shot Forecasting Benchmark Results
Introduction. Good forecasts help with capacity planning and quieter alerts. But one…
Constraint acquisition needs better benchmarks
Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by …
Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism…
OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, wh…
Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation
Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based …
VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents
We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchma…