WeSearch
TAG · #BENCHMARKS

Benchmarks coverage.

All 44 stories in the WeSearch catalog tagged with #benchmarks, listed in publish-time order with view counts. Tag pages update as new stories are ingested. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

RELATED TAGS
#technology (2) #ai (2) #cpus (1) #intel (1) #hardware (1) #deepmind (1) #ai-models (1) #ml (1) #language-models (1) #3d (1) #evaluation (1) #cad (1)
THEHIVERYIQ

Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive

Hive primitives benchmarked against published SOTA adversaries. Every result is a signed Ed25519 receipt from hivemorph — queryable, tamper-evident, reproducible.…

8 views · #ai #technology #benchmarking
PHORONIX

AMD EPYC 8635P "Sorano" Benchmarks: Significant Upgrade Opportunity For EPYC 8004 Servers

3 views
ARXIV CS.AI

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained un…

6 views · #artificial intelligence #autonomous agents #evaluation
EQBENCH

Eqbench: Emotional Intelligence Benchmarks for LLMs

11 views · #technology #artificial intelligence #emotional intelligence
GITHUB

Research repository for the Americas – benchmarks, models, governance

Open research repository for federated, regionally-grounded AI development across the Western Hemisphere. Maintained by GENIA Americas / RaceFor.AI. - GENIA-Americas/multimodal-ai-…

13 views · #ai #research #governance
DEV.TO (TOP)

Claude Opus 4.8 Is Here: Benchmarks, Dynamic Workflows, and Whether to Upgrade From 4.7

Anthropic shipped Claude Opus 4.8 yesterday. It catches 4x more of its own code mistakes, runs hundreds of parallel subagents through Dynamic Workflows, and keeps the same price as…

7 views · #ai #coding #productivity
DEV.TO (TOP)

LLM Benchmarks, Agent Frameworks, and the Tools That Matter in 2026 [03:37:09]

An in-depth look at the AI agent revolution reshaping software development and business automation in 2026.…

7 views · #ai #automation #technology
TOM'S HARDWARE

Nvidia offers restricted access to Vera CPU in first round of Linux benchmarks - 88-core monster competes with or beats Epyc and Xeon in selected tests

It's running very close to AMD's EPYC, which is incredible for a first-generation custom server core from NVIDIA.…

12 views · #nvidia #cpu #benchmarking
DEV.TO (TOP)

AI 3D tools need product evals, not benchmark faith

If you’re building AI-assisted 3D or CAD-like workflows, benchmark scores only get you so far. The real work is designing evals around your product contract and catching geometry f…

11 views · #ai #3d #evaluation
ARXIV CS.AI

Constraint acquisition needs better benchmarks

Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by …

14 views · #artificial intelligence #benchmarking #mathematical programming
TECHMEME

Initial benchmarks show Nvidia's Vera CPU, which features 88 in-house-designed Olympus cores, packs a heavy-hitting punch, beating Intel's and AMD's x86_64 CPUs (Michael Larabel/Phoronix)

15 views
TECHSPOT

Nvidia Vera CPU impresses in early Nvidia-sanctioned benchmarks

12 views
PHORONIX

NVIDIA Vera CPU Benchmarks: Olympus Cores Delivering The Best Performance Ever Seen On ARM

10 views
DEV.TO (TOP)

Why We Need Behavioral Benchmarks for LLMs — Not Just More Knowledge Tests

Would you hire an engineer based on their SAT score? Of course not. You look at how they solve...…

11 views · #ai #programming #evaluation
SMOLA

You don't need all the LLM benchmarks

10 views · #machine learning #language models
ARXIV CS.AI

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has bec…

15 views · #artificial intelligence #machine learning #performance evaluation
TOM'S HARDWARE

Chinese GPU maker sells out over 30,000 gaming GPUs within 48 hours despite lukewarm benchmarks — LX 7G100 proves hype trumps performance

The paper tiger that's flying off the hardware shelves.…

19 views · #technology #gaming #hardware
R/LOCALLLAMA

Qwen 3.6 benchmarks on 2x RTX PRO 6000

8 views
ARXIV CS.AI

Design and Report Benchmarks for Knowledge Work

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and ben…

8 views · #artificial intelligence #benchmarking #knowledge work
ARXIV CS.AI

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly re…

7 views · #computer vision #artificial intelligence #language processing
ARXIV CS.AI

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of ta…

9 views · #artificial intelligence #machine learning #computation
ARXIV CS.AI

Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and …

13 views · #cybersecurity #artificial intelligence #machine learning
R/CPP

A brief-ish (author-consulted) guide for when to use boost::hub over plf::hive/colony, with benchmarks

10 views
HUGGING FACE BLOG

Text Degeneration: A Production Failure Mode That Most Benchmarks Do Not Track

A Blog post by Dharma-AI on Hugging Face…

11 views · #language models #machine learning #ai research
GIZMODO

Ex-Google DeepMind Researcher Warns Benchmarks Won’t Save Us

Mark this.…

12 views · #artificial-intelligence #deepmind
THE GLOBE AND MAIL

The bloated CPP Investment Board is trounced by its own benchmarks – again

For two decades, managers have tried – and failed – to beat the markets…

11 views · #pension #investment #finance
R/HARDWARE

DOA: Cyberpower Pre-Built Gaming PC Doesn't Even Turn On | Review, Thermals, & Benchmarks

11 views
R/OPENAI

Gemini 3.5 flash beating gpt 5.5 a bigger and more pricer model in agentic benchmarks (second image is from zapier automation benchmarks)

9 views
TECHRADAR

What AI coding benchmarks still miss about software quality

Passing tests don't tell the whole story — your AI codebase may be quietly rotting…

13 views · #ai #software #coding
PHORONIX

Initial Benchmarks Of The SpacemiT K3 RVA23 RISC-V CPU With The K3 Pico-ITX

15 views
DEV.TO (TOP)

Honest Perf Benchmarks for a Paid-API Compiler

Four PRs, three releases, and a benchmark suite that won't lie to you: seeded-RNG corpora, double-gated Claude scenarios, and skipped-but-recorded records.…

10 views · #typescript #benchmarking #api
GOOGLE NEWS

China leaves lending benchmarks unchanged for 12th month in May - Reuters

15 views
R/JAVA

Thanks to feedback from here I refactored my string pipeline library to focus more on CodePoint operations. The allocation reduction ended up improving benchmarks way more than I expected. <3 Thanks again.

10 views
DEV.TO (TOP)

Fix LCP, INP & CLS in 2026: The Complete Core Web Vitals Guide (With Real Benchmarks)

TL;DR Core Web Vitals (LCP, INP, CLS) directly impact your SEO rankings, bounce rates, and...…

13 views · #webperf #seo #javascript
DEV.TO (TOP)

Your benchmarks are lying to you, and your judge is to blame!

Last week I published a benchmark comparing six models across eleven agent skills. The numbers in...…

9 views · #ai #benchmarking #evaluation
R/PERSONALFINANCE

How does an expected pension impact the standard "save x times your salary by y age" retirement benchmarks?

11 views
DEV.TO (TOP)

Benchmarks- Kubernetes MCP Servers Passed. That Was Not Enough.

Kubernetes MCP servers passed our live benchmark. That was not the interesting part. The interesting...…

8 views · #kubernetes #ai #benchmark
R/LOCALLLAMA

Big new memory tool with local benchmarks

17 views
R/LOCALLLAMA

I built a coding agent that gets 87% on benchmarks with a 4B parameter model, here's how

16 views
DEV.TO (TOP)

GPU Hardware & Driver Update: RTX 5090 Benchmarks, llama.cpp MTP, Windows 11 Fix

14 views · #gpu #nvidia #windows
DEV.TO (TOP)

Aggregate Benchmarks Lie. Here's What 700 AI Functions Look Like by Security Domain.

Part 3 ranked 5 AI models by overall vulnerability rate. But when we broke the data down by security domain — database, auth, file I/O, command execution — the rankings inverted. T…

10 views · #ai #security #benchmarking
SPACETIMEDB

Let's Talk about Benchmarks

SpacetimeDB is a real-time backend framework and database for apps and games. Write server logic in TypeScript, C#, C++, or Rust with automatic client synchronization.…

10 views · #databases #benchmarking #performance
TOM'S HARDWARE

The Core Ultra 7 270K was too good, so Intel scrapped the flagship Core Ultra 9 290K Plus — benchmarks of the 290K prototype find slim 2% faster performance in gaming and applications

The Core Ultra 9 290K Plus just wouldn't have made sense.…

11 views · #cpus #intel #hardware
FIRETHERING

Xiaomi releases MiMo-v2.5 Family weights with strong coding and agent benchmarks

Peking University gives its computer science students a compiler project every semester. Build a complete SysY compiler in Rust including lexer, parser, abstract syntax tree, IR co…

9 views · #technology #artificial intelligence #open source