#benchmarking — Tagged Stories

All 60 stories in the WeSearch catalog tagged with #benchmarking, listed in publish-time order with view counts. Tag pages update as new stories are ingested. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.


RELATED TAGS

#ai (49) #technology (12) #ml (8) #programming (5) #api (5) #security (4) #typescript (3) #hardware (3) #gpu (2) #python (2) #performance (2) #rust (2)

GITHUB

Benchmarking Opus 5 on SlopCodeBench

Contribute to humanlayer/advanced-context-engineering-for-coding-agents development by creating an account on GitHub.…

7 views · Mon, 27 Jul 2026 22:37:52 GMT

#opus #slopcodebench

ARXIV.ORG

Benchmarking Confidential GPU Inference on NVIDIA H100 under Intel TDX

arXiv:2607.19353v1 Announce Type: new Abstract: Confidential computing is becoming a practical deployment requirement for AI inference workloads that process sensitive inputs or pr…

Benchmarking coverage.

Benchmarking Opus 5 on SlopCodeBench

Benchmarking Confidential GPU Inference on NVIDIA H100 under Intel TDX

OmniMapBench: Benchmarking Visual-Centric Reasoning on Diverse Map Documents

REFORGE: A Method for Benchmarking LLMs' Reverse Engineering Capabilities in Decompiled Binary Function Naming

LongMedBench: Benchmarking Medical Agents for Long-Horizon Clinical Decision-Making

OpenFinGym: A Verifiable Multi-Task Gym Environment for Evaluating Quant Agents

Will It Mythos?

Today Marks 22 Years Of Phoronix For Linux Hardware Testing & Benchmarking

Show HN: Hive Trust – Ed25519-signed benchmarks for every AI inference primitive

Cross Cloud A2A Agent Benchmarking

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

We Benchmarked Our Open Source Memory Tool Against a Microsoft Research Paper

Benchmarking time-series databases for ecommerce infrastructure monitoring

From Benchmarketing to Benchmaxxing

CVE-Bench: testing LLM agents on real-world vulnerability patches

BenchBench

CostBench: an open benchmark for data warehouse cost-performance

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Benchmarking LLMs for Web Tasks

Nvidia offers restricted access to Vera CPU in first round of Linux benchmarks - 88-core monster competes with or beats Epyc and Xeon in selected tests

noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

Constraint acquisition needs better benchmarks

Revisiting Benchmarking- Building a Rust A2A Agent

[std-proposals] Benchmarking using the standard library as a module

How I Slashed My AI API Bill by 92% in 2026 — A Cost Optimizer's Speed Benchmark Guide

Quick Tip: Benchmarking Multimodal APIs in Under 10 Minutes

AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Show HN: AgentToolBench-Code – security benchmark for AI coding agents

Benchmarking LLM Structured Outputs

Rust Concepts: Serde, Error Handling, Benchmarking & Workspaces (Part 6)

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

FastKernels: Benchmarking GPU Kernel Generation in Production

Design and Report Benchmarks for Knowledge Work

MetalBench – Benchmark for Apple Silicon's Metal Shading Lang

Quick Tip: Benchmarking Multimodal APIs in Under 10 Minutes

Evaluating Spec CPU2026

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

How I Slashed My AI API Bill by 92% in 2026 — A Cost Optimizer's Speed Benchmark Guide

MLX Vulkan Back End

InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents

Benchmarking AWS Nova on Log Data: How It Compares to ChatGPT-3.5
