WeSearch
TAG · #EVALUATION

Evaluation coverage.

Every story in the WeSearch catalog tagged with #evaluation, in chronological order, with view counts. Subscribe to the per-tag RSS feed to follow this topic in the reader of your choice.

40 stories tagged with #evaluation, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.


RELATED TAGS
#ai (9) · #ml (8) · #llm-evaluation (7) · #model-evaluation (5) · #natural-language-processing (5) · #ai-evaluation (5) · #ai-safety (3) · #anthropic (3) · #benchmarking (3) · #swe-bench (2) · #open-source (2) · #vision-language-models (2)
ARXIV.ORG

Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the ability to engage in behaviors that serve their own objectives, a class of risks w…

3 views · #artificial intelligence · #ai safety · #machine learning
DEV.TO (TOP)

Tenacious-Bench: Building a Sales Domain Evaluation Benchmark When No Dataset Exists

The Gap: General-purpose LLM benchmarks like τ²-Bench evaluate task completion in retail…

5 views · #machine learning · #llm evaluation · #sales automation
NEWS

Fifa ramps up efforts to sell luxury World Cup hospitality tickets after revenue re-evaluation

5 views
GITHUB

LLM-eval-kit: Distributed LLM evaluation framework (v0.3.0)

A lightweight, modular toolkit for evaluating and benchmarking Large Language Models with focus on reasoning quality, consistency, and error detection. - benmeryem-tech/llm-eval-ki…

11 views · #llm evaluation · #framework · #open source
TOWARDS DATA SCIENCE

Why Powerful Machine Learning Is Deceptively Easy

Or why what appears powerful can be methodologically fragile…

6 views · #machine learning · #data science · #methodology
DEV COMMUNITY

Skills Without Evals Are Just Markdown and Hope

TL;DR. I built an Anthropic Agent Skill for @ngrx/signals and ran it through the full eval pipeline: capability A/B benchmarks, token and wall-time accounting, and a description-op…

3 views · #ai · #angular · #ngrx
YAHOO SPORTS

NFL analyst 'blown away' by Commanders' 2026 NFL Draft haul

One draft analyst said he "loved" Washington's 2026 NFL Draft haul.…

14 views · #nfl draft · #washington commanders · #draft analysis
SIMON WILLISON'S WEBLOG

Our evaluation of OpenAI's GPT-5.5 cyber capabilities

The UK's AI Security Institute previously evaluated Claude Mythos: now they've evaluated GPT-5.5 for finding security vulnera…

16 views
DETAIL

Benchmarking a Bug Scanner

We ran a tournament pitting Detail's findings against thousands of comments from code review bots.…

5 views · #bug scanner · #code review · #software quality
CONTRALABS

The Human Creativity Benchmark – Evaluating Generative AI in Creative Work

The frontier human data and evaluation lab for creative AI. 1.5M+ verified creative experts setting the benchmark for style, tone, and taste with next-gen creative tools.…

10 views · #ai evaluation · #creative work · #generative ai
JOE CARLSMITH

Restraining AI development for the sake of safety

My take on slowing down AI.…

8 views · #ai safety · #capability restraint · #alignment problem
DEV.TO (TOP)

Building AI Evaluation Pipelines: Automating LLM Testing from Dataset to CI/CD

Part 2 of a series on testing AI systems in production. In Part 1, we explored why testing AI…

8 views · #ai · #llm · #machinelearning
JUDGEKIT

JudgeKit: Generate LLM-as-Judge prompts grounded in published research

Free LLM-as-a-Judge prompt generator. Build G-Eval, Prometheus, and binary judge prompts grounded in 30+ peer-reviewed papers. No signup.…

7 views · #llm evaluation · #prompt engineering · #ai research
WASHINGTON EXAMINER

Iranian currency hits all-time low amid US blockade

Iran's rial, which has already fared poorly over the past year, has entered a precipitous decline due to the U.S. naval blockade.…

6 views · #iranian rial · #economic crisis · #us blockade
REAL CLEAR DEFENSE

Brain Function Evaluations To Be Part of Marine Health Records

As the U.S. Marine Corps faces congressionally mandated deadlines to evaluate the brain injury impacts of weapons blasts on the force and start implementing mitigation measures, ch…

8 views
SCIENCEDAILY

This AI knew the answers but didn’t understand the questions

For decades, psychologists have debated whether the human mind can be explained by one unified theory or must be broken into separate parts like memory and attention. A recent AI m…

8 views · #artificial intelligence · #cognitive science · #language understanding
HUGGING FACE - BLOG

AI evals are becoming the new compute bottleneck

A Blog post by EvalEval Coalition on Hugging Face…

30 views · #ai evaluation · #compute costs · #agent benchmarks
YCOMBINATOR

Ask HN: How do you differentiate with AI coding interviews?

26 views · #ai coding tools · #technical interviews · #candidate evaluation
ARXIV CS.AI

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring …

8 views · #artificial intelligence · #clinical evaluation · #rubrics
ARXIV CS.AI

ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection

Time-series anomaly detectors are commonly compared on workstation-class hardware under unconstrained execution. In-vehicle monitoring, however, requires predictable latency and st…

9 views · #machine learning · #anomaly detection · #automotive systems
ARXIV CS.AI

StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems

We introduce StratRAG, an open-source retrieval evaluation dataset for benchmarking Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning tasks under realistic, noisy…

10 views · #retrieval-augmented generation · #multi-hop retrieval · #information retrieval
ARXIV CS.AI

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus …

9 views · #handwritten math ocr · #vision-language models · #educational ai
YAHOO SPORTS

NFL Draft power rankings

Everybody’s a winner on NFL Draft night, just ask them. I’m not here to do that. I’m ranking teams on what they actually did with the board in front of them, relative to the rest o…

7 views · #nfl draft · #power rankings · #2026 nfl draft
REDDIT

Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation

10 views
ARXIV.ORG

A Systematic Approach for Large Language Models Debugging

Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging …

6 views · #large language models · #debugging · #artificial intelligence
ARXIV.ORG

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a…

7 views · #artificial intelligence · #llm evaluation · #bias mitigation
ARXIV.ORG

Do Transaction-Level and Actor-Level AML Queues Agree? An Empirical Evaluation of Granularity Effects on the Elliptic++ Graph

Graph-based anti-money laundering (AML) systems on blockchain networks can score suspicious activity at two granularity levels -- transactions or actor addresses -- yet compliance …

7 views · #anti-money laundering · #graph-based detection · #blockchain analytics
ARXIV.ORG

Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended…

7 views · #legal reasoning · #large language models · #japanese bar exam
ARXIV.ORG

Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs

Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale heal…

9 views · #hospitalization forecasting · #large language models · #decision support
ARXIV.ORG

An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness

Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data bec…

6 views · #artificial intelligence · #healthcare · #machine learning
ARXIV.ORG

Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shi…

7 views · #artificial intelligence · #runtime evaluation · #failure analysis
ARXIV.ORG

CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of fin…

8 views · #ct report generation · #factual consistency · #benchmarking
ARXIV.ORG

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuo…

7 views · #ai evaluation · #continuous monitoring · #deployment metrics
ARXIV.ORG

Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Ju…

8 views · #artificial intelligence · #sustainable travel · #llm evaluation
ARXIV.ORG

Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus

Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundr…

8 views · #artificial intelligence · #clinical decision support · #multiple myeloma
ARXIV.ORG

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, th…

6 views · #artificial intelligence · #machine learning · #natural language processing
ARXIV.ORG

A systematic evaluation of vision-language models for observational astronomical reasoning tasks

Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across …

8 views · #astronomy · #vision-language models · #artificial intelligence
ARXIV.ORG

Evaluating whether AI models would sabotage AI safety research

We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two co…

7 views · #ai safety · #model evaluation · #sabotage behavior
EUGENEYAN.COM

Product Evals in Three Simple Steps

Label some data, align LLM-evaluators, and run the eval harness with each change.…

4 views · #product evaluation · #llm evaluation · #machine learning
OPENAI

SWE-bench Verified no longer measures frontier coding capabilities

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.…

13 views · #swe-bench · #ai benchmarking · #code generation