Search: "llm evaluation" — WeSearch Press

ARXIV CS.AI

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per sc…

Wed, 29 Apr 2026 04:04:25 GMT · 7 views

ARXIV.ORG

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empir…

Tue, 28 Apr 2026 04:13:21 GMT · 7 views

ARXIV.ORG

Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in …

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV.ORG

Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable…

Tue, 28 Apr 2026 04:13:21 GMT · 7 views

ARXIV.ORG

Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs

Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (…

Tue, 28 Apr 2026 04:13:21 GMT · 8 views

DEV.TO (TOP)

Building AI Evaluation Pipelines: Automating LLM Testing from Dataset to CI/CD

Part 2 of a series on testing AI systems in production. In Part 1, we explored why testing AI…

Thu, 30 Apr 2026 14:39:43 GMT · 7 views

ARXIV CS.AI

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expre…

Wed, 29 Apr 2026 04:04:25 GMT · 8 views

ARXIV CS.AI

IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review

Scientific research relies on accurate information retrieval from literature to support analytical decisions. In this work, we introduce a new task, INformation reTRieval through literAture reVIEW (In…

Wed, 29 Apr 2026 04:04:25 GMT · 5 views

NOWARP

Compiler Testing – Part 1: Coverage-Guided Fuzzing with Grammars and LLMs

Compiler fuzzing for small languages is a specific problem — few optimization passes, tiny corpora, thin docs. This post covers how coverage-guided fuzzing and LLM-assisted tooling adapt to smart-cont…

Wed, 29 Apr 2026 00:04:22 GMT · 7 views

ARXIV.ORG

Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach

Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear …

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

LEGO: An LLM Skill-Based Front-End Design Generation Platform

Existing LLM-based EDA agents are often isolated task-specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a un…

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV.ORG

GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

Autonomous multi-agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is groun…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus

Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous…

Tue, 28 Apr 2026 04:13:21 GMT · 8 views

ARXIV.ORG

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

OpenGame: Open Agentic Coding for Games

Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across ma…

Wed, 29 Apr 2026 05:34:25 GMT · 8 views

ARXIV CS.AI

PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging

Multimodal Large Language Models (MLLMs) rely on multimodal pre-training over diverse data sources, where different datasets often induce complementary cross-modal alignment capabilities. Model mergin…

Wed, 29 Apr 2026 04:04:25 GMT · 5 views

ARXIV CS.AI

Structure Guided Retrieval-Augmented Generation for Factual Queries

Retrieval-Augmented Generation (RAG) has been proposed to mitigate hallucinations in large language models (LLMs), where generated outputs may be factually incorrect. However, existing RAG approaches …

Wed, 29 Apr 2026 04:04:25 GMT · 5 views

ARXIV.ORG

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (…

Wed, 29 Apr 2026 01:52:36 GMT · 16 views

ARXIV.ORG

A Systematic Approach for Large Language Models Debugging

Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing

Refereeing is vital in sports, where fair, accurate, and explainable decisions are fundamental. While intelligent assistant technologies are being widely adopted in soccer refereeing, current AI-assis…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automate…

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV.ORG

FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

Financial AI systems must produce answers grounded in specific regulatory filings, yet current LLMs fabricate metrics, invent citations, and miscalculate derived quantities. These errors carry direct …

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress

As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insufficient to characterize system r…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Adaptive ToR: Complexity-Aware Tree-Based Retrieval for Pareto-Optimal Multi-Intent NLU

Multi-intent natural language understanding requires retrieval systems that simultaneously achieve high accuracy and computational efficiency, yet existing approaches apply either uniform single-step …

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Evaluating whether AI models would sabotage AI safety research

We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluati…

Tue, 28 Apr 2026 04:13:21 GMT · 7 views
