Search: "agent benchmarks"

FIRETHERING

Xiaomi releases MiMo-v2.5 Family weights with strong coding and agent benchmarks

Peking University gives its computer science students a compiler project every semester. Build a complete SysY compiler in Rust including lexer, parser, abstract syntax tree, IR code generation, assem…

Tue, 28 Apr 2026 12:24:59 GMT · 4 views

ARXIV CS.AI

RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering

Reliable decision support in nuclear engineering requires traceable, domain-grounded knowledge retrieval, yet safety and risk analysis workflows remain hampered by fragmented documentation and halluci…

Wed, 29 Apr 2026 04:04:25 GMT · 5 views

ARXIV CS.AI

Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Multi-Agent Workflows

Long-horizon tool-using tasks sometimes benefit from revisiting earlier subtasks for recovery and exploration, but added multi-agent workflow flexibility can also introduce coordination overhead and s…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV.ORG

Mitigating Belief Inertia via Active Intervention in Embodied Agents

Recent advancements in large language models (LLMs) have enabled agents to tackle complex embodied tasks through environmental interaction. However, these agents still make suboptimal decisions and pe…

Tue, 28 Apr 2026 08:54:13 GMT · 4 views

ARXIV.ORG

FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean

Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textit…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing

Refereeing is vital in sports, where fair, accurate, and explainable decisions are fundamental. While intelligent assistant technologies are being widely adopted in soccer refereeing, current AI-assis…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missi…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framew…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate

The application of large language models (LLMs) in clinical decision support faces significant challenges of "tunnel vision" and diagnostic hallucinations present in their processing unstructured elec…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV CS.AI

Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking

Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV.ORG

A Systematic Approach for Large Language Models Debugging

Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning

Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on lon…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automate…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

SIMON WILLISON'S WEBLOG

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model Big claims from Qwen about their latest open weight model: Qwen3.6-27B delivers flagship-level agentic coding performance, surpassing the previo…

Sun, 26 Apr 2026 22:44:22 GMT · 10 views

Results for "agent benchmarks".

Xiaomi releases MiMo-v2.5 Family weights with strong coding and agent benchmarks

RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering

Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Multi-Agent Workflows

Mitigating Belief Inertia via Active Intervention in Embodied Agents

FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean

SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate

Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking

A Systematic Approach for Large Language Models Debugging

CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning

MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

Or browse by topic