AI Research & ML Papers · Page 4

arXiv cs.AI

When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning

The paper titled 'When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning' explores the balance between…

6/3/2026 · 3 min read · 45 views

arXiv cs.AI

Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

The paper introduces derivation graphs to enhance the understanding of do-calculus reasoning. These graphs help in…

6/3/2026 · 2 min read · 59 views

arXiv cs.AI

Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs

The paper introduces Code-on-Graph (CoG), a new framework for integrating Large Language Models (LLMs) with Knowledge…

6/3/2026 · 3 min read · 51 views

arXiv cs.AI

Dynamic Objective Selection with Safeguards and LLM Oversight for Financial Decision-Making

The paper introduces Dynamic Objective Selection with Safeguards (DOSS) for financial decision-making. DOSS aims to…

6/3/2026 · 3 min read · 48 views

arXiv cs.AI

SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

The article introduces SkillPyramid, a framework designed to enhance the skill consolidation of self-evolving AI…

6/3/2026 · 2 min read · 52 views

arXiv cs.AI

The DeepSpeak-Agentic Dataset

The DeepSpeak-Agentic dataset consists of over 37 hours of semi-structured conversations between humans and AI agents.…

6/3/2026 · 2 min read · 31 views

arXiv cs.AI

EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

EvoDrive is a new framework designed for generating safety-critical scenarios in autonomous driving systems. It…

6/3/2026 · 3 min read · 39 views

arXiv cs.AI

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

The paper introduces ChemCoTBench-V2, a benchmark designed for evaluating chemical reasoning in large language models.…

6/3/2026 · 3 min read · 39 views

arXiv cs.AI

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

The paper introduces NovelAPIBench, a dynamic benchmark designed to evaluate large language models' ability to use…

6/3/2026 · 3 min read · 39 views

arXiv cs.AI

Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

The paper discusses advancements in propositional defeasible standpoint logic, focusing on non-monotonic entailment.…

6/3/2026 · 3 min read · 36 views

arXiv cs.AI

Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

A recent study investigates gender-dependent disparities in medical triage recommendations made by large language…

6/3/2026 · 3 min read · 37 views

arXiv cs.AI

TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning

The paper introduces TSQAgent, a framework designed to improve the assessment of time series data quality using large…

6/3/2026 · 3 min read · 38 views

arXiv cs.AI

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

The paper introduces a framework to improve instruction following in Large Reasoning Models (LRMs) by addressing the…

6/3/2026 · 3 min read · 43 views

arXiv cs.AI

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

The paper discusses a new approach to optimize coding agents by reducing input-token costs. It introduces a middleware…

6/3/2026 · 3 min read · 39 views

arXiv cs.AI

From Prompt to Service: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds

The paper introduces an SLM-based Agent Orchestration Gateway designed for AI-driven virtual worlds. This gateway…

6/3/2026 · 3 min read · 37 views

arXiv cs.AI

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

The paper introduces SAGE, a framework for evaluating socialized evolution in agent ecosystems. It compares two…

6/3/2026 · 3 min read · 33 views

arXiv cs.AI

Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

The paper presents a new compositional authorization framework for managing delegation and scope in agentic AI…

6/3/2026 · 3 min read · 46 views

arXiv cs.AI

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

The paper introduces ThoughtFold, a framework designed to improve the efficiency of Large Reasoning Models (LRMs) by…

6/3/2026 · 3 min read · 35 views

arXiv cs.AI

A formal definition and meta-model for a machine theory of mind

The paper presents a formal definition and meta-model for the Machine Theory of Mind. It integrates insights from…

6/3/2026 · 2 min read · 41 views

arXiv cs.AI

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

StepFinder is a new framework designed for failure attribution in multi-agent systems. It aims to improve the…

6/3/2026 · 3 min read · 42 views

arXiv cs.AI

DMF: A Deterministic Memory Framework for Conversational AI Agents

The Deterministic Memory Framework (DMF) aims to enhance memory systems for conversational AI agents. It replaces…

6/3/2026 · 3 min read · 43 views

arXiv cs.AI

What Makes Interaction Trajectories Effective for Training Terminal Agents?

The paper investigates the effectiveness of interaction trajectories in training terminal agents. It reveals that…

6/3/2026 · 3 min read · 38 views

arXiv cs.AI

CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

The article introduces CP-Agent, a multimodal large language model designed for cellular morphological profiling under…

6/3/2026 · 3 min read · 48 views

arXiv cs.AI

InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

The paper introduces InfoMem, a new reward mechanism designed for training long-context memory agents in artificial…

6/3/2026 · 3 min read · 39 views

arXiv cs.AI

The Violation Situation Pattern: A Knowledge-Graph Pattern for Compliance Violations

The Violation Situation Pattern (VSP) is a new knowledge-graph pattern designed to improve compliance violation…

6/3/2026 · 3 min read · 37 views

arXiv cs.AI

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

The article discusses the challenges of benchmark auditing in artificial intelligence, particularly regarding…

6/3/2026 · 3 min read · 44 views

arXiv cs.AI

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

The article discusses LEAP, a new framework designed to enhance the capabilities of Large Language Models (LLMs) in…

6/3/2026 · 3 min read · 40 views

arXiv cs.AI

A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

The paper presents a negative result regarding cross-model activation transfer in a multi-hop reasoning setting using…

6/3/2026 · 2 min read · 39 views

arXiv cs.AI

Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering

The article discusses a novel approach for enhancing Visual Question Answering (VQA) by distilling rules from Large…

6/3/2026 · 3 min read · 36 views

arXiv cs.AI

Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

The study investigates whether real-world datasets contain natural experiments, which are implicit interventions…

6/3/2026 · 3 min read · 45 views

arXiv cs.AI

Solipsistic Superintelligence is Unlikely to be Cooperative

A recent paper argues that superintelligence developed from a solipsistic approach to AI design is unlikely to be…

6/3/2026 · 3 min read · 49 views

arXiv cs.AI

Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

The article discusses a new framework called the Pre-Reasoning Perception Framework (PRPF) designed to enhance…

6/3/2026 · 3 min read · 40 views

arXiv cs.AI

Effect of Demographic Bias on Skin Lesion Classification

The study investigates the impact of demographic bias on skin lesion classification using ResNet-based models. It…

6/3/2026 · 3 min read · 47 views

arXiv cs.AI

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

MedCUA-Bench is a newly introduced benchmark designed specifically for clinical computer-use agents. It aims to…

6/3/2026 · 3 min read · 36 views

arXiv cs.AI

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

The article introduces ClinicalMC, a benchmark designed for evaluating large language models in multi-course clinical…

6/3/2026 · 3 min read · 41 views

arXiv cs.AI

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

The article introduces GTBench, a benchmark designed to evaluate large language models (LLMs) as mathematical research…

6/3/2026 · 3 min read · 37 views

arXiv cs.AI

Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

The paper introduces a new framework called Think-Before-Speak (TBS) for multi-agent social simulation. TBS separates…

6/3/2026 · 3 min read · 44 views

arXiv cs.AI

Uncertainty-Aware Clarification in LLM Agents with Information Gain

The article discusses a new framework for Large Language Model (LLM) agents that aims to improve their performance in…

6/3/2026 · 2 min read · 43 views

arXiv cs.AI

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

EvoTrainer is a new autonomous training framework designed for co-evolving LLM policies and training harnesses. It…

6/3/2026 · 2 min read · 43 views

arXiv cs.AI

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

The paper titled 'DeskCraft' introduces a new benchmark for evaluating desktop agents in professional workflows that…

6/3/2026 · 3 min read · 38 views

arXiv cs.AI

From Long News to Accurate Forecast: Importance-Aware Fusion and PRM-Guided Reflection for Time Series Forecasting

A new framework has been developed to enhance time series forecasting by integrating news articles. This approach…

6/3/2026 · 3 min read · 50 views

arXiv cs.AI

Decomposing how prompting steers behavior

The paper titled 'Decomposing how prompting steers behavior' explores how prompting influences the internal…

6/3/2026 · 3 min read · 49 views

arXiv cs.AI

The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

The article discusses a new approach to budget allocation for Large Language Models (LLMs) based on economic…

6/3/2026 · 2 min read · 42 views

arXiv cs.AI

DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

The paper introduces DeltaMem, a framework designed to enhance memory management in Large Language Model (LLM) agents.…

6/3/2026 · 3 min read · 47 views

arXiv cs.AI

CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection

The article discusses a new framework called CORE, which stands for Conflict-Oriented Reasoning, designed to enhance…

6/3/2026 · 3 min read · 49 views

arXiv cs.AI

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

The paper introduces SkillDAG, a novel approach for selecting skills in large language models (LLMs) by modeling…

6/3/2026 · 3 min read · 40 views

arXiv cs.AI

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

The paper introduces ToolGate, a system designed to improve the efficiency of tool-augmented vision-language agents.…

6/3/2026 · 3 min read · 38 views

arXiv cs.AI

RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

The paper introduces RelGT-AC, a new model designed for autocomplete tasks in relational databases. It enhances the…

6/3/2026 · 3 min read · 46 views

arXiv cs.AI

TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

TriEval is a new pipeline designed to assess bias, toxicity, and truthfulness in large language models (LLMs)…

6/3/2026 · 3 min read · 48 views

arXiv cs.AI

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

The article discusses AuditFlow, a new framework designed for structured financial reporting verification. It utilizes…

6/3/2026 · 3 min read · 40 views
