Search: "reasoning models"

ARXIV.ORG

Does Point Cloud Boost Spatial Reasoning of Large Language Models?

3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite some promising results, the role of point clouds in 3D spatial …

Tue, 28 Apr 2026 14:55:00 GMT · 4 views

ARXIV.ORG

Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models

Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the m…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning

Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final ans…

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV.ORG

A systematic evaluation of vision-language models for observational astronomical reasoning tasks

Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities r…

Tue, 28 Apr 2026 04:13:21 GMT · 7 views

Do the "*Claude-4.6-Opus-Reasoning-Distilled" really bring something new to the original models?

No offense to the fine-tune model providers, just curious. IMO the original models were already trained on massive amount of high quality data, so why bother with this fine-tune? Just to make the mode…

Tue, 28 Apr 2026 09:51:28 GMT · 8 views

ARXIV.ORG

Beyond 80/20: High-Entropy Minority Tokens Drive Effective RL for LLM Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well …

Thu, 30 Apr 2026 07:28:16 GMT · 8 views

ARXIV CS.AI

Probing Visual Planning in Image Editing Models

Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

LOCALLLAMA

Why isn’t LLM reasoning done in vector space instead of natural language?

Why don’t LLMs use explicit vector-based reasoning instead of language-based chain-of-thought? What would happen if they did? Most LLM reasoning we see is expressed through language: step-by-step text…

Wed, 29 Apr 2026 01:52:35 GMT · 6 views

VENTUREBEAT

How to build custom reasoning agents with a fraction of the compute

Training AI reasoning models demands resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models or relying…

Wed, 29 Apr 2026 01:52:14 GMT · 13 views

open models keep catching up and the frontier keeps moving. at some point one of those has to stop

a year ago there was a clear tier gap. now i'm less sure, but not in the way i expected. the tasks where open-weight models have genuinely caught up are real: coding assistance, summarization, instruc…

Tue, 28 Apr 2026 07:52:12 GMT · 6 views

ARXIV.ORG

The Power of Power Law: Asymmetry Enables Compositional Reasoning

Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a un…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

A Systematic Approach for Large Language Models Debugging

Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views

ARXIV.ORG

Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis

Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instabili…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see "what is happening" but fail to reason "why it matters". This semantic …

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning

Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on lon…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in …

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal …

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

ARXIV.ORG

Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols

As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations emerged as an architectural bottleneck. We identify and for…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Evaluating whether AI models would sabotage AI safety research

We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluati…

Tue, 28 Apr 2026 04:13:21 GMT · 6 views

FIRETHERING

Granite 4.1: IBM's 8B Model Matching 32B MoE

IBM just released Granite 4.1, a family of open source language models built specifically for enterprise use. Three sizes, Apache 2.0 licensed and trained on 15 trillion tokens with a level of pipelin…

Thu, 30 Apr 2026 11:01:22 GMT · 2 views

ARXIV.ORG

Estimating Black-Box LLM Parameter Counts via Factual Capacity

Closed-source frontier labs do not disclose parameter counts, and the standard alternative -- inference economics -- carries 2×+ uncertainty from hardware, batching, and serving-stack assumptio…

Thu, 30 Apr 2026 09:49:09 GMT · 2 views

ARXIV CS.AI

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questio…

Wed, 29 Apr 2026 04:04:25 GMT · 5 views

ARXIV CS.AI

Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking

Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expre…

Wed, 29 Apr 2026 04:04:25 GMT · 7 views

ARXIV CS.AI

BiTA: Bidirectional Gated Recurrent Unit-Transformer Aggregator in a Temporal Graph Network Framework for Alert Prediction in Computer Networks

Proactive alert prediction in computer networks is critical for mitigating evolving cyber threats and enabling timely defensive actions. Temporal Graph Neural Networks (TGNs) provide a principled fram…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

Parameter-Efficient Fine-Tuning (PEFT) has become the standard for adapting large language models (LLMs). In this work we challenge the wide-spread assumption that parameter efficiency equates memory …

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV CS.AI

See No Evil: Semantic Context-Aware Privacy Risk Detection for AR

Augmented reality (AR) systems pose unique privacy risks due to their continuous capture of visual data. Existing AR privacy frameworks lack semantic understanding of visual content, limiting their ef…

Wed, 29 Apr 2026 04:04:25 GMT · 4 views

ARXIV.ORG

Mitigating Belief Inertia via Active Intervention in Embodied Agents

Recent advancements in large language models (LLMs) have enabled agents to tackle complex embodied tasks through environmental interaction. However, these agents still make suboptimal decisions and pe…

Tue, 28 Apr 2026 08:54:13 GMT · 4 views

ARXIV.ORG

FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean

Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textit…

Tue, 28 Apr 2026 04:13:21 GMT · 5 views

ARXIV.ORG

Don't Make the LLM Read the Graph: Make the Graph Think

We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanab…

Tue, 28 Apr 2026 04:13:21 GMT · 4 views
