Reinforcement Learning coverage.

24 views · Mon, 13 Jul 2026 04:00:00 GMT

Multimodal Reward Hacking in Reinforcement Learning

Reinforcement learning (RL) is increasingly used to align multimodal large language models (MLLMs), but higher rewards do not always imply better task performance. This risk is amp…

#multimodal #reward #hacking

40 views · Wed, 03 Jun 2026 04:00:00 GMT

InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading docume…

45 views · Wed, 03 Jun 2026 04:00:00 GMT

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and s…

35 views · Fri, 29 May 2026 22:39:02 GMT

Explainable Causal Reinforcement Learning for planetary geology survey missions with embodied agent feedback loops

It was 3 AM, and I was staring at a terminal window filled with telemetry data from a simulated Mars rover. The reinforcement learning (RL) agent I had trained overnight had just c…

#ai #planetary science

24 views · Thu, 28 May 2026 22:05:35 GMT

Why I built the HuggingFace for RL agents — and why RL needs one

If you've ever tried MineRL or OpenAI Five, you know the feeling. The environment…

#technology #ai

31 views · Wed, 27 May 2026 04:00:00 GMT

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes …

40 views · Wed, 27 May 2026 04:00:00 GMT

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tamperin…

#artificial intelligence #machine learning #bias

35 views · Wed, 27 May 2026 04:00:00 GMT

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions…

44 views · Wed, 27 May 2026 04:00:00 GMT

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools a…

#artificial intelligence #medical

42 views · Wed, 27 May 2026 04:00:00 GMT

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely op…

#artificial intelligence #multi-agent systems

39 views · Wed, 27 May 2026 04:00:00 GMT

Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and re…

36 views · Tue, 26 May 2026 15:58:26 GMT

ARXIV.ORG

Polar: Agentic RL on Any Harness at Scale

Reinforcement learning for language agents increasingly depends on custom harnesses that manage long-running context, multi-turn tool use and multi-agent orchestration. However, po…

#machine learning #software engineering

25 views · Tue, 26 May 2026 04:00:00 GMT

Credit Assignment with Resets in Language Model Reasoning

Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all toke…

#artificial intelligence #language models

32 views · Tue, 26 May 2026 04:00:00 GMT

Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces s…

#artificial intelligence #air combat

32 views · Tue, 26 May 2026 04:00:00 GMT

ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shift…

#artificial intelligence #task scheduling

27 views · Tue, 26 May 2026 04:00:00 GMT

CoRe-Code: Collaborative Reinforcement Learning for Code Generation

Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without global planning, often leading to locally…

#artificial intelligence #code generation

44 views · Tue, 26 May 2026 04:00:00 GMT

Emission-Aware Reinforcement Learning for Sustainable Electric Vehicle Charging and Carbon Dioxide Reduction Under Varying Renewable Penetration

The rapid growth of Electric Vehicle (EV) adoption challenges power distribution networks through peak load spikes, voltage instability, and transformer overloads from uncoordinate…

#artificial intelligence #electric vehicles #renewable energy

34 views · Tue, 26 May 2026 04:00:00 GMT

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT) - where coordination with un…

#artificial intelligence #teamwork

29 views · Tue, 26 May 2026 04:00:00 GMT

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that…

37 views · Mon, 25 May 2026 19:15:00 GMT

Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions

In the previous article, we created a reward model. In this article, we will continue exploring how…

#ai #machinelearning #reinforcementlearning

R/MACHINELEARNING

If you use NVIDIA Isaac Sim for reinforcement learning, do you use Isaac Lab with it? Just want to get a sense of what the status quo is. [D]

32 views · Mon, 25 May 2026 07:26:30 GMT

31 views · Mon, 25 May 2026 04:00:00 GMT

Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control

Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-…

34 views · Mon, 25 May 2026 04:00:00 GMT

Curriculum reinforcement learning with measurable task representation learning

In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using t…

39 views · Mon, 25 May 2026 04:00:00 GMT

Score-Based One-step MeanFlow Policy Optimization

Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhe…

32 views · Mon, 25 May 2026 04:00:00 GMT

Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints

How network structure determines function is a fundamental question, and it can be investigated by graph ensembles with precisely controlled structural properties. Canonical approa…

#machine learning #graph theory

35 views · Mon, 25 May 2026 04:00:00 GMT

Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-reali…

38 views · Mon, 25 May 2026 04:00:00 GMT

Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning

Variational Quantum Algorithms (VQAs) potentially offer a pathway to practical quantum advantage, but their optimization is heavily hindered by barren plateaus and numerous local m…

#quantum physics #artificial intelligence #machine learning

24 views · Mon, 25 May 2026 04:00:00 GMT

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smo…

33 views · Mon, 25 May 2026 04:00:00 GMT

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman ρ ≈ 0.73 semantic-behavioral alignm…

#artificial intelligence #gaming

32 views · Sat, 23 May 2026 19:25:30 GMT

Understanding Reinforcement Learning with Human Feedback Part 4: Teaching Models Human Preferences

In the previous article, we explored the part where we collect human preferences. In this article, we…

#ai #machinelearning #reinforcementlearning

29 views · Fri, 22 May 2026 04:00:00 GMT

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floati…

#machine learning #language models

30 views · Fri, 22 May 2026 04:00:00 GMT

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptiv…

33 views · Fri, 22 May 2026 04:00:00 GMT

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current…

#machine learning #manufacturing #aeroengines

31 views · Fri, 22 May 2026 04:00:00 GMT

Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimiza…

24 views · Fri, 22 May 2026 04:00:00 GMT

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degrad…

#machine learning #artificial intelligence #quantization

33 views · Fri, 22 May 2026 04:00:00 GMT

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains…

#computer vision #artificial intelligence #machine learning

32 views · Fri, 22 May 2026 04:00:00 GMT

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual cl…

#machine learning #artificial intelligence #computer vision

26 views · Fri, 22 May 2026 04:00:00 GMT

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present …

26 views · Fri, 22 May 2026 04:00:00 GMT

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants…

31 views · Fri, 22 May 2026 04:00:00 GMT

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real huma…

#machine learning #autonomous driving #pedestrian safety

42 views · Fri, 22 May 2026 04:00:00 GMT

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception …

38 views · Fri, 22 May 2026 04:00:00 GMT

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of chal…

#artificial intelligence #machine learning #mahjong

PRIMEINTELLECT

Systematic Reward Hacking and Prime Sprints

We release tunable RL templates that demonstrate reward hacking at 1B scale and introduce Prime Sprints, an open-access program with sponsored runs for community research.…

42 views · Thu, 21 May 2026 07:40:15 GMT

#reward hacking #research

VMAX

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs.…

35 views · Wed, 20 May 2026 21:11:55 GMT

30 views · Wed, 20 May 2026 04:00:00 GMT

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, …

35 views · Wed, 20 May 2026 04:00:00 GMT

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This i…

#machine learning #artificial intelligence #scientific reasoning

35 views · Wed, 20 May 2026 04:00:00 GMT

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors requir…

30 views · Wed, 20 May 2026 04:00:00 GMT

Memory-Augmented Reinforcement Learning Agent for CAD Generation

Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large lang…

#artificial intelligence #computer-aided design

26 views · Wed, 20 May 2026 04:00:00 GMT

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a mo…

#artificial intelligence #machine learning #security

23 views · Wed, 20 May 2026 04:00:00 GMT

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once disc…

30 views · Wed, 20 May 2026 04:00:00 GMT

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed…

28 views · Tue, 19 May 2026 04:00:00 GMT

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

Current reinforcement learning (RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks,…

#artificial intelligence #open-ended generation

41 views · Tue, 19 May 2026 04:00:00 GMT

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information excha…

#artificial intelligence #machine learning #multiagent systems

25 views · Tue, 19 May 2026 04:00:00 GMT

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short …

#artificial intelligence #recommendation systems

31 views · Tue, 19 May 2026 04:00:00 GMT

Self-supervised Hierarchical Visual Reasoning with World Model

3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are ess…

#artificial intelligence #visual reasoning

46 views · Tue, 19 May 2026 04:00:00 GMT

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism t…

#artificial intelligence #machine learning #multiagent systems

31 views · Tue, 19 May 2026 04:00:00 GMT

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo samp…

33 views · Tue, 19 May 2026 04:00:00 GMT

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One prima…

#artificial intelligence #machine learning #image generation