60 stories tagged with #reinforcement-learning, in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and s…
InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain
Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading docume…
Explainable Causal Reinforcement Learning for planetary geology survey missions with embodied agent feedback loops
It was 3 AM, and I was staring at a terminal window filled with telemetry data from a simulated Mars rover. The reinforcement learning (RL) agent I had trained overnight had just c…
Why I built the HuggingFace for RL agents — and why RL needs one
If you've ever tried MineRL or OpenAI Five, you know the feeling. The environment…
Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL
Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and re…
UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely op…
Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents
Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools a…
StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions…
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tamperin…
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes …
Polar: Agentic RL on Any Harness at Scale
Reinforcement learning for language agents increasingly depends on custom harnesses that manage long-running context, multi-turn tool use and multi-agent orchestration. However, po…
When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs
Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that…
Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork
In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with un…
Emission-Aware Reinforcement Learning for Sustainable Electric Vehicle Charging and Carbon Dioxide Reduction Under Varying Renewable Penetration
The rapid growth of Electric Vehicle (EV) adoption challenges power distribution networks through peak load spikes, voltage instability, and transformer overloads from uncoordinate…
CoRe-Code: Collaborative Reinforcement Learning for Code Generation
Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without global planning, often leading to locally…
ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents
Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shift…
Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat
As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces s…
Credit Assignment with Resets in Language Model Reasoning
Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all toke…
Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions
In the previous article, we created a reward model. In this article, we will continue exploring how…
If you use NVIDIA Isaac Sim for reinforcement learning, do you use Isaac Lab with it? Just want to get a sense of what the status quo is. [D]
One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents
On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman rho approx 0.73 semantic-behavioral alignm…
Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics
Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smo…
Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning
Variational Quantum Algorithms (VQAs) potentially offer a pathway to practical quantum advantage, but their optimization is heavily hindered by barren plateaus and numerous local m…
Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness
Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-reali…
Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints
How network structure determines function is a fundamental question, and it can be investigated by graph ensembles with precisely controlled structural properties. Canonical approa…
Score-Based One-step MeanFlow Policy Optimization
Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhe…
Curriculum reinforcement learning with measurable task representation learning
In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using t…
Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control
Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-…
Understanding Reinforcement Learning with Human Feedback Part 4: Teaching Models Human Preferences
In the previous article, we explored the part where we collect human preferences. In this article, we…
Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX
Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of chal…
GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents
Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception …
Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty
Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real huma…
FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning
Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants…
Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning
While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present …
ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual cl…
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning
Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains…
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degrad…
Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs
We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimiza…
Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines
Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current…
AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptiv…
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floati…
Systematic Reward Hacking and Prime Sprints
We release tunable RL templates that demonstrate reward hacking at 1B scale and introduce Prime Sprints, an open-access program with sponsored runs for community research.…
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs.…
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed…
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once disc…
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a mo…
Memory-Augmented Reinforcement Learning Agent for CAD Generation
Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large lang…
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors requir…
ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning
Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This i…
SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs
Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, …
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are …
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
Multi-agent language systems are often built as hand-designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose N…
Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One prima…
From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning
This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo samp…
Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning
Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism t…
Self-supervised Hierarchical Visual Reasoning with World Model
3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are ess…
SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation
Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short …
LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning
Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information excha…
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
Current reinforcement learning (RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks,…
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on…