WeSearch
TAG · #REINFORCEMENT-LEARNING

Reinforcement Learning coverage.

Every story in the WeSearch catalog tagged #reinforcement-learning, listed chronologically with view counts. Subscribe to the per-tag RSS feed to follow this topic in your reader of choice.

60 stories are tagged #reinforcement-learning, listed in publish-time order across the WeSearch catalog. Tag pages update as new stories are ingested.

RELATED TAGS
#ai (52) · #ml (38) · #language-models (3) · #superintelligence (1) · #tech-startups (1) · #open-source (1) · #large-language-models (1) · #continual-learning (1) · #self-distillation (1) · #idan-shenfeld (1) · #mehul-damani (1) · #jonas-h-botter (1)
ARXIV CS.AI

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and s…

15 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading docume…

16 views ·
#artificial intelligence#machine learning
DEV.TO (TOP)

Explainable Causal Reinforcement Learning for planetary geology survey missions with embodied agent feedback loops

It was 3 AM, and I was staring at a terminal window filled with telemetry data from a simulated Mars rover. The reinforcement learning (RL) agent I had trained overnight had just c…

14 views ·
#ai#planetary science
DEV.TO (TOP)

Why I built the HuggingFace for RL agents — and why RL needs one

If you've ever tried MineRL or OpenAI Five, you know the feeling. The environment…

9 views ·
#technology#ai
ARXIV CS.AI

Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and re…

22 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely op…

21 views ·
#artificial intelligence#multi-agent systems
ARXIV CS.AI

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools a…

20 views ·
#artificial intelligence#medical
ARXIV CS.AI

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions…

17 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tamperin…

19 views ·
#artificial intelligence#machine learning#bias
ARXIV CS.AI

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes …

17 views ·
#machine learning#artificial intelligence
ARXIV.ORG

Polar: Agentic RL on Any Harness at Scale

Reinforcement learning for language agents increasingly depends on custom harnesses that manage long-running context, multi-turn tool use and multi-agent orchestration. However, po…

15 views ·
#machine learning#software engineering
ARXIV CS.AI

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that…

14 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with un…

16 views ·
#artificial intelligence#teamwork
ARXIV CS.AI

Emission-Aware Reinforcement Learning for Sustainable Electric Vehicle Charging and Carbon Dioxide Reduction Under Varying Renewable Penetration

The rapid growth of Electric Vehicle (EV) adoption challenges power distribution networks through peak load spikes, voltage instability, and transformer overloads from uncoordinate…

20 views ·
#artificial intelligence#electric vehicles#renewable energy
ARXIV CS.AI

CoRe-Code: Collaborative Reinforcement Learning for Code Generation

Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without global planning, often leading to locally…

12 views ·
#artificial intelligence#code generation
ARXIV CS.AI

ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shift…

17 views ·
#artificial intelligence#task scheduling
ARXIV CS.AI

Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces s…

12 views ·
#artificial intelligence#air combat
ARXIV CS.AI

Credit Assignment with Resets in Language Model Reasoning

Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all toke…

14 views ·
#artificial intelligence#language models
DEV.TO (TOP)

Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions

In the previous article, we created a reward model. In this article, we will continue exploring how…

21 views ·
#ai#machinelearning#reinforcementlearning
R/MACHINELEARNING

If you use NVIDIA Isaac Sim for reinforcement learning, do you use Isaac Lab with it? Just want to get a sense of what the status quo is. [D]

17 views ·
ARXIV CS.AI

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman ρ ≈ 0.73 semantic-behavioral alignm…

16 views ·
#artificial intelligence#gaming
ARXIV CS.AI

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smo…

9 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning

Variational Quantum Algorithms (VQAs) potentially offer a pathway to practical quantum advantage, but their optimization is heavily hindered by barren plateaus and numerous local m…

17 views ·
#quantum physics#artificial intelligence#machine learning
ARXIV CS.AI

Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-reali…

15 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints

How network structure determines function is a fundamental question, and it can be investigated by graph ensembles with precisely controlled structural properties. Canonical approa…

12 views ·
#machine learning#graph theory
ARXIV CS.AI

Score-Based One-step MeanFlow Policy Optimization

Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhe…

28 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

Curriculum reinforcement learning with measurable task representation learning

In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using t…

14 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control

Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-…

12 views ·
#machine learning#artificial intelligence
DEV.TO (TOP)

Understanding Reinforcement Learning with Human Feedback Part 4: Teaching Models Human Preferences

In the previous article, we explored the part where we collect human preferences. In this article, we…

12 views ·
#ai#machinelearning#reinforcementlearning
ARXIV CS.AI

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of chal…

17 views ·
#artificial intelligence#machine learning#mahjong
ARXIV CS.AI

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception …

17 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real huma…

17 views ·
#machine learning#autonomous driving#pedestrian safety
ARXIV CS.AI

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants…

15 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present …

13 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual cl…

14 views ·
#machine learning#artificial intelligence#computer vision
ARXIV CS.AI

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains…

14 views ·
#computer vision#artificial intelligence#machine learning
ARXIV CS.AI

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degrad…

11 views ·
#machine learning#artificial intelligence#quantization
ARXIV CS.AI

Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimiza…

13 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current…

16 views ·
#machine learning#manufacturing#aeroengines
ARXIV CS.AI

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptiv…

13 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floati…

14 views ·
#machine learning#language models
PRIMEINTELLECT

Systematic Reward Hacking and Prime Sprints

We release tunable RL templates that demonstrate reward hacking at 1B scale and introduce Prime Sprints, an open-access program with sponsored runs for community research.…

16 views ·
#reward hacking#research
VMAX

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs.…

20 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed…

13 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once disc…

11 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a mo…

12 views ·
#artificial intelligence#machine learning#security
ARXIV CS.AI

Memory-Augmented Reinforcement Learning Agent for CAD Generation

Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large lang…

13 views ·
#artificial intelligence#computer-aided design
ARXIV CS.AI

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors requir…

15 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This i…

16 views ·
#machine learning#artificial intelligence#scientific reasoning
ARXIV CS.AI

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, …

14 views ·
#machine learning#artificial intelligence
ARXIV CS.AI

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are …

13 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

Multi-agent language systems are often built as hand-designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose N…

14 views ·
#artificial intelligence#machine learning#multiagent systems
ARXIV CS.AI

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One prima…

13 views ·
#artificial intelligence#machine learning#image generation
ARXIV CS.AI

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo samp…

15 views ·
#artificial intelligence#machine learning
ARXIV CS.AI

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism t…

17 views ·
#artificial intelligence#machine learning#multiagent systems
ARXIV CS.AI

Self-supervised Hierarchical Visual Reasoning with World Model

3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are ess…

12 views ·
#artificial intelligence#visual reasoning
ARXIV CS.AI

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short …

12 views ·
#artificial intelligence#recommendation systems
ARXIV CS.AI

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information excha…

17 views ·
#artificial intelligence#machine learning#multiagent systems
ARXIV CS.AI

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

Current reinforcement learning (RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks,…

14 views ·
#artificial intelligence#open-ended generation
ARXIV CS.AI

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on…

17 views ·
#artificial intelligence#language models