Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

AI-generated keywords: Reinforcement Learning, Large Language Models, Policy Gradient Methods, Training Stability, Stabilizing Techniques

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

- The authors propose a novel formulation for reinforcement learning (RL) with large language models.
- The formulation explains why, and under what conditions, a surrogate token-level objective in policy gradient methods such as REINFORCE optimizes the true sequence-level reward: the surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized.
- This explains the stabilizing role of importance sampling correction, clipping, and Routing Replay for Mixture-of-Experts (MoE) models.
- For on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability.
- When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness.
- Once training is stabilized, prolonged optimization yields comparable final performance regardless of cold-start initialization.


Authors: Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, An Yang, Jingren Zhou, Junyang Lin

arXiv: 2512.01374v1 (cs.LG)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
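The first-order approximation mentioned in the abstract can be illustrated with a short sketch. The notation here (a rollout policy $\mu$ that generated the response, per-token ratios $r_t$, deviations $\delta_t$) is assumed for illustration and may differ from the paper's own derivation.

```latex
% Assumed notation: \mu is the policy that generated response y (the rollout
% policy), \pi_\theta is the policy being trained, R(x, y) is the sequence reward.
\mathcal{J}^{\mathrm{seq}}(\theta)
  = \mathbb{E}_{y \sim \mu(\cdot \mid x)}
    \left[ \frac{\pi_\theta(y \mid x)}{\mu(y \mid x)} \, R(x, y) \right],
\qquad
\frac{\pi_\theta(y \mid x)}{\mu(y \mid x)} = \prod_{t=1}^{|y|} r_t(\theta),
\quad
r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}.

% Write r_t(\theta) = 1 + \delta_t and keep only first-order terms in \delta_t:
\prod_{t=1}^{|y|} (1 + \delta_t)
  = 1 + \sum_{t=1}^{|y|} \delta_t + O(\delta^2)
  \;\approx\; \sum_{t=1}^{|y|} r_t(\theta) - (|y| - 1).

% The constant |y| - 1 does not depend on \theta, so to first order the
% sequence-level objective has the same gradient as the token-level surrogate:
\nabla_\theta \mathcal{J}^{\mathrm{seq}}(\theta)
  \;\approx\; \nabla_\theta \mathcal{J}^{\mathrm{tok}}(\theta),
\qquad
\mathcal{J}^{\mathrm{tok}}(\theta)
  = \mathbb{E}_{y \sim \mu(\cdot \mid x)}
    \left[ R(x, y) \sum_{t=1}^{|y|} r_t(\theta) \right].
```

In this sketch the dropped terms are second order in the $\delta_t$, so the approximation is only good when every per-token ratio stays close to 1, that is, when $\pi_\theta$ stays close to $\mu$. The abstract names two sources of such a gap: the training-inference discrepancy and policy staleness.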

Submitted to arXiv on 01 Dec. 2025


Results of the summarizing process for the arXiv paper: 2512.01374v1


Comprehensive Summary

In their paper "Stabilizing Reinforcement Learning with LLMs: Formulation and Practices," Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, An Yang, Jingren Zhou, and Junyang Lin propose a novel formulation for reinforcement learning (RL) with large language models. The formulation explains why, and under what conditions, the true sequence-level reward can be optimized through a surrogate token-level objective in policy gradient methods such as REINFORCE. Using a first-order approximation, the authors show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This gives a principled explanation for why several widely adopted techniques stabilize RL training: importance sampling correction, clipping, and, for Mixture-of-Experts (MoE) models, Routing Replay. In extensive experiments with a 30B MoE model, totaling hundreds of thousands of GPU hours, the authors find that for on-policy training the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. The authors hope that the shared insights and the recipes they developed for stable RL training will facilitate future research.
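To make the "increasingly valid" condition concrete, here is a minimal numeric sketch. It assumes the usual reading in which the sequence-level importance ratio is the product of per-token ratios and the first-order approximation replaces that product with one plus the sum of per-token deviations. The function names and the ratio values are hypothetical and are not taken from the paper.

```python
import math

def sequence_ratio(token_ratios):
    """Exact sequence-level importance ratio: product of per-token ratios."""
    return math.prod(token_ratios)

def first_order_ratio(token_ratios):
    """First-order approximation: 1 plus the sum of per-token deviations from 1."""
    return 1.0 + sum(r - 1.0 for r in token_ratios)

def approximation_error(token_ratios):
    """Absolute gap between the exact ratio and its first-order approximation."""
    return abs(sequence_ratio(token_ratios) - first_order_ratio(token_ratios))

# Hypothetical per-token ratios for a 4-token response.
near_on_policy = [1.01, 0.99, 1.02, 0.98]  # small discrepancy, fresh policy
stale_policy = [1.5, 0.6, 1.8, 0.4]        # large discrepancy or stale policy

small_gap_error = approximation_error(near_on_policy)
large_gap_error = approximation_error(stale_policy)

print(f"error with ratios near 1:     {small_gap_error:.6f}")
print(f"error with ratios far from 1: {large_gap_error:.6f}")
```

With ratios close to 1 the two quantities nearly coincide. Once the ratios drift, the gap grows quickly, which is one way to picture why reducing the discrepancy and staleness matters for the token-level surrogate.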


Layman's Summary

Summary
- The authors offer a new way of understanding how large language models are trained with reinforcement learning, where the model is rewarded for good answers.
- They explain when it is safe to train the model word by word (token by token), as methods like REINFORCE do, even though the reward is given for the whole answer.
- Techniques called importance sampling correction, clipping, and Routing Replay help keep training stable.
- Training only on answers freshly produced by the current model (on-policy training), using a basic method plus importance sampling correction, is the most stable option.
- To learn faster by also reusing slightly older answers (off-policy updates), clipping and Routing Replay need to be combined.

Definitions
- Reinforcement learning (RL): A type of machine learning where an algorithm learns by interacting with its environment and receiving rewards or punishments based on its actions.
- Policy gradient methods: Techniques used in RL that optimize a policy by adjusting its parameters to maximize expected rewards.
- Importance sampling correction: A method from statistics and machine learning that reweights samples drawn from one distribution so that estimates reflect a different distribution.
- Clipping: Limiting values to a fixed range so that they cannot become too large or too small during training.
- Routing Replay: A stabilizing technique for Mixture-of-Experts (MoE) models that reuses the expert-routing choices recorded earlier instead of recomputing them during the training update.
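The definitions of importance sampling correction and clipping can be illustrated with a small sketch. Everything here is an assumption made for illustration: the two-action toy policies, the reward values, and the PPO-style clipping rule, which is used as a familiar stand-in and may differ from the clipping rule the paper actually uses.

```python
def expected_reward(policy, rewards):
    """Expected reward when actions are drawn from `policy`."""
    return sum(policy[a] * rewards[a] for a in policy)

def importance_corrected_reward(behavior, target, rewards):
    """Expectation under `behavior` of (target / behavior) * reward.

    Reweighting each action by target[a] / behavior[a] recovers the expected
    reward under `target`, even though actions come from `behavior`.
    """
    return sum(
        behavior[a] * (target[a] / behavior[a]) * rewards[a] for a in behavior
    )

def clip(value, low, high):
    """Limit `value` to the closed range [low, high]."""
    return max(low, min(high, value))

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped term (a stand-in, not necessarily the paper's rule)."""
    return min(ratio * advantage, clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

# Hypothetical toy setup: two actions, one rewarded.
behavior = {"a": 0.5, "b": 0.5}  # policy that generated the samples
target = {"a": 0.8, "b": 0.2}    # policy we want to evaluate or train
rewards = {"a": 1.0, "b": 0.0}

uncorrected = expected_reward(behavior, rewards)
corrected = importance_corrected_reward(behavior, target, rewards)
capped = clipped_surrogate(ratio=1.5, advantage=1.0)

print(uncorrected, corrected, capped)
```

Without the correction, the estimate describes the behavior policy rather than the target policy. The clipped term shows how a large ratio is capped so that a single sample cannot dominate an update.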

Blog article

Reinforcement learning (RL) has been a popular area of research in recent years due to its potential for solving complex decision-making problems. One major challenge in RL is training instability, which can lead to poor performance and slow convergence. In their paper "Stabilizing Reinforcement Learning with LLMs: Formulation and Practices," Chujie Zheng et al. propose a novel formulation for RL with large language models (LLMs) that addresses this issue.

The starting point is a mismatch: the reward is defined at the sequence level, while policy gradient methods such as REINFORCE are optimized with a token-level objective. Rather than replacing this surrogate, the formulation explains why, and under what conditions, it optimizes the true sequence-level reward. Through a first-order approximation, the authors show that the surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized.

This insight gives a principled explanation for the role of several widely adopted stabilization techniques: importance sampling correction, clipping, and, in particular, Routing Replay for Mixture-of-Experts (MoE) models.

In extensive experiments with a 30B MoE model, totaling hundreds of thousands of GPU hours, the authors find that for on-policy training the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. Introducing off-policy updates can accelerate convergence, but combining clipping and Routing Replay then becomes essential to mitigate the instability caused by policy staleness.

The authors also report that once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization.

Overall, the paper offers both a formulation that explains when token-level optimization is justified and practical recipes for stable RL training. The authors hope that the shared insights and recipes will facilitate future research.
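The abstract names Routing Replay without defining it, so the following toy sketch rests on an assumed reading: the experts selected for a token when the response was generated are recorded and reused in the training forward pass, instead of being recomputed from router scores that may have drifted. The layer, the function names, and all numbers are hypothetical, and the uniform averaging of expert outputs is a simplification of real MoE gating.

```python
def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts, highest first."""
    order = sorted(
        range(len(router_scores)), key=lambda i: router_scores[i], reverse=True
    )
    return order[:k]

def moe_forward(x, expert_scales, router_scores, k=2, replay_experts=None):
    """Toy MoE layer: each expert scales x; the output averages chosen experts.

    If `replay_experts` is given, those experts are reused instead of being
    recomputed from `router_scores` (the assumed Routing Replay behavior).
    """
    if replay_experts is not None:
        experts = list(replay_experts)
    else:
        experts = top_k_experts(router_scores, k)
    output = sum(expert_scales[i] * x for i in experts) / len(experts)
    return output, experts

expert_scales = [1.0, 2.0, 3.0, 4.0]

# Router scores at generation time, and slightly drifted scores at training time.
rollout_scores = [0.1, 0.7, 0.5, 0.2]
training_scores = [0.1, 0.5, 0.45, 0.55]

rollout_out, rollout_experts = moe_forward(1.0, expert_scales, rollout_scores)
drifted_out, drifted_experts = moe_forward(1.0, expert_scales, training_scores)
replayed_out, replayed_experts = moe_forward(
    1.0, expert_scales, training_scores, replay_experts=rollout_experts
)

print(rollout_experts, drifted_experts, replayed_experts)
print(rollout_out, drifted_out, replayed_out)
```

In this toy, a small change in router scores switches the active experts and changes the output, while replaying the recorded experts keeps the training pass on the same computation path as generation.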

Created on 29 Jan. 2026




Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.