Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

AI-generated keywords: Large Reasoning Models, Adaptive Thinking, AutoThink, Reinforcement Learning, Efficiency

AI-generated Key Points

- Large Reasoning Models (LRMs) generate explicit reasoning sequences before arriving at final answers
- Detailed reasoning can add computational overhead and latency, especially for simpler problems
- Adaptive thinking lets an LRM dynamically decide whether explicit reasoning is necessary based on problem complexity
- Inserting a simple ellipsis ("...") into the prompt stochastically triggers either a thinking or a no-thinking mode in the model
- AutoThink is a multi-stage reinforcement learning (RL) framework that optimizes reasoning policies through stage-wise reward shaping
- AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods on five mainstream mathematical benchmarks
- AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B
- Limitations include possible reward hacking from incomplete behavioral separation between thinking and answering, and training on DeepScaleR data without task difficulty filtering; future work could explore budget-aware CoT generation and curriculum-based filtering


Authors: Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, Dongbin Zhao

arXiv: 2505.10832v1 (cs.CL)

Project Page: https://github.com/TU2021/AutoThink

License: CC BY-SA 4.0

Abstract: Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.
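The headline figures in the abstract are relative changes, which are easy to misread as absolute percentage points. The short sketch below shows the arithmetic only. The baseline and post-training numbers are hypothetical values chosen so that the result matches the reported 6.4 and 52 percent; they are not taken from the paper, and `relative_change` is an illustrative helper name.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return the change from `baseline` to `new` as a percentage of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Hypothetical numbers, chosen only to illustrate the arithmetic (not from the paper).
baseline_accuracy = 50.0   # accuracy in percent before training
new_accuracy = 53.2        # accuracy in percent after training
baseline_tokens = 10_000   # average tokens per response before training
new_tokens = 4_800         # average tokens per response after training

accuracy_gain = relative_change(baseline_accuracy, new_accuracy)
token_change = relative_change(baseline_tokens, new_tokens)

print(f"relative accuracy change: {accuracy_gain:.1f}%")
print(f"relative token change: {token_change:.1f}%")
```

With these illustrative inputs, a 6.4 percent relative gain corresponds to only 3.2 absolute accuracy points, which is why the distinction matters when comparing methods.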

Submitted to arXiv on 16 May 2025


Results of the summarizing process for the arXiv paper: 2505.10832v1


This work studies Large Reasoning Models (LRMs), which generate explicit reasoning sequences before arriving at final answers. Such detailed reasoning can add substantial computational overhead and latency, especially for simpler problems. To tackle this overthinking problem, we introduce adaptive thinking in LRMs: the model dynamically decides whether explicit reasoning is necessary based on problem complexity, with the aim of improving efficiency without compromising accuracy.

Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt stochastically triggers either a thinking or a no-thinking mode, which reveals a latent controllability in the reasoning behavior of LRMs. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that optimizes reasoning policies through stage-wise reward shaping. AutoThink learns to engage in explicit reasoning only when essential, defaulting to succinct responses for simpler tasks.

Experimental results on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.

While AutoThink showcases promising adaptive reasoning capabilities, there are potential challenges that need to be addressed. These include potential reward hacking, where the behavioral separation between thinking and answering may be incomplete, and the use of training data from the DeepScaleR dataset without task difficulty filtering. Future research directions could explore budget-aware CoT generation and curriculum-based filtering for further performance enhancements. Additional analyses of reasoning behaviors, training cost considerations, and a case study are presented in Appendix B due to space constraints. The related works section highlights existing RL-based post-training techniques for LLMs and strategies to mitigate overthinking in LRMs through self-generated short CoT signals and pseudo-thinking cues in prompts.

In conclusion, our study introduces an approach to enhancing efficiency in LRMs through adaptive thinking mechanisms guided by a multi-stage RL framework. By addressing the challenge of overthinking while maintaining performance standards, this work opens up new possibilities for future developments in the field of large reasoning models.
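The ellipsis prompting described in the summary can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary only says the ellipsis is inserted "into the prompt", so placing it directly after an opening `<think>` tag is an assumption, the chat template is omitted, and `build_prompt` and `classify_mode` are hypothetical helper names. The sketch relies only on the fact that R1-style models delimit their reasoning with `<think>` and `</think>` tags.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
ELLIPSIS = "..."


def build_prompt(question: str) -> str:
    """Build a simplified prompt that ends with an ellipsis inside the think block.

    Assumption: the ellipsis goes right after the opening think tag. The
    summary does not state the exact position.
    """
    return f"{question}\n{THINK_OPEN}\n{ELLIPSIS}\n"


def classify_mode(completion: str, max_no_think_chars: int = 0) -> str:
    """Label the text generated after the prompt as 'thinking' or 'no-thinking'.

    A completion counts as 'no-thinking' when the model closes the think block
    after producing at most `max_no_think_chars` characters of reasoning.
    """
    reasoning, closing_tag, _answer = completion.partition(THINK_CLOSE)
    if not closing_tag:
        # The think block was never closed: treat it as (unfinished) thinking.
        return "thinking"
    if len(reasoning.strip()) <= max_no_think_chars:
        return "no-thinking"
    return "thinking"


prompt = build_prompt("What is 2 + 2?")
short = classify_mode("</think>\nThe answer is 4.")
long = classify_mode("Let me add the numbers step by step.\n</think>\nThe answer is 4.")
```

Sampling several completions for the same prompt and counting the two labels would give a rough estimate of how often each mode is triggered, which is the stochastic behavior the summary refers to.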


Summary
- Big thinking models can figure out step-by-step reasoning before giving final answers.
- Thinking too much about each step can make it take longer to find the answer, especially for easy problems.
- Making big thinking models smarter by deciding when to think a lot based on how hard the problem is.
- Using "..." in a question can make the model decide if it needs to think or not.
- AutoThink is a smart way of teaching computers to reason better and faster.

Definitions
- Large Reasoning Models (LRMs): Big computer programs that think through problems step by step.
- Latency: The time it takes for something to happen, like finding an answer on a computer.
- Adaptive thinking: Being able to change how you think based on the problem you're trying to solve.
- Ellipsis: Three dots ("...") used in writing to show that something has been left out or there's more to come.
- Reinforcement Learning (RL): A way of teaching computers by rewarding them when they do something right.

Large reasoning models (LRMs) have been gaining popularity in recent years due to their ability to generate explicit reasoning sequences before arriving at final answers. This detailed reasoning process helps with complex problems, but it can also lead to computational overhead and latency, especially for simpler problems. To address this issue, researchers have introduced the concept of adaptive thinking in LRMs. In their research paper titled "Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL," the authors propose a new approach that equips LRMs with the capability to dynamically decide whether explicit reasoning is necessary based on problem complexity. This aims to enhance efficiency while maintaining accuracy.

The Approach

The approach involves inserting a simple ellipsis ("...") into the prompt, which stochastically triggers either a thinking or a no-thinking mode in the model. This reveals a latent controllability in the reasoning behavior of LRMs. The authors build upon R1-style distilled models and propose AutoThink, a multi-stage reinforcement learning (RL) framework that optimizes reasoning policies through stage-wise reward shaping.

How Does AutoThink Work?

AutoThink learns to engage in explicit reasoning only when essential, defaulting to succinct responses for simpler tasks. It does this by using RL to progressively optimize the reasoning policy across training stages, with the reward shaped differently at each stage. It can be integrated into any R1-style model, including both distilled and further fine-tuned variants.

Experimental Results

To evaluate the effectiveness of AutoThink, experiments were conducted on five mainstream mathematical benchmarks. The results demonstrated that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. Notably, on the DeepSeek-R1-Distill-Qwen-1.5B model, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent, establishing itself as a scalable and adaptive reasoning paradigm for LRMs.

Potential Challenges and Future Directions

While AutoThink showcases promising adaptive reasoning capabilities, there are potential challenges that need to be addressed. These include potential reward hacking, where the behavioral separation between thinking and answering may be incomplete, and the use of training data from the DeepScaleR dataset without task difficulty filtering. Future research directions could explore budget-aware CoT generation and curriculum-based filtering for further performance enhancements. Additional analyses of reasoning behaviors, training cost considerations, and a case study are presented in Appendix B of the paper due to space constraints.

Related Works

The related works section highlights existing RL-based post-training techniques for LLMs and strategies to mitigate overthinking in LRMs through self-generated short CoT signals and pseudo-thinking cues in prompts. This shows how AutoThink builds upon previous research efforts while also introducing new ideas for enhancing efficiency in LRMs.

Conclusion

In conclusion, the paper introduces an approach to enhancing efficiency in LRMs through adaptive thinking mechanisms guided by a multi-stage RL framework. By addressing the challenge of overthinking while maintaining performance standards, this research opens up new possibilities for future developments in the field of large reasoning models.
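To make the idea of stage-wise reward shaping more concrete, here is a toy sketch. The summaries on this page do not describe the rewards AutoThink uses at each stage, so everything below is invented for illustration: the number of stages, the `StageConfig` fields, the penalty values, and the `shaped_reward` function are all hypothetical and should not be read as the paper's method. The sketch only shows the general pattern of keeping a correctness reward fixed while changing the shaping terms from one training stage to the next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical shaping terms for one training stage."""
    name: str
    think_penalty: float   # subtracted when the response uses explicit reasoning
    length_penalty: float  # subtracted per 1,000 generated tokens


# Hypothetical schedule, invented for illustration (not from the paper).
STAGES = [
    StageConfig("stage-1", think_penalty=0.0, length_penalty=0.0),
    StageConfig("stage-2", think_penalty=0.1, length_penalty=0.0),
    StageConfig("stage-3", think_penalty=0.1, length_penalty=0.05),
]


def shaped_reward(correct: bool, used_thinking: bool, num_tokens: int,
                  stage: StageConfig) -> float:
    """Correctness reward minus the stage-specific shaping penalties."""
    reward = 1.0 if correct else 0.0
    if used_thinking:
        reward -= stage.think_penalty
    reward -= stage.length_penalty * (num_tokens / 1000.0)
    return reward


# The same correct, 2,000-token thinking response scored under each stage.
rewards = [shaped_reward(True, True, 2000, stage) for stage in STAGES]
```

In this toy schedule, later stages make long, explicit reasoning progressively less rewarding, so a policy trained on it would be pushed toward short answers whenever they remain correct.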

Created on 21 May 2025


Similar papers summarized with our AI tools

67.1%

SEAL: Steerable Reasoning Calibration of Large Language Models for Free

cs.CL

63.3%

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by L…

cs.CL

62.6%

Emergent Abilities of Large Language Models

cs.CL

61.4%

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

cs.CL

61.2%

Large Language Models Cannot Self-Correct Reasoning Yet

cs.CL

61.1%

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL

60.7%

Constitutional AI: Harmlessness from AI Feedback

cs.CL

