Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

AI-generated keywords: Large Reasoning Models, Adaptive Thinking, AutoThink, Reinforcement Learning, Efficiency

AI-generated Key Points

  • Large Reasoning Models (LRMs) can generate explicit reasoning sequences before arriving at final answers
  • Detailed reasoning process can lead to computational overhead and latency, especially for simpler problems
  • Introducing adaptive thinking in LRMs to dynamically decide if explicit reasoning is necessary based on problem complexity
  • Incorporating a simple ellipsis ("...") into the prompt triggers either a thinking or no-thinking mode in the model
  • AutoThink is a multi-stage reinforcement learning (RL) framework that optimizes reasoning policies through stage-wise reward shaping
  • AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods on mainstream mathematical benchmarks
  • AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B
  • Limitations include incomplete behavioral separation between thinking and answering and the use of unfiltered training data; future research directions target further performance enhancements

Authors: Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, Dongbin Zhao

Project Page: https://github.com/TU2021/AutoThink
License: CC BY-SA 4.0

Abstract: Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.
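The ellipsis trick described in the abstract can be illustrated with a minimal sketch. It assumes an R1-style template in which reasoning is wrapped in `<think>` and `</think>` tags and the ellipsis is placed right after the opening tag; the exact template, the function names, and the rule used to label a completion are assumptions made for illustration, not details confirmed by this page.

```python
# Hypothetical sketch: tag names and ellipsis placement are assumptions,
# not taken from the AutoThink code base.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def build_ellipsis_prompt(question: str) -> str:
    """Return the question followed by an opening think tag and an ellipsis."""
    return f"{question}\n{THINK_OPEN}\n...\n"


def classify_mode(completion: str) -> str:
    """Label a completion as 'thinking' or 'no-thinking'.

    Assumed rule: a no-thinking completion closes the think block before
    writing any reasoning text; anything else counts as thinking.
    """
    head, sep, _ = completion.partition(THINK_CLOSE)
    if not sep:
        # The think block was never closed (for example, a truncated output).
        return "thinking"
    return "thinking" if head.strip() else "no-thinking"


if __name__ == "__main__":
    print(build_ellipsis_prompt("What is 2 + 2?"))
    print(classify_mode("</think>\nThe answer is 4."))
    print(classify_mode("First, add the two numbers.\n</think>\nThe answer is 4."))
```

Sampling several completions for the same prompt and counting the two labels would give a rough estimate of how often each mode is triggered.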

Submitted to arXiv on 16 May 2025

Results of the summarizing process for the arXiv paper: 2505.10832v1

In this work, we examine Large Reasoning Models (LRMs) and their ability to generate explicit reasoning sequences before arriving at final answers. This detailed reasoning can introduce computational overhead and latency, especially for simpler problems. To tackle this overthinking issue, we introduce adaptive thinking in LRMs: by equipping LRMs to decide dynamically whether explicit reasoning is necessary based on problem complexity, we aim to improve efficiency without compromising accuracy. Our approach starts from the observation that inserting a simple ellipsis ("...") into the prompt stochastically triggers either a thinking or a no-thinking mode in the model, which reveals a latent controllability in the reasoning behavior of LRMs. Building on R1-style distilled models, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that optimizes reasoning policies through stage-wise reward shaping. AutoThink learns to engage in explicit reasoning only when essential, defaulting to succinct responses for simpler tasks.

Experimental results on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.

While AutoThink showcases promising adaptive reasoning capabilities, it also has limitations. These include potential reward hacking, where the behavioral separation between thinking and answering may be incomplete, and the use of training data from the DeepScaleR dataset without task-difficulty filtering. Future research could explore budget-aware CoT generation and curriculum-based filtering for further performance enhancements. Additional analyses of reasoning behaviors, training cost considerations, and a case study are presented in Appendix B due to space constraints. The related works section covers existing RL-based post-training techniques for LLMs and strategies to mitigate overthinking in LRMs through self-generated short CoT signals and pseudo-thinking cues in prompts.

In conclusion, our study introduces an approach to improving efficiency in LRMs through adaptive thinking guided by a multi-stage RL framework, addressing the challenge of overthinking while maintaining performance.
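To make the headline figures concrete, the sketch below shows how a relative change is computed. The baseline and post-training numbers are hypothetical placeholders chosen only so that the arithmetic reproduces a 6.4 percent relative gain and a 52 percent reduction; they are not results from the paper, and `relative_change` is an illustrative helper name.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return (new - baseline) / baseline, the change relative to the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical values for illustration only, NOT taken from the paper.
    baseline_accuracy, new_accuracy = 50.0, 53.2    # accuracy in percent
    baseline_tokens, new_tokens = 10_000.0, 4_800.0  # average tokens per response

    print(f"accuracy: {relative_change(baseline_accuracy, new_accuracy):+.1%}")
    print(f"tokens:   {relative_change(baseline_tokens, new_tokens):+.1%}")
```

Note that a relative gain is measured against the baseline value, so a 6.4 percent relative improvement is not the same as a gain of 6.4 accuracy points.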
Created on 21 May 2025
