Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

AI-generated keywords: Light-R1

AI-generated Key Points

  • Light-R1 is an open-source suite for training long reasoning models in a reproducible and cost-effective manner
  • Curriculum training with increasing data difficulty and multi-staged post-training techniques are key components of the methodology
  • The Light-R1-32B model outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning
  • Fine-tuning DeepSeek-R1-Distilled models with 3,000 challenging examples from the curriculum dataset leads to state-of-the-art 7B and 14B models
  • The final model, Light-R1-14B-DS, achieves state-of-the-art performance among 14B models in math, with AIME24 and AIME25 scores of 74.0 and 60.2, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B
  • Light-R1 demonstrates strong cross-domain generalization capabilities
  • Models, training data, and code are openly available at https://github.com/Qihoo360/Light-R1

Authors: Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang

v4: ACL'25 industry track camera-ready; v3: minor modifications; v2: better writing and format for later submission; all released at https://github.com/Qihoo360/Light-R1
License: CC BY 4.0

Abstract: This paper introduces Light-R1, an open-source suite for training long reasoning models using a reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by the DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 and AIME25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.
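The abstract describes curriculum training in which data difficulty increases from stage to stage. A minimal sketch of that idea follows. It assumes each example carries a hypothetical `pass_rate` field (the fraction of sampled answers that were correct) as a difficulty proxy, and that each stage keeps only examples at or below a pass-rate threshold. The field name, the thresholds, and the filtering rule are illustrative assumptions, not details taken from this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    prompt: str
    # Hypothetical difficulty proxy: lower pass rate means a harder problem.
    pass_rate: float


def build_curriculum(examples: List[Example],
                     stage_thresholds: List[float]) -> List[List[Example]]:
    """Return one training set per stage.

    Thresholds are expected in decreasing order, so each later stage
    keeps a smaller, harder subset of the data (an assumed scheme).
    """
    stages = []
    for threshold in stage_thresholds:
        stages.append([ex for ex in examples if ex.pass_rate <= threshold])
    return stages


# Toy data: three problems of decreasing pass rate (increasing difficulty).
data = [
    Example("easy problem", 0.9),
    Example("medium problem", 0.5),
    Example("hard problem", 0.1),
]
stages = build_curriculum(data, stage_thresholds=[1.0, 0.5, 0.2])
```

With these toy thresholds the first stage sees all three problems, the second drops the easy one, and the last keeps only the hardest, which mirrors the "small set of challenging examples" used in the final fine-tuning stage.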

Submitted to arXiv on 13 Mar. 2025

Results of the summarizing process for the arXiv paper: 2503.10460v4

In this paper, we introduce Light-R1, an open-source suite designed for training long reasoning models in a reproducible and cost-effective manner. Our methodology involves curriculum training that gradually increases the difficulty of the data, coupled with multi-staged post-training techniques. The Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that the curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages. Fine-tuning DeepSeek-R1-Distilled models with 3,000 challenging examples from our curriculum dataset yields state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performs comparably to QwQ-32B and DeepSeek-R1. We further extend this work by applying GRPO to long reasoning models. Our final model, Light-R1-14B-DS, achieves state-of-the-art performance among 14B models in math, with AIME24 and AIME25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite its math-focused training, Light-R1-14B-DS shows strong cross-domain generalization. Overall, Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data, and code are openly available at https://github.com/Qihoo360/Light-R1.
Created on 05 Feb. 2026
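The summary mentions applying GRPO to long reasoning models. The core of GRPO, as commonly described, is a group-relative advantage: several responses are sampled for the same prompt, and each response's reward is normalized by the mean and standard deviation of the rewards in its group. The sketch below illustrates only that normalization step. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration, not details of the Light-R1 code.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one prompt's sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math problem: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If all answers get the same reward, no response is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In the first call the correct answers receive an advantage close to +1 and the wrong ones close to -1; in the second, every advantage is zero, so that prompt contributes no learning signal.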
