The researchers introduce rStar-Math, an approach showing that small language models (SLMs) can match or even surpass the math reasoning capability of OpenAI o1 without distillation from superior models. rStar-Math achieves this through "deep thinking" via Monte Carlo Tree Search (MCTS): a math policy SLM performs test-time search guided by an SLM-based process reward model. Three innovations address the challenges of training the two SLMs.
First, a code-augmented chain-of-thought (CoT) data synthesis method runs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories for training the policy SLM. Second, a process reward model training method avoids naive step-level score annotation and yields a more effective process preference model (PPM). Third, a self-evolution strategy builds the policy SLM and the PPM from scratch and iteratively evolves them to improve their reasoning capabilities.
Through four rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math raises SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by 4.5 and 0.9 percentage points, respectively. On the American Invitational Mathematics Examination (AIME), a qualifying exam for the USA Math Olympiad, rStar-Math solves an average of 53.3% (8/15) of problems, placing it among the top 20% of the brightest high school math students.
The approach uses two 7B SLMs (a policy SLM and a process reward model) to generate higher-quality training data through extensive MCTS rollouts on accessible hardware. SLMs produce correct solutions and correct intermediate steps less reliably than advanced models such as GPT-4, so the code-augmented CoT synthesis method and the self-evolution recipe are designed to compensate on challenging problems. MCTS combined with code-augmented CoT synthesis yields step-by-step verified reasoning trajectories annotated with per-step Q-values. Filtering out low-quality generations and running extensive rollouts makes those Q-values more reliable, which in turn produces a high-quality training set for improving SLM performance on complex math problems.
Overall, rStar-Math shows that MCTS-based data synthesis combined with iterative self-evolution can bring small language models to strong results on math benchmarks and competition problems.
- Introduction of rStar-Math, a novel approach utilizing small language models (SLMs) for math reasoning capabilities comparable to or surpassing OpenAI o1
- Utilization of "deep thinking" via Monte Carlo Tree Search (MCTS) for test-time search guided by an SLM-based process reward model
- Three key innovations to address training challenges:
  - Novel code-augmented CoT data synthesis method using MCTS rollouts for verified reasoning trajectories
  - Unique process reward model training method eliminating naive step-level score annotations
  - Self-evolution strategy enhancing reasoning capabilities through iterative evolution of policy SLM and PPM
- Significant improvement in SLMs' math reasoning abilities through four rounds of self-evolution with millions of synthesized solutions for math problems
- Performance gains on the MATH benchmark and on the American Invitational Mathematics Examination (AIME), a qualifying exam for the USA Math Olympiad
- Methodology involving two 7B SLMs generating higher-quality training data through extensive MCTS rollouts on accessible hardware
- Implementation of innovative strategies like code-augmented CoT synthetic methods and self-evolution recipes to overcome challenges faced by SLMs compared to advanced models like GPT-4
- Detailed explanation provided on generating step-by-step verified reasoning trajectories with per-step Q-value annotations using MCTS and code-augmented CoT synthesis methods
- Enhancement of Q-value accuracy reliability through filtering out low-quality generations and conducting extensive rollouts for constructing high-quality training sets
- Showcase of significant advancements in enhancing small language models' math reasoning capabilities through innovative methodologies and iterative self-evolution strategies, achieving impressive results in mathematics problem-solving.
Summary
- rStar-Math is a new way to solve math problems using small language models (SLMs) that can reason as well as OpenAI o1 or even better.
- It uses Monte Carlo Tree Search (MCTS) to think deeply while answering a question, with a second SLM (the process reward model) guiding the search.
- Three important new ideas make training work:
  - A way to create step-by-step (CoT) training data with code, using MCTS to find verified reasoning paths
  - A different way to train the reward model without simple per-step score labels
  - A plan for improving the policy SLM and the PPM together over several rounds
- The SLMs got much better at math after four rounds of self-improvement with millions of practice solutions.
- They did very well on the MATH benchmark and on AIME competition problems thanks to these improvements.
Definitions
- Small language models (SLMs): Programs that understand and generate human-like text but are smaller in size compared to bigger models like GPT-4.
- Monte Carlo Tree Search (MCTS): A search method that explores a tree of possible decisions by simulating many outcomes through random sampling.
- Process reward model: A model that scores each intermediate step of a solution, rather than only the final answer.
- CoT data synthesis: Generating chain-of-thought (step-by-step) solutions to use as training data, here through MCTS rollouts that produce verified reasoning paths.
- Self-evolution strategy: Improving abilities over time through iterative changes and adaptations.
Introduction
In recent years, there has been growing interest in developing artificial intelligence (AI) systems that can solve complex mathematical problems. However, most existing approaches rely on large language models such as GPT-4, either directly or as a source of distilled reasoning data. This makes it hard to improve smaller language models (SLMs) when no stronger model is available for distillation. In response, researchers have introduced rStar-Math, an approach that demonstrates how SLMs can achieve math reasoning capabilities comparable to or even surpassing those of OpenAI o1 without distillation from superior models.
The Problem
The main challenge faced by SLMs is their limited ability to generate correct solutions and intermediate steps compared to larger and more advanced language models like GPT-4. This makes it difficult for them to perform well on complex mathematical problems. Additionally, the lack of access to advanced models for distillation further hinders their performance.
The Solution: rStar-Math
To address these challenges, the researchers behind rStar-Math introduce three key innovations in training two SLMs: a policy SLM and a process reward model (PRM).
Innovation 1: Code-Augmented CoT Data Synthesis Method
The first innovation proposed by the researchers is a code-augmented CoT data synthesis method that utilizes extensive Monte Carlo Tree Search (MCTS) rollouts to generate step-by-step verified reasoning trajectories for training the policy SLM.
The rollouts run on accessible hardware using two 7B SLMs: the policy SLM, which proposes candidate reasoning steps, and the process reward model, which scores them. Each candidate step pairs its reasoning with code, and steps that fail verification are filtered out. The rollouts then assign a Q-value to every remaining step, producing high-quality training data with per-step Q-value annotations.
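The verification idea can be sketched as follows. This is a minimal illustration, assuming each candidate step carries a Python snippet alongside its natural-language rationale and that a step is kept only if its snippet runs without error. The `Step` class and `keep_executable_steps` helper are hypothetical names, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Step:
    rationale: str  # natural-language reasoning for this step
    code: str       # Python snippet that carries out the step


def keep_executable_steps(steps):
    """Run each step's code in a shared namespace; drop steps whose code raises."""
    namespace = {}
    kept = []
    for step in steps:
        trial = dict(namespace)  # copy so a failing step cannot corrupt state
        try:
            # Illustration only: never exec untrusted code outside a sandbox.
            exec(step.code, trial)
        except Exception:
            continue  # verification failed, discard this candidate
        namespace = trial
        kept.append(step)
    return kept, namespace


steps = [
    Step("Let the legs be 3 and 4.", "a, b = 3, 4"),
    Step("A faulty candidate step.", "c = a / 0"),
    Step("Apply the Pythagorean theorem.", "c = (a**2 + b**2) ** 0.5"),
]
kept, state = keep_executable_steps(steps)
print(len(kept), state["c"])
```

Here the faulty middle step raises `ZeroDivisionError` and is discarded, while the two valid steps survive. Successful execution only shows that a step is runnable, not that it is mathematically correct, which is why the Q-values from rollouts are still needed.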
Innovation 2: Process Reward Model Training Method
The second innovation is a unique process reward model training method that eliminates the need for naive step-level score annotations. This results in a more effective process preference model (PPM) that guides the policy SLM during test-time search.
Rather than treating each step's Q-value as a precise score, this method uses the Q-values from MCTS rollouts to tell better steps from worse ones and trains the PPM on the resulting preference pairs. This eliminates the need for manual step-level annotation and makes training less sensitive to noise in individual Q-values.
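One way to picture preference-based training is shown below. The sketch assumes that, for each position in a solution, the highest-Q candidate steps are paired against the lowest-Q ones and that the reward model is trained with a Bradley-Terry style pairwise ranking loss. The function names, the choice of `k=2`, and the example Q-values are illustrative assumptions.

```python
import math


def build_preference_pairs(candidates, k=2):
    """Pair the k highest-Q steps (preferred) with the k lowest-Q steps.

    candidates: list of (step_text, q_value); assumes at least 2 * k entries.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    positives, negatives = ranked[:k], ranked[-k:]
    return [(p[0], n[0]) for p in positives for n in negatives]


def pairwise_ranking_loss(score_pos, score_neg):
    """-log(sigmoid(score_pos - score_neg)), written as log(1 + exp(-diff))."""
    return math.log1p(math.exp(-(score_pos - score_neg)))


candidates = [("s1", 0.9), ("s2", 0.7), ("s3", -0.4), ("s4", -0.8)]
pairs = build_preference_pairs(candidates)
print(pairs)
print(pairwise_ranking_loss(2.0, 0.0), pairwise_ranking_loss(0.0, 2.0))
```

The loss is small when the model scores the preferred step higher and large when the order is reversed. Only the relative order of Q-values matters when building pairs, so moderate noise in their exact magnitudes does not change the training signal.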
Innovation 3: Self-Evolution Strategy
Lastly, rStar-Math implements a self-evolution strategy where both the policy SLM and PPM are built from scratch and iteratively evolved to enhance their reasoning capabilities. Through four rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math significantly boosts SLMs' math reasoning abilities to state-of-the-art levels.
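The overall loop can be outlined as a short sketch. It assumes each round uses the current models to synthesize data and then retrains both models on it. The `self_evolve` function and the toy stand-ins below are hypothetical and only show the control flow, not the actual training procedure.

```python
def self_evolve(problems, policy, reward_model, generate,
                train_policy, train_reward, rounds=4):
    """Alternate data synthesis and retraining for a fixed number of rounds."""
    history = []
    for round_idx in range(1, rounds + 1):
        # 1. Use the current models to synthesize verified solutions.
        data = generate(problems, policy, reward_model)
        # 2. Retrain both models on the newly synthesized data.
        policy = train_policy(policy, data)
        reward_model = train_reward(reward_model, data)
        history.append((round_idx, len(data)))
    return policy, reward_model, history


# Toy stand-ins: a "model" is just an integer skill level, and a problem is
# "solved" when its difficulty is at most one above the current skill.
problems = [1, 2, 3, 4, 5]
generate = lambda probs, pol, rm: [p for p in probs if p <= pol + 1]
train = lambda model, data: max(model, len(data))

policy, ppm, history = self_evolve(problems, 0, 0, generate, train, train)
print(history)
```

In this toy setup each round solves one more problem than the previous one, which mirrors the intended effect: stronger models produce better data, and that data covers harder problems in the next round.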
Results
rStar-Math is evaluated on the MATH benchmark and on competition problems from the American Invitational Mathematics Examination (AIME), a qualifying exam for the USA Math Olympiad.
On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by 4.5 and 0.9 percentage points, respectively. On AIME, rStar-Math solves an average of 53.3% (8/15) of problems, placing it among the top 20% of the brightest high school math students.
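The reported figures can be cross-checked with simple arithmetic. The snippet below uses only the numbers quoted above; the dictionary layout and variable names are illustrative.

```python
results = {
    "Qwen2.5-Math-7B": {"before": 58.8, "after": 90.0, "margin": 4.5},
    "Phi3-mini-3.8B": {"before": 41.4, "after": 86.4, "margin": 0.9},
}

gains = {}
implied_o1_preview = {}
for name, r in results.items():
    # Absolute improvement in percentage points.
    gains[name] = round(r["after"] - r["before"], 1)
    # o1-preview score implied by the reported margin.
    implied_o1_preview[name] = round(r["after"] - r["margin"], 1)
    print(f"{name}: +{gains[name]} points, "
          f"implied o1-preview score {implied_o1_preview[name]}%")

aime_rate = round(8 / 15 * 100, 1)
print(f"AIME: {aime_rate}%")
```

Both margins imply the same o1-preview score of 85.5%, so the two comparisons are consistent with each other, and 8 of 15 problems does round to 53.3%.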
Methodology
To achieve these results, rStar-Math relies on two key components: extensive MCTS rollouts and code-augmented CoT data synthesis.
Extensive MCTS Rollouts
MCTS is a search algorithm that simulates different actions and outcomes to find the best possible solution for a given problem. In rStar-Math, MCTS rollouts are used to generate high-quality training data with per-step Q-value annotations.
Conducting extensive rollouts and filtering out low-quality generations makes the Q-values more reliable. The result is a high-quality training set that improves SLM performance on complex mathematical problems.
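How rollouts turn into per-step Q-values can be illustrated with a small sketch. It assumes standard UCT selection and a terminal reward of +1 for a correct final answer and -1 for an incorrect one, with a step's Q-value taken as the average reward of the rollouts passing through it. The `Node` class, the exploration constant `c=1.4`, and the reward values are assumptions for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    step: str
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

    @property
    def q_value(self):
        """Average terminal reward over rollouts that passed through this step."""
        return self.total_reward / self.visits if self.visits else 0.0


def uct(node, c=1.4):
    """Standard UCT score: exploitation (Q) plus an exploration bonus."""
    if node.visits == 0:
        return math.inf  # always try unvisited steps first
    return node.q_value + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def select_child(node):
    return max(node.children, key=uct)


def backpropagate(leaf, reward):
    """Push a terminal reward up from the leaf to the root."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


root = Node("problem")
good = Node("step A", parent=root)
bad = Node("step B", parent=root)
root.children = [good, bad]
for reward in (1, 1, -1):  # three rollouts through step A
    backpropagate(good, reward)
backpropagate(bad, -1)     # one rollout through step B
print(good.q_value, bad.q_value, select_child(root).step)
```

With more rollouts, each step's average settles closer to how often it actually leads to a correct answer, which is the sense in which extensive rollouts make Q-values more reliable.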
Code-Augmented CoT Data Synthesis
The code-augmented CoT data synthesis method uses two 7B SLMs, the policy SLM and the process reward model, to generate step-by-step verified reasoning trajectories. These trajectories are then used for training the policy SLM.
Through this method, rStar-Math is able to generate higher-quality training data compared to traditional methods, resulting in improved performance on challenging mathematical problems.
Conclusion
In conclusion, rStar-Math showcases significant advancements in enhancing small language models' math reasoning capabilities through innovative methodologies and iterative self-evolution strategies. By utilizing extensive MCTS rollouts and code-augmented CoT data synthesis, it achieves impressive results on various benchmark tests and competitions within the field of mathematics problem-solving. This research opens up new possibilities for smaller language models to excel in complex tasks without relying on distillation from larger and more advanced models.