RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

AI-generated keywords: Math reasoning tasks, Synthetic data, Language models, Negative responses, Positive synthetic data

AI-generated Key Points

  • Authors explore training language models on model-generated synthetic data for math reasoning tasks
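  • Sampling more correct solutions from the finetuned learner and fine-tuning on this self-generated data doubles the efficiency of the same synthetic problems
  • Adding negative responses (model-generated responses that a final-answer verifier deems incorrect), constructed so that training can recover the advantage of each intermediate step, gives consistent gains over positive-only data, comparable to using 8× more synthetic data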
  • Comparison between positive synthetic data from larger models like GPT-4 and Gemini 1.5 Pro with self-generated positive data highlights benefits of learning generalizable features
  • Training on both positive and negative synthetic data enhances reasoning abilities, mitigates biases, and prevents undesirable memorization

Authors: Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, Aviral Kumar

License: CC BY 4.0

Abstract: Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself followed by subsequent fine-tuning on this self-generated data $\textbf{doubles}$ the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final answer verifier. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by $\mathbf{8 \times}$. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits robustness benefits of RL over imitating positive data alone.
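The abstract describes sampling solutions from the finetuned learner and labeling them with a final-answer verifier. The sketch below illustrates that split into positives and negatives. It is a minimal illustration, not the authors' code: `split_by_verifier`, `extract_final_answer`, the `sampler` callable, and the `Answer:` tag convention are all hypothetical names and assumptions introduced here.

```python
from typing import Callable, List, Tuple


def extract_final_answer(solution: str) -> str:
    """Assumed convention: the final answer follows the last 'Answer:' tag."""
    return solution.rsplit("Answer:", 1)[-1].strip()


def split_by_verifier(
    problem: str,
    reference_answer: str,
    sampler: Callable[[str], str],
    num_samples: int,
) -> Tuple[List[str], List[str]]:
    """Sample solutions and split them by final-answer correctness.

    `sampler` stands in for the finetuned learner: it maps a problem
    to one sampled solution string.
    """
    positives: List[str] = []
    negatives: List[str] = []
    for _ in range(num_samples):
        solution = sampler(problem)
        if extract_final_answer(solution) == reference_answer:
            positives.append(solution)  # candidate self-generated positive
        else:
            negatives.append(solution)  # kept as a negative response
    return positives, negatives
```

In this sketch the positives would feed a further round of fine-tuning, while the negatives are retained rather than discarded so that they can later be used with per-step credit assignment.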

Submitted to arXiv on 20 Jun. 2024


Results of the summarizing process for the arXiv paper: 2406.14532v1

This paper studies when training language models on model-generated synthetic data helps or hurts math reasoning. The authors first finetune LLMs on synthetic correct (positive) problem-solution pairs generated by capable models, which yields modest gains. They then find that sampling more correct solutions from the finetuned learner itself and fine-tuning on this self-generated data doubles the efficiency of the same synthetic problems. The study also identifies a pitfall of training on model-generated positives, namely that it can amplify spurious correlations, and addresses it with negative responses, i.e., model-generated responses that a final-answer verifier deems incorrect. When these negatives are constructed so that training can recover the utility or advantage of each intermediate step, the authors obtain consistent gains over positive-only data. The paper also reviews related work and compares how performance scales with positive synthetic data from larger models such as GPT-4 and Gemini 1.5 Pro versus self-generated positive data. Finally, working within an offline preference optimization framework, it shows that training on per-step negatives is equivalent to advantage-weighted reinforcement learning. Overall, the analysis indicates that training on both positive and negative synthetic data can improve reasoning while reducing the biases and spurious correlations that arise from relying on positive responses alone.

  • Math reasoning tasks: Training language models on model-generated synthetic data for mathematical problem-solving.
  • Synthetic data: Generated by proficient models to improve performance gains when finetuning LLMs.
  • Language models: Trained on both positive and negative synthetic data to enhance reasoning abilities.
  • Negative responses: Introduced to mitigate potential pitfalls of training solely on model-generated positives.
  • Positive synthetic data: Used in comparison with self-generated positive data to highlight the benefits of learning generalizable features and preventing undesirable memorization.
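The summary refers to recovering the advantage of each intermediate step in a negative response. One way to picture this is a Monte Carlo estimate: roll out completions from each prefix of the solution, measure how often they end correctly, and take the change in that success rate across a step as the step's advantage. The sketch below assumes this estimator; the function names and the `rollout` and `is_correct` callables are hypothetical, and the paper's exact estimator and rollout policy may differ.

```python
from typing import Callable, List, Sequence


def estimate_value(
    prefix_steps: Sequence[str],
    rollout: Callable[[Sequence[str]], str],
    is_correct: Callable[[str], bool],
    num_rollouts: int,
) -> float:
    """Fraction of sampled completions from this prefix that end correctly."""
    wins = sum(is_correct(rollout(prefix_steps)) for _ in range(num_rollouts))
    return wins / num_rollouts


def per_step_advantages(
    steps: Sequence[str],
    rollout: Callable[[Sequence[str]], str],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 8,
) -> List[float]:
    """Advantage of step i = value(prefix with step i) - value(prefix without it)."""
    values = [
        estimate_value(steps[:i], rollout, is_correct, num_rollouts)
        for i in range(len(steps) + 1)
    ]
    return [values[i + 1] - values[i] for i in range(len(steps))]
```

Under this assumed estimator, a strongly negative advantage marks the step at which a solution goes wrong, while steps with advantage near zero are not penalized merely for appearing in an incorrect response.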
Created on 01 Jul. 2024

