Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning

AI-generated keywords: Machine Learning

AI-generated Key Points

  • Researchers investigate the use of large language model (LLM)-generated data for fine-tuning and its impact on cross-domain generalization.
  • Fine-tuning on LLM-generated data improves target task performance and reduces degradation on non-target tasks compared with fine-tuning on ground truth data.
  • LLM-generated sequences contain fewer high-perplexity tokens, which the authors identify as the source of the improved non-target robustness. Masking high-perplexity tokens in ground truth data preserves non-target performance to a comparable degree, and the findings hold across model families and scales.
  • Two methods for generating training data - Self-Output and Rephrase - are explored for two target tasks, with evaluation on five non-target tasks.
  • Detailed information is provided on original datasets used for data generation and methodology for constructing self-generated training datasets using language models.
  • The research offers valuable insights into leveraging LLM-generated data for fine-tuning to improve model performance across domains and mitigate forgetting issues in machine learning applications.
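
The observation about high-perplexity tokens can be illustrated with a minimal sketch. It assumes per-token natural-log probabilities are already available from some language model. The log-probability values, the threshold, and the function names below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def token_perplexities(token_logprobs):
    """Per-token perplexity, exp(-log p), from natural-log token probabilities."""
    return [math.exp(-lp) for lp in token_logprobs]


def high_ppl_fraction(token_logprobs, threshold):
    """Fraction of tokens whose perplexity exceeds `threshold`."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    ppls = token_perplexities(token_logprobs)
    return sum(p > threshold for p in ppls) / len(ppls)


# Made-up log-probabilities for two answers to the same prompt (illustration only).
ground_truth_logprobs = [-0.1, -3.2, -0.4, -4.0, -0.2]
self_output_logprobs = [-0.1, -0.3, -0.4, -0.9, -0.2]

THRESHOLD = 10.0  # hypothetical cut-off, not a value reported by the authors

gt_fraction = high_ppl_fraction(ground_truth_logprobs, THRESHOLD)
so_fraction = high_ppl_fraction(self_output_logprobs, THRESHOLD)
print(f"ground truth: {gt_fraction:.2f}, self-output: {so_fraction:.2f}")
```

In this toy input, two of the five ground truth tokens exceed the threshold and none of the self-output tokens do, which mirrors the qualitative pattern the key points describe.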

Authors: Chao-Chung Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Shao-Hua Sun, Hung-yi Lee

Accepted to NeurIPS 2025
License: CC BY-SA 4.0

Abstract: Maintaining consistent model performance across domains is a fundamental challenge in machine learning. While recent work has explored using LLM-generated data for fine-tuning, its impact on cross-domain generalization remains poorly understood. This paper presents a systematic analysis revealing that fine-tuning with LLM-generated data not only improves target task performance but also reduces non-target task degradation compared to fine-tuning with ground truth data. Through analyzing the data sequence in tasks of various domains, we demonstrate that this enhancement of non-target task robustness stems from the reduction of high perplexity tokens found in LLM-generated sequences. Following our findings, we showed that masking high perplexity tokens in ground truth training data achieves similar non-target task performance preservation, comparable to using LLM-generated data. Extensive experiments across different model families and scales, including Gemma 2 IT 2B, Llama 3 8B Instruct, and three additional models, agree with our findings. To the best of our knowledge, this is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning, offering valuable insights for developing more robust fine-tuning strategies.
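
The masking idea mentioned in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes per-token natural-log probabilities of the ground truth tokens are available, uses a hypothetical perplexity threshold, and excludes tokens above that threshold from the mean negative log-likelihood. The names `build_loss_mask` and `masked_mean_nll` are invented for this example.

```python
import math


def build_loss_mask(token_logprobs, ppl_threshold):
    """Return 1 for tokens kept in the loss and 0 for high-perplexity tokens."""
    return [1 if math.exp(-lp) <= ppl_threshold else 0 for lp in token_logprobs]


def masked_mean_nll(token_logprobs, mask):
    """Mean negative log-likelihood over the tokens where mask == 1."""
    kept = [-lp for lp, m in zip(token_logprobs, mask) if m == 1]
    if not kept:
        return 0.0  # every token was masked, so nothing contributes to the loss
    return sum(kept) / len(kept)


# Made-up log-probabilities of ground truth tokens (illustration only).
logprobs = [-0.1, -3.2, -0.4, -4.0, -0.2]
PPL_THRESHOLD = 10.0  # hypothetical value, not taken from the paper

mask = build_loss_mask(logprobs, PPL_THRESHOLD)
unmasked_loss = masked_mean_nll(logprobs, [1] * len(logprobs))
masked_loss = masked_mean_nll(logprobs, mask)
print(mask, round(unmasked_loss, 3), round(masked_loss, 3))
```

In an actual fine-tuning loop the same mask would typically be applied to the per-token cross-entropy before averaging, so that masked tokens produce no gradient. How the threshold is chosen, and which model scores the tokens, is not specified in the text above.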

Submitted to arXiv on 24 Jan. 2025


Results of the summarizing process for the arXiv paper: 2501.14315v6

In this study, the researchers investigate the use of large language model (LLM)-generated data for fine-tuning and its impact on cross-domain generalization. Their systematic analysis shows that fine-tuning on LLM-generated data not only improves target task performance but also reduces degradation on non-target tasks compared with fine-tuning on ground truth data. They attribute this robustness to LLM-generated sequences containing fewer high-perplexity tokens, and they show that masking high-perplexity tokens in ground truth training data preserves non-target task performance to a comparable degree. Experiments across different model families and scales, including Gemma 2 IT 2B and Llama 3 8B Instruct, agree with these findings. The study also explores two methods for generating training data - Self-Output and Rephrase - for two target tasks and evaluates them on five non-target tasks. Additionally, it details the original datasets used for data generation and the methodology for constructing self-generated training datasets with language models. Overall, the work offers an empirical explanation, based on token perplexity reduction, of how to mitigate catastrophic forgetting in LLMs after fine-tuning.
Created on 30 Oct. 2025

