How to Train Long-Context Language Models (Effectively)

AI-generated keywords: Long-Context Language Models

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Studies continued training and supervised fine-tuning (SFT) of language models (LMs) to make effective use of long-context information
  • Establishes a reliable evaluation protocol: a broad set of long-context tasks, evaluated after SFT with instruction data
  • Experiments determine the data mix for continued pre-training and the instruction tuning dataset; code repositories and books are excellent sources of long data when combined with high-quality short data
  • Training with a sequence length beyond the evaluation length boosts long-context performance
  • Using only short instruction datasets during SFT yields strong performance on long-context tasks
  • ProLong-8B achieves state-of-the-art long-context performance among similarly sized models at a length of 128K tokens and can effectively process up to 512K tokens

Authors: Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen

Our code, data, and models are available at https://github.com/princeton-nlp/ProLong

Abstract: We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development: instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
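
Findings (1) and (2) in the abstract can be illustrated with a minimal sketch of a data-mixture sampler and a training-length setting. This is not the authors' pipeline: the abstract names the data sources but gives no ratios, so the weights in `MIXTURE`, the 4x factor in `TRAIN_LEN`, and the helper names `sample_sources` and `long_fraction` are all hypothetical, and "128K" is assumed to mean 131,072 tokens.

```python
import random
from collections import Counter

# Hypothetical weights: the abstract names these sources but gives no ratios.
MIXTURE = {
    "code_repos": 0.3,          # long documents
    "books": 0.3,               # long documents
    "short_high_quality": 0.4,  # short documents
}

EVAL_LEN = 128 * 1024     # evaluation length from the abstract, assumed to be 131,072 tokens
TRAIN_LEN = 4 * EVAL_LEN  # hypothetical factor; finding (2) only needs TRAIN_LEN > EVAL_LEN


def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def long_fraction(draws, long_sources=("code_repos", "books")):
    """Return the share of draws that came from long-document sources."""
    counts = Counter(draws)
    return sum(counts[s] for s in long_sources) / len(draws)


draws = sample_sources(MIXTURE, n=10_000)
print(f"long-data share of draws: {long_fraction(draws):.2f}")
print(f"train/eval length ratio: {TRAIN_LEN // EVAL_LEN}")
```

Keeping the weights in one dictionary makes it easy to vary the long/short balance between runs, which is the kind of comparison the abstract describes; the actual ratios would have to come from the paper or its code release.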

Submitted to arXiv on 03 Oct. 2024

Results of the summarizing process for the arXiv paper: 2410.02660v1

This paper's license doesn't allow us to build upon its content and the summarizing process is here made with the paper's metadata rather than the article.

In "How to Train Long-Context Language Models (Effectively)," Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen study continued training and supervised fine-tuning (SFT) of a language model (LM) so that it makes effective use of long-context information. They first establish a reliable evaluation protocol: instead of perplexity or simple needle-in-a-haystack (NIAH) tests, they use a broad set of long-context tasks, and they evaluate models after SFT with instruction data because this better reveals long-context abilities. Guided by these evaluations, they run experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and other design choices. Three findings stand out. First, code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data. Second, training with a sequence length beyond the evaluation length boosts long-context performance. Third, for SFT, using only short instruction datasets yields strong performance on long-context tasks.

The resulting model, ProLong-8B, is initialized from Llama-3 and trained on 40B tokens. It achieves state-of-the-art long-context performance among similarly sized models at a length of 128K and outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training. ProLong can also effectively process up to 512K tokens, one of the longest context windows among publicly available LMs.
Created on 06 Oct. 2024

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.