Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon

AI-generated keywords: Activation Beacon, Long Contexts, Large Language Models, Efficient Training, Superior Performance

AI-generated Key Points

  • Activation Beacon is a method proposed to address the challenge of utilizing long contexts in large language models (LLMs) with limited context window length.
  • Activation Beacon is introduced as a plug-and-play module for LLMs that condenses raw activations into more compact forms, allowing them to perceive longer contexts within the limited window.
  • The module fully preserves the LLM's original capability on short contexts while extending its capability on processing longer contexts.
  • Activation Beacon achieves competitive memory and time efficiency in both training and inference by working with short sliding windows to process the long context.
  • The module is learned through an auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. It can be trained efficiently on short-sequence data alone in just 10K steps, which takes less than 9 hours on a single 8xA800 GPU machine.
  • Experimental studies show that Activation Beacon extends Llama-2-7B's context length by 100 times (from 4K to 400K) and achieves superior results on both long-context generation and understanding tasks, performing on par with fine-tuned full-attention baselines.
  • Activation Beacon is further evaluated on five real-world tasks from LongBench: single-doc QA, multi-doc QA, summarization, few-shot learning, and code completion. It achieves performance similar to that of fine-tuned methods such as LongChat-32K and LongAlpaca-16K.
  • Experiments on long-context language modeling using three datasets (PG19, Proof-Pile, CodeParrot) show that Activation Beacon achieves better long-context language modeling performance than both baseline methods and fine-tuning-free methods.
  • Overall, Activation Beacon proves to be an effective solution for extending the context length of LLMs without sacrificing their original capabilities. The model and code are available at the BGE repository.
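
The 100x figure in the key points can be read as simple arithmetic. The sketch below assumes a simplified model that is not stated in the source: every slot in the fixed window holds one condensed activation, and each condensed activation stands in for `condensing_ratio` raw tokens. The function name and the round value of 4,000 for "4K" are illustrative assumptions.

```python
def effective_context(window_tokens: int, condensing_ratio: int) -> int:
    """Raw tokens covered when each window slot holds one condensed
    activation that summarizes `condensing_ratio` raw tokens.

    This is an illustrative model, not the paper's exact accounting.
    """
    if window_tokens <= 0 or condensing_ratio <= 0:
        raise ValueError("window_tokens and condensing_ratio must be positive")
    return window_tokens * condensing_ratio


# "4K" is taken as 4,000 for round numbers; the source reports 4K -> 400K.
print(effective_context(4_000, 100))  # 400000
```

Under this model, a condensing ratio of 100 is what it would take for a 4K window to cover 400K raw tokens.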

Authors: Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou

License: CC BY 4.0

Abstract: The utilization of long contexts poses a big challenge for large language models due to their limited context window length. Although the context window can be extended through fine-tuning, it will result in a considerable cost at both training and inference time, and exert an unfavorable impact on the LLM's original capabilities. In this work, we propose Activation Beacon, which condenses the LLM's raw activations into more compact forms such that it can perceive a much longer context with a limited context window. Activation Beacon is introduced as a plug-and-play module for the LLM. It fully preserves the LLM's original capability on short contexts while extending the new capability on processing longer contexts. Besides, it works with short sliding windows to process the long context, which achieves a competitive memory and time efficiency in both training and inference. Activation Beacon is learned by the auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. Thanks to such a treatment, it can be efficiently trained purely with short-sequence data in just 10K steps, which consumes less than 9 hours on a single 8xA800 GPU machine. The experimental studies show that Activation Beacon is able to extend Llama-2-7B's context length by $100\times$ (from 4K to 400K), meanwhile achieving a superior result on both long-context generation and understanding tasks. Our model and code will be available at the BGE repository.
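
The idea of processing a long context with short sliding windows while keeping only condensed activations can be sketched in a few lines. This is a toy illustration, not the authors' implementation: activations are plain floats, and mean pooling stands in for the learned condensing step. The names `condense` and `process_long_context` are hypothetical.

```python
from typing import List


def condense(chunk: List[float], ratio: int) -> List[float]:
    """Stand-in condenser: average every `ratio` values into one.

    Mean pooling is an assumption for illustration only; the paper
    learns the condensing step rather than pooling.
    """
    groups = [chunk[i:i + ratio] for i in range(0, len(chunk), ratio)]
    return [sum(g) / len(g) for g in groups]


def process_long_context(activations: List[float], window: int, ratio: int) -> List[float]:
    """Walk the sequence in short windows, keeping only condensed memory."""
    memory: List[float] = []
    for start in range(0, len(activations), window):
        chunk = activations[start:start + window]
        # A real model would attend over `memory + chunk` at this point.
        memory.extend(condense(chunk, ratio))
    return memory


memory = process_long_context([float(i) for i in range(32)], window=8, ratio=4)
print(len(memory))  # 8 condensed values stand in for 32 raw ones
```

In this sketch the model never holds more than one short window of raw activations at a time, and the carried-over memory grows at 1/`ratio` of the raw rate.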

Submitted to arXiv on 07 Jan. 2024


Results of the summarizing process for the arXiv paper: 2401.03462v1

The authors propose a method called Activation Beacon to address the challenge of utilizing long contexts in large language models (LLMs) with limited context window length.

Activation Beacon is introduced as a plug-and-play module for LLMs that condenses raw activations into more compact forms, allowing them to perceive longer contexts within the limited window. This module fully preserves the LLM's original capability on short contexts while extending its capability on processing longer contexts. It achieves competitive memory and time efficiency in both training and inference by working with short sliding windows to process the long context.

Activation Beacon is learned through an auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. The module can be trained efficiently on short-sequence data alone in just 10K steps, which takes less than 9 hours on a single 8xA800 GPU machine.

Experimental studies show that Activation Beacon is able to extend Llama-2-7B's context length by 100 times (from 4K to 400K), while achieving superior results on both long-context generation and understanding tasks. The method performs on par with fine-tuned full-attention baselines.

The authors further evaluate Activation Beacon on five real-world tasks from LongBench: single-doc QA, multi-doc QA, summarization, few-shot learning, and code completion. The results are reported in Table 3, showing that Activation Beacon achieves performance similar to that of fine-tuned methods such as LongChat-32K and LongAlpaca-16K.

Additionally, the authors conduct experiments on long-context language modeling using three datasets: PG19, Proof-Pile, and CodeParrot. The perplexity results are reported in Table 2, showing that Activation Beacon achieves better long-context language modeling performance than both baseline methods and fine-tuning-free methods.

Overall, Activation Beacon proves to be an effective solution for extending the context length of LLMs without sacrificing their original capabilities. The model and code are available at the BGE repository.
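
Training on "a mixture of beacons with diversified condensing ratios" can be pictured as drawing a different ratio for each training window. The sketch below is a hedged illustration only: the candidate ratios, the uniform sampling, the 1,024-token window, and the helper names are all assumptions, not values taken from the source.

```python
import random
from typing import List

# Hypothetical candidate ratios; the source does not list the actual values.
CANDIDATE_RATIOS: List[int] = [2, 4, 8, 16, 32]


def sample_ratio(rng: random.Random) -> int:
    """Draw one condensing ratio uniformly (uniform choice is an assumption)."""
    return rng.choice(CANDIDATE_RATIOS)


def beacons_needed(window_tokens: int, ratio: int) -> int:
    """Condensed slots needed to summarize one window (ceiling division)."""
    return -(-window_tokens // ratio)


rng = random.Random(0)  # seeded so the run is repeatable
for step in range(3):
    ratio = sample_ratio(rng)
    print(f"step={step} ratio={ratio} beacons={beacons_needed(1024, ratio)}")
```

Varying the ratio from step to step exposes the module to several compression levels even though every training sequence is short, which is one plausible reading of how short-sequence training could transfer to much longer contexts.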
Created on 30 Jan. 2024

