Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon
AI-generated Key Points
- Activation Beacon is a method proposed to address the challenge of utilizing long contexts in large language models (LLMs) with limited context window length.
- Activation Beacon is introduced as a plug-and-play module for LLMs that condenses raw activations into more compact forms, allowing them to perceive longer contexts within the limited window.
- The module fully preserves the LLM's original capability on short contexts while extending its capability on processing longer contexts.
- Activation Beacon achieves competitive memory and time efficiency in both training and inference by working with short sliding windows to process the long context.
- The module is learned through an auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. It can be trained efficiently on short-sequence data alone in just 10K steps, which takes less than 9 hours on a single 8xA800 GPU machine.
- Experimental studies show that Activation Beacon extends Llama-2-7B's context length by 100 times (from 4K to 400K) and achieves superior results on both long-context generation and understanding tasks compared to fine-tuned full-attention baselines.
- Activation Beacon is further evaluated on five real-world tasks from LongBench: single-doc QA, multi-doc QA, summarization, few-shot learning, and code completion. Its performance is comparable to that of fine-tuned methods such as LongChat-32K and LongAlpaca-16K.
- Experiments on long-context language modeling with three datasets (PG19, Proof-Pile, CodeParrot) show that Activation Beacon outperforms both the baseline methods and the fine-tuning-free methods.
- Overall, Activation Beacon proves to be an effective solution for extending the context length of LLMs without sacrificing their original capabilities. The model and code are available at the BGE repository.
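
The "mixture of beacons with diversified condensing ratios" mentioned in the key points can be pictured with a small sketch. This is only an illustration, not the authors' training code: the candidate ratio set, the window length, and the function name `sample_beacon_plan` are all assumptions introduced here. The idea shown is that each sliding window of a short training sequence gets a randomly chosen condensing ratio, which fixes how many condensed activations that window contributes.

```python
import random

# Hypothetical candidate condensing ratios; the summary does not list them.
CANDIDATE_RATIOS = (2, 4, 8, 16, 32, 64, 128)


def sample_beacon_plan(seq_len, window_len, ratios=CANDIDATE_RATIOS, rng=None):
    """Split a sequence into windows and draw one condensing ratio per window.

    Returns a list of (start, length, ratio, num_beacons) tuples, where
    num_beacons is the number of condensed activations kept for that window.
    """
    if rng is None:
        rng = random.Random()
    plan = []
    for start in range(0, seq_len, window_len):
        length = min(window_len, seq_len - start)
        ratio = rng.choice(ratios)
        # Keep at least one condensed activation, even for a very short window.
        num_beacons = max(1, length // ratio)
        plan.append((start, length, ratio, num_beacons))
    return plan


if __name__ == "__main__":
    for window in sample_beacon_plan(4096, 1024, rng=random.Random(0)):
        print(window)
```

Under these assumptions, varying the ratio from window to window exposes the model to many compression levels during training, even though every training sequence is short.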
Authors: Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou
Abstract: The utilization of long contexts poses a big challenge for large language models due to their limited context window length. Although the context window can be extended through fine-tuning, this incurs a considerable cost at both training and inference time and has an unfavorable impact on the LLM's original capabilities. In this work, we propose Activation Beacon, which condenses the LLM's raw activations into more compact forms so that it can perceive a much longer context with a limited context window. Activation Beacon is introduced as a plug-and-play module for the LLM. It fully preserves the LLM's original capability on short contexts while adding the new capability of processing longer contexts. In addition, it works with short sliding windows to process the long context, which achieves competitive memory and time efficiency in both training and inference. Activation Beacon is learned by the auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. Thanks to this treatment, it can be efficiently trained purely with short-sequence data in just 10K steps, which consumes less than 9 hours on a single 8xA800 GPU machine. The experimental studies show that Activation Beacon is able to extend Llama-2-7B's context length by $100\times$ (from 4K to 400K), while achieving a superior result on both long-context generation and understanding tasks. Our model and code will be available at the BGE repository.
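
To make the 4K-to-400K figure concrete, here is a back-of-the-envelope sketch of how condensing activations stretches a fixed context window. It is a simplified model, not the paper's implementation: the sliding-window length (1024) and the condensing ratio (128) are assumed values chosen for illustration, and the function names are hypothetical. The model assumes that the current window is kept in raw form while every earlier window is stored as `window_len / ratio` condensed activations.

```python
import math


def max_perceivable_context(context_window, window_len, ratio):
    """Raw tokens coverable when all but the current window are condensed.

    Slots not used by the current raw window hold condensed activations,
    and each condensed activation stands in for `ratio` raw tokens.
    """
    if window_len > context_window:
        raise ValueError("window_len must not exceed context_window")
    condensed_slots = context_window - window_len
    return condensed_slots * ratio + window_len


def peak_cache_size(total_tokens, window_len, ratio):
    """Simulate the sliding windows and return the largest number of
    activations (condensed plus raw) that are held at the same time."""
    condensed = 0
    peak = 0
    remaining = total_tokens
    while remaining > 0:
        chunk = min(window_len, remaining)
        peak = max(peak, condensed + chunk)      # current window is raw
        condensed += math.ceil(chunk / ratio)    # then it is condensed
        remaining -= chunk
    return peak


if __name__ == "__main__":
    limit = max_perceivable_context(4096, 1024, 128)
    print(limit)                              # raw tokens covered
    print(peak_cache_size(limit, 1024, 128))  # activations held at peak
```

With these assumed settings, a 4096-slot window covers 394,240 raw tokens, roughly the 400K quoted in the abstract, while the simulated cache never holds more than 4096 activations.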