LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

AI-generated keywords: Task-agnostic prompt compression, Generalizability, Efficiency, Data distillation, Transformer encoder

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Task-agnostic prompt compression for enhancing generalizability and efficiency
  • Novel data distillation procedure that derives knowledge from a large language model (LLM) for prompt compression
  • Formulation of prompt compression as a token classification problem using a Transformer encoder
  • Lower latency and explicit learning of compression objective with smaller models like XLM-RoBERTa-large and mBERT
  • Evaluation on in-domain and out-of-domain datasets (MeetingBank, LongBench, ZeroScrolls, GSM8K, BBH) showing significant performance gains over strong baselines
  • Robust generalization across different LLMs
  • 3x-6x faster than existing prompt compression methods, with end-to-end latency accelerated by 1.6x-2.9x
  • Compression ratios of 2x-5x
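
The token classification formulation in the key points above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `compress_by_token_labels`, the `keep_rate` parameter, and the hand-written `probs` values are all assumptions made for illustration. In a real system the per-token "preserve" probabilities would come from a trained Transformer encoder; here they are supplied directly so that the selection step is visible on its own.

```python
from typing import List, Sequence


def compress_by_token_labels(
    tokens: Sequence[str],
    keep_probs: Sequence[float],
    keep_rate: float,
) -> List[str]:
    """Keep the highest-scoring tokens, in their original order.

    keep_probs are hypothetical per-token "preserve" probabilities; in a real
    system they would come from a trained token classifier.
    """
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")

    # Number of tokens to retain for the requested rate (at least one).
    budget = max(1, round(len(tokens) * keep_rate))

    # Rank positions by score; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-keep_probs[i], i))
    kept_positions = sorted(ranked[:budget])

    # Extractive output: a subsequence of the input, never new words.
    return [tokens[i] for i in kept_positions]


tokens = "the committee voted to approve the budget for next year".split()
probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.95, 0.3, 0.6, 0.7]  # made-up scores
print(compress_by_token_labels(tokens, probs, keep_rate=0.5))
# ['committee', 'voted', 'approve', 'budget', 'year']
```

Because the output is always a subsequence of the input, the compressed prompt cannot contain words that were absent from the original, which is one way to read the faithfulness claim.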

Authors: Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang

Abstract: This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meantime, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.
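
To make limitation (i) from the abstract concrete, here is a toy sketch of entropy-style scoring. It is an illustration under stated assumptions, not the baseline used in the paper: a smoothed bigram model over a three-sentence corpus stands in for a causal language model such as LLaMa-7B, and the names `self_information` and `drop_predictable` are hypothetical. The point it shows is that a token's score depends only on what precedes it.

```python
import math
from collections import Counter
from typing import List, Sequence

# Toy corpus standing in for the causal LM's training data (illustrative only).
CORPUS = [
    "the meeting started at noon".split(),
    "the meeting ended at five".split(),
    "the budget was approved".split(),
]

context_counts: Counter = Counter()
bigram_counts: Counter = Counter()
for sentence in CORPUS:
    padded = ["<s>"] + sentence
    for prev, cur in zip(padded, padded[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, cur)] += 1
VOCAB_SIZE = len({tok for sentence in CORPUS for tok in sentence})


def self_information(tokens: Sequence[str]) -> List[float]:
    """Return -log2 p(token | previous token), with add-one smoothing."""
    padded = ["<s>"] + list(tokens)
    scores = []
    for prev, cur in zip(padded, padded[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + VOCAB_SIZE)
        scores.append(-math.log2(p))
    return scores


def drop_predictable(tokens: Sequence[str], keep: int) -> List[str]:
    """Keep the `keep` most surprising tokens, in their original order."""
    scores = self_information(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    return [tokens[i] for i in sorted(ranked[:keep])]


a = "the meeting started late".split()
b = "the meeting started at noon".split()

# The first three scores match: each depends only on the left context,
# so nothing to the right of a token can change its score.
print(self_information(a)[:3] == self_information(b)[:3])  # True
print(drop_predictable(b, keep=3))
```

A bidirectional encoder, by contrast, can score each token using the words on both sides of it, which is the motivation the abstract gives for the Transformer encoder.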

Submitted to arXiv on 19 Mar. 2024

Results of the summarizing process for the arXiv paper: 2403.12968v1

This paper by Zhuoshi Pan et al. addresses task-agnostic prompt compression, with the goal of better generalizability and efficiency. Natural language is redundant, and existing methods exploit this by removing tokens or lexical units according to their information entropy, as estimated by a causal language model such as LLaMa-7B. The authors argue that information entropy may be a suboptimal compression metric for two reasons: it uses only unidirectional context, so it may miss information that is essential for compression, and it is not aligned with the prompt compression objective.

To address both issues, the authors propose a data distillation procedure that derives knowledge from a large language model (LLM) on how to compress prompts without losing crucial information, and they introduce an extractive text compression dataset. They formulate prompt compression as a token classification problem, which keeps the compressed prompt faithful to the original, and they use a Transformer encoder as the base architecture so that the full bidirectional context informs the compression. Because the compression objective is learned explicitly, smaller models such as XLM-RoBERTa-large and mBERT can be used, which lowers latency.

The method is evaluated on in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, the model shows significant performance gains over strong baselines and generalizes robustly across different LLMs. It is also 3x-6x faster than existing prompt compression methods and accelerates end-to-end latency by 1.6x-2.9x at compression ratios of 2x-5x.

In conclusion, the paper presents an approach to efficient and faithful task-agnostic prompt compression that improves performance over strong baselines while running faster than existing methods.
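
As a rough guide to what the reported compression ratios mean in token counts, the short sketch below assumes that a ratio of Nx is defined as original length divided by compressed length. That definition and the 1000-token prompt are assumptions for illustration; the helper name `compressed_length` is hypothetical.

```python
def compressed_length(original_tokens: int, ratio: float) -> int:
    """Tokens left after compression, assuming ratio = original / compressed."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    return round(original_tokens / ratio)


for ratio in (2, 5):
    kept = compressed_length(1000, ratio)
    print(f"{ratio}x: {kept} tokens kept ({kept / 1000:.0%} of the prompt)")
# 2x: 500 tokens kept (50% of the prompt)
# 5x: 200 tokens kept (20% of the prompt)
```
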
Created on 31 Mar. 2024
