Long Context vs. RAG for LLMs: An Evaluation and Revisits

AI-generated keywords: Large Language Models, Retrieval-Augmented Generation, RAPTOR, Context Windows, External Knowledge Sources

AI-generated Key Points

  • Growing interest in enhancing capabilities of Large Language Models (LLMs) with long external contexts
  • Two main strategies: Long Context (LC) and Retrieval-Augmented Generation (RAG)
  • Notable advancement in retrieval methods: RAPTOR improves accuracy by generating recursive summaries in a tree structure
  • Various LLMs excel in specialized areas such as reasoning efficiency, conversational understanding, text summarization, knowledge understanding, multilingual translation, mathematical computations, and logical reasoning
  • Trend towards increasing context length in newly released models categorized as short (up to 4K), long (up to 32K), and ultra-long (more than 32K) context models
  • Advancements offer potential for handling complex questions requiring information synthesis from multiple document parts
  • Importance of considering context relevance when optimizing LLMs with external knowledge sources
  • Need for tailored approaches based on specific task requirements and further research for more effective utilization of external knowledge sources
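
The three context-length categories listed in the key points can be written as a small helper. This is a minimal sketch: the function name `classify_context_window` is hypothetical, and it assumes that "4K" and "32K" mean 4 × 1024 and 32 × 1024 tokens and that "up to" is inclusive, neither of which the summary states.

```python
def classify_context_window(max_tokens: int) -> str:
    """Bucket a model by its supported context window, measured in tokens."""
    k = 1024  # assumption: "K" denotes 1024 tokens
    if max_tokens <= 4 * k:
        return "short"       # up to 4K
    if max_tokens <= 32 * k:
        return "long"        # more than 4K, up to 32K
    return "ultra-long"      # more than 32K
```

Under these assumptions a model with an 8K window falls in the "long" bucket, and a 32K model sits exactly on the upper boundary of "long".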

Authors: Xinze Li, Yixin Cao, Yubo Ma, Aixin Sun

14 pages excluding references and appendix
License: CC BY 4.0

Abstract: Extending context windows (i.e., Long Context, LC) and using retrievers to selectively access relevant information (i.e., Retrieval-Augmented Generation, RAG) are the two main strategies to enable LLMs to incorporate extremely long external contexts. This paper revisits recent studies on this topic, highlighting their key insights and discrepancies. We then provide a more comprehensive evaluation by filtering out questions answerable without external context, identifying the most effective retrieval methods, and expanding the datasets. We show that LC generally outperforms RAG in question-answering benchmarks, especially for Wikipedia-based questions. Summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind. However, RAG has advantages in dialogue-based and general question queries. These insights underscore the trade-offs between RAG and LC strategies, offering guidance for future optimization of LLMs with external knowledge sources. We also provide an in-depth discussion on this topic, highlighting the overlooked importance of context relevance in existing studies.
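
The filtering step mentioned in the abstract, which removes questions answerable without external context, could look roughly like the sketch below. It is an illustration, not the authors' procedure: `filter_context_dependent`, `normalize`, and the `closed_book_answer` callable are hypothetical names, and exact match after normalization is an assumed correctness criterion.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def filter_context_dependent(
    questions: List[Dict[str, str]],
    closed_book_answer: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep only questions the model gets wrong when given no context.

    `closed_book_answer` is a hypothetical callable that returns the model's
    answer to a question with no document attached.
    """
    kept = []
    for item in questions:
        prediction = closed_book_answer(item["question"])
        # Assumption: a question counts as answerable without context when
        # the closed-book prediction exactly matches the gold answer.
        if normalize(prediction) != normalize(item["answer"]):
            kept.append(item)
    return kept
```

Only the questions that survive this filter are then useful for comparing LC and RAG, because any difference in accuracy on them has to come from how the external context is supplied.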

Submitted to arXiv on 27 Dec. 2024

Results of the summarizing process for the arXiv paper: 2501.01880v1

In recent years, there has been growing interest in enabling Large Language Models (LLMs) to incorporate extremely long external contexts. Two main strategies have emerged: extending context windows, known as Long Context (LC), and using retrievers to selectively access relevant information, known as Retrieval-Augmented Generation (RAG). The paper revisits recent studies on this topic, highlighting their key insights and discrepancies.
One notable advancement in retrieval methods is RAPTOR (Sarthi et al., 2024), which improves accuracy by generating recursive summaries of text chunks organized in a tree structure. By summarizing text segments at successive levels, RAPTOR forms a hierarchical tree that represents the document's content and lets a retriever extract context at varying levels of detail. This enhances retrieval accuracy for tasks requiring long-range or multi-step reasoning.
Among LLMs with extended context capabilities, different models excel in specialized areas. For instance, ChatGLM2-6B-32K focuses on high reasoning efficiency with low memory usage, while XGen-7B-8K enhances conversational understanding and text summarization. InternLM-7B-8K is optimized for knowledge understanding and multilingual translation, while other models such as DeepSeek-V2-Chat, Qwen2-72B-Instruct, Mixtral-8x7B, and DBRX-Instruct excel in mathematical computations and logical reasoning.
There is a clear trend towards increasing context length in newly released models. Models are categorized by their supported context windows: short (up to 4K), long (up to 32K), and ultra-long (more than 32K). Extended context capabilities offer significant potential for handling complex questions that require synthesizing information from multiple parts of a document.
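
The recursive-summary tree attributed to RAPTOR above can be illustrated with a simplified sketch. This is not the RAPTOR implementation: grouping adjacent nodes in fixed-size groups stands in for its clustering step, word overlap stands in for embedding similarity, and `Node`, `build_tree`, `retrieve`, and the `summarize` callable are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    level: int  # 0 for raw chunks, higher for summaries of summaries
    children: List["Node"] = field(default_factory=list)


def build_tree(
    chunks: List[str],
    summarize: Callable[[List[str]], str],
    group_size: int = 2,
) -> List[Node]:
    """Build summary layers bottom-up and return every node in the tree."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so layers shrink")
    layer = [Node(text=c, level=0) for c in chunks]
    all_nodes = list(layer)
    while len(layer) > 1:
        next_layer = []
        # Assumption: adjacent nodes are grouped; a clustering step would
        # normally decide which nodes are summarized together.
        for i in range(0, len(layer), group_size):
            group = layer[i:i + group_size]
            summary = summarize([n.text for n in group])
            next_layer.append(
                Node(text=summary, level=group[0].level + 1, children=group)
            )
        all_nodes.extend(next_layer)
        layer = next_layer
    return all_nodes


def retrieve(nodes: List[Node], query: str, top_k: int = 2) -> List[Node]:
    """Rank nodes from every level by word overlap with the query."""
    query_words = set(query.lower().split())

    def overlap(node: Node) -> int:
        return len(query_words & set(node.text.lower().split()))

    return sorted(nodes, key=overlap, reverse=True)[:top_k]
```

Because `retrieve` searches leaves and summaries together, a narrow query can match a single chunk while a broad one can match a higher-level summary, which is the "varying levels of detail" idea in the paragraph above. In practice `summarize` would call an LLM; any function that maps a list of strings to one string works for experimenting with the structure.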
In conclusion, the trade-offs between RAG and LC strategies underscore the importance of considering context relevance when optimizing LLMs with external knowledge sources. The diverse capabilities of different LLMs highlight the need for tailored approaches based on specific task requirements. Further research in this area can lead to more effective use of external knowledge sources for enhancing LLM performance across various applications.
Created on 16 Jan. 2025
