MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
AI-generated Key Points
- Authors address limitations of existing RAG methods in answering multi-hop queries
- Introduce a novel dataset called MultiHop-RAG, consisting of a knowledge base, multi-hop queries, ground-truth answers, and supporting evidence
- Conduct two experiments: comparing different embedding models for retrieving evidence and examining capabilities of various language models in reasoning and answering multi-hop queries
- Existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries
- Categorize multi-hop queries into four types: Inference query, Comparison query, Temporal query, and Null query
- Construct the RAG knowledge base from news articles, extracting factual sentences as evidence and using GPT-4 to generate claims, with disambiguated topics and entities serving as bridges
- MultiHop-RAG provides a resource for developing effective RAG systems and evaluating language models' reasoning capabilities
- Dataset and implemented RAG system are publicly available for the community to access.
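The dataset components and query categories listed above can be pictured as a simple record type. The following is a minimal sketch only: the class names, field names, and the `is_multi_hop` helper are hypothetical and are not taken from the released dataset, whose actual schema may differ. The example values are placeholders, not real dataset entries.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class QueryType(Enum):
    """The four query categories named in the key points."""
    INFERENCE = "inference"
    COMPARISON = "comparison"
    TEMPORAL = "temporal"
    NULL = "null"


@dataclass
class MultiHopExample:
    """Hypothetical record: one query with its answer and supporting evidence."""
    query: str
    answer: str
    query_type: QueryType
    evidence: List[str] = field(default_factory=list)

    def is_multi_hop(self) -> bool:
        # Assumption: a multi-hop query is backed by at least two evidence pieces.
        return len(self.evidence) >= 2


# Placeholder values for illustration only.
example = MultiHopExample(
    query="<placeholder comparison question>",
    answer="<placeholder answer>",
    query_type=QueryType.COMPARISON,
    evidence=["<evidence sentence A>", "<evidence sentence B>"],
)
print(example.query_type.value, example.is_multi_hop())
```

Keeping the evidence as a list on each record makes it straightforward to check later whether a retriever returned every piece a query depends on.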
Authors: Yixuan Tang, Yi Yang
Abstract: Retrieval-augmented generation (RAG) augments large language models (LLMs) by retrieving relevant knowledge, showing promising potential in mitigating LLM hallucinations and enhancing response quality, thereby facilitating greater adoption of LLMs in practice. However, we find that existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence. Furthermore, to our knowledge, no existing RAG benchmarking dataset focuses on multi-hop queries. In this paper, we develop a novel dataset, MultiHop-RAG, which consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. We detail the procedure of building the dataset, utilizing an English news article dataset as the underlying RAG knowledge base. We demonstrate the benchmarking utility of MultiHop-RAG in two experiments. The first experiment compares different embedding models for retrieving evidence for multi-hop queries. In the second experiment, we examine the capabilities of various state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, in reasoning and answering multi-hop queries given the evidence. Both experiments reveal that existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries. We hope MultiHop-RAG will be a valuable resource for the community in developing effective RAG systems, thereby facilitating greater adoption of LLMs in practice. The MultiHop-RAG dataset and the implemented RAG system are publicly available at https://github.com/yixuantt/MultiHop-RAG/.
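To make the first experiment concrete, the sketch below scores a ranked retrieval result against the set of evidence a multi-hop query requires. It is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not name its metrics, so recall at k and the stricter "all evidence retrieved" check are stand-ins, and the function names and chunk IDs are hypothetical.

```python
from typing import Sequence, Set


def evidence_recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Fraction of the required evidence pieces found in the top-k results."""
    if not gold_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)


def all_evidence_retrieved(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> bool:
    """True only if every required evidence piece appears in the top-k results."""
    return gold_ids <= set(ranked_ids[:k])


# Hypothetical chunk IDs: the ranking an embedding model might return for one
# query, and the two evidence chunks that the query needs.
ranked = ["c3", "c1", "c7", "c2"]
gold = {"c1", "c2"}

print(evidence_recall_at_k(ranked, gold, k=2))    # 0.5
print(all_evidence_retrieved(ranked, gold, k=2))  # False
print(all_evidence_retrieved(ranked, gold, k=4))  # True
```

The stricter check reflects the point made in the abstract: a multi-hop query requires multiple pieces of supporting evidence, so retrieving only one of them leaves the language model without what it needs to reason to the answer.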