SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

AI-generated keywords: Question-answering

AI-generated Key Points

  • Retrieval-augmented generation (RAG) enhances capabilities of large language models (LLMs) by incorporating external knowledge
  • Adapting general-purpose RAG systems to specialized fields like science and medicine poses unique challenges
  • SimRAG is a self-training method that equips a single LLM with both question-answering and question-generation abilities for domain adaptation
  • SimRAG first fine-tunes the LLM on instruction-following, question-answering, and search-related data, then prompts the same LLM to generate domain-relevant questions from unlabeled corpora, filters the output to keep high-quality synthetic examples, and uses those examples to improve performance on domain-specific RAG tasks
  • Experiments on 11 datasets show SimRAG outperforms existing baselines by 1.2% to 8.6%; the evaluation also compares against off-the-shelf domain-specific LLMs and retrieval-augmented LLMs
  • Case studies demonstrate the effectiveness of SimRAG in generating accurate pseudo-labeled QA pairs for tasks like claim verification in textbooks and short-span QA in medical subsets from Wikipedia

Authors: Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He

Work in Progress
License: CC BY 4.0

Abstract: Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these synthetic examples, the LLM can improve its performance on domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three domains, demonstrate that SimRAG outperforms baselines by 1.2% to 8.6%.
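The second stage described in the abstract (generate questions from unlabeled passages, then filter) can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions: `QAExample`, `build_synthetic_set`, `toy_generate_qa`, and `toy_keep` are hypothetical names, and the toy generator and filter stand in for the fine-tuned LLM and the paper's filtering strategy, neither of which is specified on this page.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class QAExample:
    """One synthetic training example: source passage, question, answer."""
    context: str
    question: str
    answer: str


def build_synthetic_set(
    passages: Iterable[str],
    generate_qa: Callable[[str], List[QAExample]],
    keep: Callable[[QAExample], bool],
) -> List[QAExample]:
    """Generate candidate QA pairs for each passage and keep those passing the filter."""
    kept: List[QAExample] = []
    for passage in passages:
        for example in generate_qa(passage):
            if keep(example):
                kept.append(example)
    return kept


def toy_generate_qa(passage: str) -> List[QAExample]:
    """Hypothetical stand-in for prompting the fine-tuned LLM with a passage."""
    first_word = passage.split()[0]
    question = f"What does the passage say about {first_word}?"
    return [QAExample(context=passage, question=question, answer=first_word)]


def toy_keep(example: QAExample) -> bool:
    """Hypothetical filter: the answer must appear in the context and not be trivially short."""
    return example.answer.lower() in example.context.lower() and len(example.answer) > 3


if __name__ == "__main__":
    corpus = ["Aspirin reduces fever.", "An enzyme speeds reactions."]
    for ex in build_synthetic_set(corpus, toy_generate_qa, toy_keep):
        print(ex.question, "->", ex.answer)
```

In an actual setup, the retained examples would be added to the fine-tuning data for the same model; that training step is omitted here.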

Submitted to arXiv on 23 Oct. 2024

Results of the summarizing process for the arXiv paper: 2410.17952v1

Retrieval-augmented generation (RAG) improves the question-answering (QA) performance of large language models (LLMs) by incorporating external knowledge. Adapting general-purpose RAG systems to specialized fields such as science and medicine is difficult, however, because of distribution shifts and limited access to domain-specific data.

SimRAG addresses this with a self-training method that equips a single LLM with both question-answering and question-generation abilities for domain adaptation. The LLM is first fine-tuned on instruction-following, question-answering, and search-related data. The same LLM is then prompted to generate diverse, domain-relevant questions from unlabeled corpora, and a filtering strategy retains only the high-quality synthetic examples. Leveraging these synthetic examples then improves the LLM's performance on domain-specific RAG tasks.

Experiments on 11 datasets, spanning three domains and two backbone sizes, show that SimRAG outperforms existing baselines by 1.2% to 8.6%. The evaluation also compares SimRAG with off-the-shelf domain-specific LLMs and with general and domain-specific retrieval-augmented LLMs. Case studies indicate that SimRAG generates more accurate pseudo-labeled QA pairs than baseline models such as Llama3-8B-it, for example in claim verification on textbooks and short-span QA on medical subsets of Wikipedia.
Overall, SimRAG emerges as a promising approach for enhancing the adaptability of LLMs in specialized domains through its innovative combination of question answering and generation capabilities.
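
The summary mentions a filtering strategy for synthetic examples without describing it. One common way to filter generated questions is a round-trip retrieval check: keep a question only if a retriever, given that question, ranks the passage it was generated from among its top results. The sketch below assumes that criterion purely for illustration and should not be read as the authors' exact procedure; the token-overlap retriever and the names `tokenize`, `top_k`, and `round_trip_keep` are hypothetical.

```python
import re
from typing import List, Sequence, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k(question: str, corpus: Sequence[str], k: int) -> List[int]:
    """Toy retriever: rank passages by token overlap with the question.

    Ties are broken in favor of the lower passage index.
    """
    q_tokens = tokenize(question)
    scores = [len(q_tokens & tokenize(doc)) for doc in corpus]
    order = sorted(range(len(corpus)), key=lambda i: (-scores[i], i))
    return order[:k]


def round_trip_keep(question: str, source_idx: int, corpus: Sequence[str], k: int = 1) -> bool:
    """Keep the question only if its source passage is among the top-k retrieved."""
    return source_idx in top_k(question, corpus, k)


if __name__ == "__main__":
    corpus = [
        "Insulin lowers blood glucose levels.",
        "Mitochondria produce ATP in the cell.",
        "Aspirin inhibits the COX enzymes.",
    ]
    # A question that points back to its source passage (index 0) is kept.
    print(round_trip_keep("Which hormone lowers blood glucose?", 0, corpus))
    # A question attributed to passage 2 that actually matches passage 1 is dropped.
    print(round_trip_keep("What is produced in the cell?", 2, corpus))
```

A real system would replace the overlap score with a dense or sparse retriever and tune `k`; the point is only that the check needs no human labels.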
Created on 12 Nov. 2024
