OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

AI-generated keywords: Scientific Research

AI-generated Key Points

  • OpenScholar is a retrieval-augmented LM designed for scientific queries, utilizing a repository of 45 million open-access papers.
  • ScholarQABench is a multi-domain benchmark for literature search, comprising expert-crafted queries and detailed answers in computer science, physics, neuroscience, and biomedicine.
  • On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model.
  • OpenScholar achieves citation accuracy on par with human experts, whereas GPT-4o hallucinates citations 78 to 90% of the time.
  • OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: OpenScholar-GPT4o improves GPT-4o's correctness by 12%.
  • In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to 32% for GPT-4o.
  • The team behind OpenScholar has open-sourced its code, models, datastore, and data, and provides a public demo.

Authors: Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi

License: CC BY 4.0

Abstract: Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improves off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT4o's 32%. We open-source all of our code, models, datastore, data and a public demo.

Submitted to arXiv on 21 Nov. 2024

Results of the summarizing process for the arXiv paper: 2411.14199v1

Scientific progress depends on the ability to synthesize a large and growing body of literature. Can large language models (LMs) help researchers with this task? OpenScholar is a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages in a datastore of 45 million open-access papers and generating citation-backed responses. To evaluate OpenScholar, the authors developed ScholarQABench, the first large-scale multi-domain benchmark for literature search. It comprises 2,967 expert-written queries and 208 long-form answers spanning computer science, physics, neuroscience, and biomedicine. On this benchmark, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs; for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred responses from OpenScholar-8B and OpenScholar-GPT4o over expert-written ones 51% and 70% of the time, respectively, compared to 32% for GPT-4o. The authors have open-sourced their code, models, datastore, and data, along with a public demo. The work also introduces three new long-form QA datasets, annotated by domain experts, for multi-paper tasks across four scientific disciplines. ScholarQA-CS focuses on computer science topics whose answers require multiple research papers.
ScholarQABench addresses the need for reliable evaluation pipelines in realistic settings, where queries demand retrieval from and reasoning over multiple papers. Building high-quality benchmarks for literature review is difficult; ScholarQABench responds by supporting several task formats for scientific literature synthesis: closed-form classification, multiple-choice questions, and long-form generation. By covering diverse tasks and disciplines (computer science, biomedicine, physics, and neuroscience), it is intended as a testbed for evaluating how well models can automate scientific literature review.
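The summary names three components (a datastore, a retriever, and a self-feedback inference loop) without describing how they interact. The following is a minimal toy sketch of one way such a loop could be wired together, not the authors' implementation. Everything in it is an assumption made for illustration: the `Passage` type, the word-overlap `retrieve` function (standing in for a trained retriever), the stub `draft_answer` generator (standing in for an LM), and the `feedback` rule that treats uncovered query terms as a signal to retrieve more.

```python
from collections import Counter
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class Passage:
    """Hypothetical datastore entry: a paper identifier plus a text snippet."""
    paper_id: str
    text: str


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, datastore, k=1):
    """Toy retriever: rank passages by word overlap with the query."""
    query_counts = Counter(tokenize(query))
    scored = []
    for passage in datastore:
        overlap = sum((query_counts & Counter(tokenize(passage.text))).values())
        if overlap > 0:
            scored.append((overlap, passage))
    # Stable sort: ties keep their datastore order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def draft_answer(passages):
    """Stub generator: quote each passage followed by a bracketed citation."""
    return " ".join(f"{p.text} [{p.paper_id}]" for p in passages)


def feedback(query, passages):
    """Stub self-feedback: list query terms that no retrieved passage covers."""
    covered = set()
    for passage in passages:
        covered.update(tokenize(passage.text))
    return [term for term in tokenize(query) if term not in covered]


def answer_with_self_feedback(query, datastore, k=1, max_rounds=2):
    """Retrieve, check coverage, retrieve again for missing terms, then answer."""
    passages = retrieve(query, datastore, k)
    for _ in range(max_rounds):
        missing = feedback(query, passages)
        if not missing:
            break
        extra = retrieve(" ".join(missing), datastore, k)
        new = [p for p in extra if p not in passages]
        if not new:
            break  # nothing further in the datastore helps
        passages.extend(new)
    return draft_answer(passages), [p.paper_id for p in passages]


if __name__ == "__main__":
    toy_datastore = [
        Passage("P1", "Dense retrieval improves open-domain question answering."),
        Passage("P2", "Citation hallucination is common in language models."),
        Passage("P3", "Protein folding is predicted by deep networks."),
    ]
    text, cited = answer_with_self_feedback(
        "retrieval and citation hallucination", toy_datastore
    )
    print(cited)
    print(text)
```

The point of the sketch is only the control flow: every sentence in the output carries a citation to a passage that was actually retrieved, and the feedback step can trigger a further retrieval round before the final answer is produced.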
Created on 27 Jul. 2025
