Towards Expert-Level Medical Question Answering with Large Language Models

AI-generated keywords: Artificial Intelligence, Grand Challenges, Language Models, Med-PaLM 2, Evaluation Benchmarks

AI-generated Key Points

  • AI systems have recently reached milestones in grand challenges ranging from Go to protein folding.
  • Large language models (LLMs) have driven significant progress on another such challenge: retrieving medical knowledge, reasoning over it, and answering medical questions comparably to physicians.
  • Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination-style questions on the MedQA dataset.
  • Med-PaLM 2 combines base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies, including a novel ensemble refinement approach.
  • Med-PaLM 2 scored up to 86.5% on the MedQA dataset, setting a new state of the art.
  • Performance approaching or exceeding the state of the art was also observed on the MedMCQA, PubMedQA, and MMLU clinical topics datasets.
  • In detailed human evaluations on long-form questions relevant to clinical applications, physicians preferred Med-PaLM 2 answers to physician-written answers on eight of nine axes pertaining to clinical utility (p < 0.001).
  • Potential risks of using such models include inaccurate or irrelevant information, omission of critical information, evidence of demographic bias, and potential harm to patients.

Authors: Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan

License: CC BY 4.0

Abstract: Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
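The "over 19%" improvement quoted in the abstract is most naturally read as an absolute gain in percentage points rather than a relative one. A minimal Python check, using only the two MedQA scores given in the abstract (the variable names are illustrative):

```python
# MedQA accuracy (percent) as reported in the abstract.
med_palm_score = 67.2
med_palm_2_score = 86.5

# Absolute gain, in percentage points.
absolute_gain = round(med_palm_2_score - med_palm_score, 1)

# Relative gain, as a percentage of the Med-PaLM score.
relative_gain = round((med_palm_2_score - med_palm_score) / med_palm_score * 100, 1)

print(f"absolute gain: {absolute_gain} percentage points")
print(f"relative gain: {relative_gain}%")
```

The absolute gain of 19.3 points matches the abstract's "over 19%"; the relative gain over Med-PaLM is larger, at roughly 28.7%.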

Submitted to arXiv on 16 May 2023

Results of the summarizing process for the arXiv paper: 2305.09617v1

AI systems have recently reached milestones in grand challenges ranging from Go to protein folding. Retrieving medical knowledge, reasoning over it, and answering medical questions comparably to physicians has long been viewed as one such challenge. Large language models (LLMs) have driven significant progress here: Med-PaLM was the first model to exceed a "passing" score on US Medical Licensing Examination (USMLE)-style questions, scoring 67.2% on the MedQA dataset. However, that work and other prior work left significant room for improvement, especially when model answers were compared with clinicians' answers. To bridge this gap, the researchers present Med-PaLM 2, which combines base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies, including a novel ensemble refinement approach. The new model scored up to 86.5% on MedQA, setting a new state of the art. Performance approaching or exceeding the state of the art was also observed on the MedMCQA, PubMedQA, and MMLU clinical topics datasets. In detailed human evaluations on long-form questions relevant to clinical applications, physicians preferred Med-PaLM 2 answers to physician-written answers on eight of nine axes pertaining to clinical utility (p < 0.001). To assess the potential impact of test set contamination on the evaluation results, the researchers searched for overlapping text segments between the multiple-choice questions in MultiMedQA and the corpus used to train the base LLM underlying Med-PaLM 2. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering. Potential risks of using such models include inaccurate or irrelevant information, omission of critical information, evidence of demographic bias, and potential harm to patients.
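The contamination check mentioned in the summary (searching for overlapping text segments between test questions and the training corpus) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `has_overlap`, the `min_chars` window length, and the toy corpus are all assumptions made for illustration, and a real training corpus would need an indexed search rather than this brute-force scan.

```python
def has_overlap(question: str, corpus_docs: list[str], min_chars: int) -> bool:
    """Return True if the whole question, or any contiguous window of
    min_chars characters taken from it, appears verbatim in a corpus document.

    Hypothetical helper: min_chars is a free parameter here, not a value
    taken from the summary.
    """
    if len(question) <= min_chars:
        # Short question: only a match of the entire question counts.
        windows = [question]
    else:
        # Every contiguous substring of exactly min_chars characters.
        windows = [
            question[i:i + min_chars]
            for i in range(len(question) - min_chars + 1)
        ]
    return any(window in doc for doc in corpus_docs for window in windows)


# Toy data, invented for this example.
toy_corpus = [
    "Case notes: A 45-year-old man presents with chest pain radiating "
    "to the left arm. An ECG was ordered.",
]
seen_question = (
    "A 45-year-old man presents with chest pain radiating to the left arm. "
    "What is the next step?"
)
unseen_question = "Which vitamin deficiency causes scurvy?"

print(has_overlap(seen_question, toy_corpus, min_chars=20))
print(has_overlap(unseen_question, toy_corpus, min_chars=20))
```

Questions flagged this way can then be scored separately from unflagged ones to see whether overlap with the training data inflates the reported accuracy.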
Created on 18 May 2023
