Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study

AI-generated keywords: Lab results, Large language models, ChatGPT, LLM-based evaluator, Medical experts

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Lab results can be confusing for patients, highlighting a need for accessible and accurate information.
  • Large language models (LLMs) like ChatGPT are being used to provide relevant responses to lab test-related questions.
  • A recent study evaluated the effectiveness of LLMs in generating responses by analyzing 53 QA pairs from Yahoo! Answers.
  • Four different LLMs, including GPT-4, were assessed using standard QA evaluation metrics.
  • GPT-4's responses outperformed other LLMs and human responses in terms of accuracy, helpfulness, relevance, and safety.
  • However, some LLM responses lacked interpretation within a medical context or contained incorrect statements.
  • The study identified areas for improvement in enhancing the quality of LLM-generated responses.

Authors: Zhe He, Balu Bhasuran, Qiao Jin, Shubo Tian, Karim Hanna, Cindy Shavor, Lisbeth Garcia Arguello, Patrick Murray, Zhiyong Lu

License: CC BY-NC-ND 4.0

Abstract: Lab results are often confusing and hard to understand. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to get their questions answered. We aim to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to lab test-related questions asked by patients and to identify potential issues that can be mitigated with augmentation approaches. We first collected question-and-answer data related to lab test results from Yahoo! Answers and selected 53 QA pairs for this study. Using the LangChain framework and the ChatGPT web portal, we generated responses to the 53 questions from four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini. We first assessed the similarity of their answers using standard QA similarity-based evaluation metrics, including ROUGE, BLEU, METEOR, and BERTScore. We also used an LLM-based evaluator to judge whether a target model has higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. Finally, we performed a manual evaluation with medical experts for all the responses to seven selected questions on the same four aspects. The results of Win Rate and medical expert evaluation both showed that GPT-4's responses achieved better scores than all the other LLM responses and human responses on all four aspects (relevance, correctness, helpfulness, and safety). However, LLM responses occasionally also suffer from a lack of interpretation in one's medical context, incorrect statements, and a lack of references. We find that, compared to the other three LLMs and the human answers from the Q&A website, GPT-4's responses are more accurate, more helpful, more relevant, and safer. However, there are cases in which GPT-4's responses are inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses.
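The abstract reports a Win Rate derived from an LLM-based evaluator that compares a target model against a baseline. The metadata does not say exactly how that rate is computed, so the following is only a minimal sketch of one common convention: the share of pairwise comparisons the target wins, with a tie counted as half a win. The function name `win_rate`, the verdict labels, and the tie handling are all assumptions made for illustration, and the example verdicts are invented rather than taken from the study.

```python
from collections import Counter


def win_rate(verdicts):
    """Share of pairwise comparisons won by the target model.

    `verdicts` holds one label per question, from the target model's
    point of view: "win", "tie", or "loss". Counting a tie as half a
    win is an assumption, not necessarily the study's convention.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)


# Invented verdicts for a single aspect (e.g. relevance); not study data.
example = ["win", "win", "tie", "loss"]
print(win_rate(example))  # 0.625
```

Under this convention, a value above 0.5 would mean the evaluator preferred the target model more often than the baseline on that aspect.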

Submitted to arXiv on 23 Jan. 2024

Results of the summarizing process for the arXiv paper: 2402.01693v1

This paper's license doesn't allow us to build upon its content and the summarizing process is here made with the paper's metadata rather than the article.

Lab results can be confusing and difficult for patients to understand, leading to a need for accessible and accurate information. Large language models (LLMs) like ChatGPT have emerged as a promising tool to address this issue by providing relevant and helpful responses to lab test-related questions. In a recent study, researchers aimed to evaluate the effectiveness of using LLMs in generating responses to such queries from patients. The study involved collecting question and answer data related to lab test results from Yahoo! Answers and selecting 53 QA pairs for analysis. Using the LangChain framework and ChatGPT web portal, responses were generated by four different LLMs including GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini. The similarity of their answers was assessed using standard QA evaluation metrics such as ROUGE, BLEU, METEOR, and BERTScore. An LLM-based evaluator was also employed to determine the quality of responses in terms of relevance, correctness, helpfulness, and safety compared to baseline models. Subsequently, a manual evaluation involving medical experts was conducted on seven selected questions across all four aspects. The results indicated that GPT-4's responses outperformed other LLMs and human responses in terms of accuracy, helpfulness, relevance, and safety. However, it was noted that LLM responses occasionally lacked interpretation within a medical context, contained incorrect statements, or lacked references. Despite the overall superiority of GPT-4's responses in comparison to other models and human answers from Q&A websites, the study identified several areas for improvement in enhancing the quality of LLM-generated responses. The authors of this evaluation study include Zhe He, Balu Bhasuran, Qiao Jin, Shubo Tian, Karim Hanna, Cindy Shavor, Lisbeth Garcia Arguello, Patrick Murray, and Zhiyong Lu.
The full paper can be accessed via the provided link for further details on their methodology and findings in evaluating the quality of LLM responses in interpreting lab test results for lay patients.
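To make the similarity-based metrics mentioned above more concrete, here is a minimal sketch of a ROUGE-1-style F1 score, which measures unigram overlap between a generated answer and a reference answer. It is a simplified illustration with naive whitespace tokenization and no stemming, not the implementation used in the study; the function name `rouge1_f1` and the example sentences are invented for this sketch.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference answer.

    Simplified: lowercases and splits on whitespace, with no stemming
    or punctuation handling, unlike full ROUGE implementations.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most as often
    # as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented sentences, for illustration only.
generated = "your glucose level is normal"
reference = "your glucose is slightly high"
print(round(rouge1_f1(generated, reference), 3))  # 0.6
```

The two example sentences share three of their five words yet reach opposite conclusions, which shows why overlap scores alone cannot establish correctness and why the study also relied on an LLM-based evaluator and medical experts.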
Created on 06 Mar. 2024
