MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

AI-generated keywords: MEDITRON, LLMs, medical corpus, pre-training mix, experience replay

AI-generated Key Points

MEDITRON is a suite of open-source large language models (LLMs) designed for the medical domain.
The LLMs have 7B and 70B parameters and are built upon Llama-2 using Nvidia's Megatron-LM distributed trainer.
A comprehensively curated medical corpus, including PubMed articles, abstracts, and internationally recognized medical guidelines, was used to train these models.
The training data covers a wide range of contexts in terms of geographic scope, resource settings, and target audiences.
PubMed papers and abstracts were chosen for the MEDITRON pre-training mix because they provide a vast amount of biomedical text.
Metadata, references, acknowledgments, tables, figures, URLs, and non-English content were removed during preprocessing.
In-text references were formatted using special tokens for accurate citations.
Section headers were indicated with specific tokens for main sections and subsections.
Experience replay was employed in the training process to prevent forgetting previously learned information.


Authors: Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut

arXiv: 2311.16079v1 (cs.CL)

License: CC BY 4.0

Abstract: Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia's Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the best public baseline in its parameter class and 3% over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.

Submitted to arXiv on 27 Nov. 2023


Results of the summarizing process for the arXiv paper: 2311.16079v1

Comprehensive Summary

In this work, the researchers introduce MEDITRON, a suite of open-source large language models (LLMs) designed specifically for the medical domain. The models have 7B and 70B parameters and build on Llama-2 through the authors' adaptation of Nvidia's Megatron-LM distributed trainer. They were trained on a comprehensively curated medical corpus that includes selected PubMed articles, abstracts, and internationally recognized medical guidelines.

The GUIDELINES corpus covers a wide range of contexts in terms of geographic scope (from global to institutional), resource settings (high-, low-, and volatile-resource settings), and target audiences (clinicians or patients). Its sources also vary in their peer review processes, ranging from UN bodies to publicly crowdsourced knowledge bases.

PubMed papers and abstracts were chosen for the pre-training mix because they provide a vast amount of biomedical text. The researchers collected 4.47M full-text papers from the PubMed Central Open Access Subset, as well as 444,521 open-access full-text PubMed papers that were not found in the archive. Additionally, 16,209,047 PubMed and PubMed Central abstracts were collected, with a knowledge cutoff date of August 2023.

During preprocessing, metadata, references, acknowledgments, tables, figures, URLs, and non-English content were removed, while inline citations, section headers, references to figures and tables, and mathematical formulas were marked with special tokens. In-text references were formatted with a methodology similar to that of the Galactica model to promote accurate citations: each was replaced with [BIB_REF] tokens containing the truncated title and the authors' last names. Figure and table references were wrapped with [FIG_REF] tokens containing the figure number and a truncated caption. Section headers were indicated with '#' for main sections and '##' for subsections.

Experience replay was employed during training to overcome catastrophic forgetting: including data from old tasks when training on new tasks helps prevent the model from forgetting previously learned information.


Layman's Summary

MEDITRON is a special computer program that helps doctors with medical information. It was built from a very large amount of text so that it understands the medical field better. The program was trained using many different sources, such as medical articles and guidelines. The researchers took out some extra information from these sources to make them easier to read. They also used a special method called experience replay so that the program remembers things it learned before.

Definitions

- Open-source: A type of computer program that anyone can use and change for free.
- Large language models (LLMs): Special computer programs that understand and generate human language.
- Parameters: The settings or instructions that tell a computer program how to work.
- Corpus: A collection of written or spoken texts used for studying or training a computer program.
- Pre-training mix: The combination of different texts used to teach the MEDITRON program.
- Metadata: Information about other information, like details about where an article comes from.
- In-text references: Special markers in an article that show where certain information comes from.
- Section headers: Titles or labels for different parts of an article or document.
- Experience replay: A technique used in training programs to help them remember things they have learned before.

Introducing MEDITRON: An Open-Source Suite of Large Language Models for the Medical Domain

In this work, researchers introduce MEDITRON, a suite of open-source large language models (LLMs) specifically designed for the medical domain. These LLMs have 7B and 70B parameters and are built upon Llama-2 using Nvidia's Megatron-LM distributed trainer. To train these models, a comprehensively curated medical corpus was used, which includes selected PubMed articles, abstracts, and internationally-recognized medical guidelines.

The GUIDELINES Corpus

The GUIDELINES corpus included in the training data covers a wide range of contexts in terms of geographic scope (from global to institutional), resource settings (high-, low-, and volatile-resource settings), and target audiences (clinicians or patients). It also incorporates various peer review processes ranging from UN bodies to publicly crowdsourced knowledge bases.

Preprocessing Content from PubMed Sources

PubMed papers and abstracts were chosen for the MEDITRON pre-training mix because they provide a vast amount of biomedical text. The researchers collected 4.47M full-text papers from the PubMed Central Open Access Subset, as well as 444,521 open-access full-text PubMed papers that were not found in the archive. Additionally, 16,209,047 PubMed and PubMed Central abstracts were collected, with a knowledge cutoff date of August 2023.

During preprocessing, metadata, references, acknowledgments, tables, figures, URLs, and non-English content were removed, while inline citations, section headers, references to figures and tables, and mathematical formulas were marked with special tokens. In-text references were formatted with a methodology similar to that of the Galactica model to promote accurate citations: each was replaced with [BIB_REF] tokens containing the truncated title and the authors' last names. Figure and table references were wrapped with [FIG_REF] tokens containing the figure number and a truncated caption. Section headers were indicated with '#' for main sections and '##' for subsections.
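The formatting steps described above can be illustrated with a minimal Python sketch. It is not the authors' released pipeline: the function names, the numeric `[12]`-style input citations, the six-word truncation length, and the closing `[/BIB_REF]` marker are all assumptions made for illustration. Only the `[BIB_REF]` token and the '#' and '##' header markers come from the summary.

```python
import re

# Hypothetical truncation length; the summary only says titles are "truncated".
MAX_TITLE_WORDS = 6


def truncate(text: str, max_words: int = MAX_TITLE_WORDS) -> str:
    """Keep only the first max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def mark_citations(text: str, bibliography: dict) -> str:
    """Replace numeric markers such as [12] with a [BIB_REF] span.

    bibliography maps a citation number to (title, author_last_name).
    The closing [/BIB_REF] marker is an assumption of this sketch.
    """
    def replace(match):
        entry = bibliography.get(int(match.group(1)))
        if entry is None:
            # Unknown citation number: leave the original marker untouched.
            return match.group(0)
        title, last_name = entry
        return f"[BIB_REF]{truncate(title)}, {last_name}[/BIB_REF]"

    return re.sub(r"\[(\d+)\]", replace, text)


def mark_header(title: str, level: int) -> str:
    """Prefix main sections (level 1) with '#' and subsections (level 2) with '##'."""
    if level not in (1, 2):
        raise ValueError("level must be 1 (main section) or 2 (subsection)")
    return "#" * level + " " + title


def strip_urls(text: str) -> str:
    """Drop http(s) URLs, then collapse the doubled spaces left behind."""
    without_urls = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s{2,}", " ", without_urls).strip()


if __name__ == "__main__":
    bib = {1: ("Attention is all you need for everything else", "Vaswani")}
    print(mark_citations("Transformers work well [1].", bib))
    print(mark_header("Methods", 1))
    print(strip_urls("See https://example.com for details"))
```

Keeping a shortened title and an author name inside the citation span, rather than an opaque number, gives the model text it can learn to associate with the cited claim, which is the stated motivation for the Galactica-style format.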

Experience Replay Training Process

Experience replay was employed during training to overcome catastrophic forgetting: including data from old tasks when training on new tasks helps prevent the model from forgetting previously learned information.
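As a rough illustration of the idea, the sketch below mixes a small share of "old task" documents into a new-domain training set. It assumes that, for continued pretraining, the old-task data is general-domain text of the kind the base model originally saw. The function name `build_mixture` and the 1% replay share used in the example are hypothetical choices for this sketch, not values taken from the summary.

```python
import random


def build_mixture(domain_docs, general_docs, replay_fraction, seed=0):
    """Return a shuffled list in which roughly replay_fraction of the
    documents are general-domain "replay" samples and the rest are
    new-domain documents.
    """
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_domain + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(domain_docs) * replay_fraction / (1 - replay_fraction))
    # Never ask for more replay documents than are available.
    n_replay = min(n_replay, len(general_docs))
    replay_docs = rng.sample(list(general_docs), n_replay)
    mixture = list(domain_docs) + replay_docs
    # Interleave replay samples with domain samples instead of appending them.
    rng.shuffle(mixture)
    return mixture


if __name__ == "__main__":
    medical = [f"med-{i}" for i in range(99)]
    general = [f"gen-{i}" for i in range(50)]
    mixture = build_mixture(medical, general, replay_fraction=0.01)
    n_general = sum(doc.startswith("gen-") for doc in mixture)
    print(len(mixture), n_general)
```

Shuffling the replay samples into the stream, instead of training on them in a separate phase, means the model keeps seeing some old-distribution data throughout training on the new domain.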

Created on 05 Dec. 2023

