Optimizing Distributed Training on Frontier for Large Language Models

AI-generated keywords: Large Language Models, Distributed Training, Hyperparameter Tuning, Frontier Supercomputer, AMD-Powered

AI-generated Key Points

- Large language models (LLMs) have shown remarkable success as foundation models
- Fine-tuning LLMs has proven beneficial for various downstream applications
- Training LLMs with billions of parameters requires significant computational resources
- Efficient distributed strategies such as tensor parallelism, pipeline parallelism, and sharded data parallelism are explored to train a trillion-parameter model on the Frontier exascale supercomputer
- The authors analyze these distributed training techniques individually to determine which ones to use and what associated parameters to select
- Hyperparameter tuning is performed to understand the complex interplay between these techniques
- Optimal strategies for training models of different sizes (22B, 175B, and 1T parameters) are identified, reaching 38.38%, 36.14%, and 31.96% achieved throughput respectively
- Achieved 100% weak scaling efficiency for the 175B and 1T models, with 89% and 87% strong scaling efficiency respectively
- Strategies for distributed training of LLMs based on experimental findings and hyperparameter tuning are presented
- Focus on finding the best strategies to train large models on Frontier by combining tensor parallelism, pipeline parallelism, micro-batch size, and gradient accumulation steps
- Empirical analysis of multiple distribution strategies and valuable observations for training a 22B model are provided
- Hyperparameter tuning results reported for a 175B model and a training recipe devised for both the 175B and 1T models
- Enables state-of-the-art distributed training algorithms on AMD hardware using the ROCm software platform
- Serves as a blueprint for efficient training of LLMs on non-NVIDIA and non-CUDA platforms like the AMD-powered Frontier supercomputer
- Presents an optimized distributed training strategy, found through hyperparameter search, that manages the GPU memory wall and communication latency

Authors: Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing Cong, Feiyi Wang, Prasanna Balaprakash

arXiv: 2312.12705v1 (cs.DC)

License: CC BY 4.0

Abstract: Large language models (LLM) are showing tremendous success as foundation models, and many downstream applications benefit from fine-tuning. Prior works on loss scaling have demonstrated that the larger LLMs perform better than their smaller counterparts. However, training LLMs with billions of parameters requires considerable computational resources; to train a one trillion GPT-style model on 20 trillion tokens, we need to perform 120 million exaflops. Frontier is the world's first and fastest exascale supercomputer for open science and is equipped with 75264 MI250X GPUs. This work explores efficient distributed strategies such as tensor parallelism, pipeline parallelism, and sharded data parallelism to train a trillion-parameter model on the Frontier exascale supercomputer. We analyze these distributed training techniques and associated parameters individually to decide which techniques to use and what associated parameters to select for a particular technique. We perform hyperparameter tuning on these techniques to understand their complex interplay. Combined with these two tuning efforts, we have found optimal strategies to train three models of size 22B, 175B, and 1T parameters with $38.38\%$, $36.14\%$, and $31.96\%$ achieved throughput. For training the 175B parameter model and 1T model, we have achieved $100\%$ weak scaling efficiency and $89\%$ and $87\%$ strong scaling efficiency, respectively. Our work presents a set of strategies for distributed training of LLMs through experimental findings and hyperparameter tuning.
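
The compute figure quoted in the abstract can be sanity-checked with the widely used rule of thumb of roughly 6 floating-point operations per parameter per training token. The sketch below assumes that rule (this page does not say which formula the authors used), and the function name `training_flops` is purely illustrative.

```python
EXAFLOP = 1e18  # one exaFLOP = 10^18 floating-point operations


def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Estimate total training FLOPs with the ~6 * N * D rule of thumb.

    The factor 6 (forward plus backward pass) is an assumption of this
    sketch, not a number taken from the summary above.
    """
    return flops_per_param_token * n_params * n_tokens


# One trillion parameters trained on 20 trillion tokens, as in the abstract.
total_flops = training_flops(1e12, 20e12)
million_exaflops = total_flops / EXAFLOP / 1e6

print(f"{total_flops:.2e} FLOPs, about {million_exaflops:.0f} million exaFLOPs")
```

Under this assumption the estimate comes out at about 1.2 x 10^26 operations, which matches the "120 million exaflops" stated in the abstract.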

Submitted to arXiv on 20 Dec. 2023

Results of the summarizing process for the arXiv paper: 2312.12705v1

Comprehensive Summary

Large language models (LLMs) have shown remarkable success as foundation models, and fine-tuning them has proven beneficial for various downstream applications. However, training LLMs with billions of parameters requires significant computational resources. To address this challenge, the authors explore efficient distributed strategies such as tensor parallelism, pipeline parallelism, and sharded data parallelism to train a trillion-parameter model on the Frontier exascale supercomputer.

The authors analyze these distributed training techniques individually to determine which ones to use and what associated parameters to select. They also perform hyperparameter tuning to understand the complex interplay between these techniques. Together, these two efforts identify optimal strategies for training models of three sizes: 22B, 175B, and 1T parameters, with 38.38%, 36.14%, and 31.96% achieved throughput respectively. For the 175B and 1T models, the authors achieved 100% weak scaling efficiency, and 89% and 87% strong scaling efficiency respectively.

The authors focus on finding the best strategies to train large models on Frontier by combining tensor parallelism, pipeline parallelism, micro-batch size, and gradient accumulation steps. They examine each component in detail to optimize it for Frontier's infrastructure. Their objective is not to achieve the highest possible accuracy but to improve the performance characteristics of training on HPC systems. The paper outlines various distribution strategies and evaluates their cost for training large LLMs on Frontier. It provides an empirical analysis of multiple distribution strategies and identifies valuable observations for training a 22B model. The authors also report hyperparameter tuning results for a 175B model and devise a training recipe for both the 175B and 1T models.

The contributions of this work include enabling state-of-the-art distributed training algorithms on AMD hardware using the ROCm software platform. It serves as a blueprint for efficient training of LLMs on non-NVIDIA and non-CUDA platforms like the AMD-powered Frontier supercomputer. Additionally, the research presents an optimized distributed training strategy, found through hyperparameter search, that manages the GPU memory wall and communication latency.
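
The way tensor parallelism, pipeline parallelism, micro-batch size, and gradient accumulation steps fit together can be illustrated with a small helper. This is a minimal sketch of the usual 3D-parallel bookkeeping, not the authors' code: the class `ParallelLayout` and the example values are hypothetical and are not the configurations reported in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical container for the four knobs discussed above."""
    tensor_parallel: int    # GPUs that split each layer's weight matrices
    pipeline_parallel: int  # number of pipeline stages
    micro_batch_size: int   # samples per forward/backward pass per replica
    grad_accum_steps: int   # micro-batches accumulated per optimizer step

    def data_parallel(self, world_size: int) -> int:
        """Model replicas left over after tensor and pipeline parallelism."""
        model_parallel = self.tensor_parallel * self.pipeline_parallel
        if world_size % model_parallel != 0:
            raise ValueError("world_size must be divisible by TP * PP")
        return world_size // model_parallel

    def global_batch_size(self, world_size: int) -> int:
        """Samples consumed per optimizer step across all replicas."""
        return (self.micro_batch_size * self.grad_accum_steps
                * self.data_parallel(world_size))


# Illustrative values only; they are not taken from the paper.
layout = ParallelLayout(tensor_parallel=8, pipeline_parallel=16,
                        micro_batch_size=1, grad_accum_steps=64)
dp = layout.data_parallel(world_size=1024)
gbs = layout.global_batch_size(world_size=1024)
print(dp, gbs)
```

Because the four values are coupled through the device count and the global batch size, changing one of them shifts the feasible range of the others, which is one reason a joint hyperparameter search is useful.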

Layman's Summary

Large language models (LLMs) are computer models that are used for various applications and have been very successful. Fine-tuning LLMs means making small adjustments to improve their performance in specific tasks. Training LLMs with billions of parameters requires a lot of computational resources, like powerful computers. Different strategies, such as tensor parallelism and pipeline parallelism, are being explored to train even bigger models on supercomputers. The authors of the study analyzed these strategies individually to see which ones work best and how to set them up properly. Hyperparameter tuning is when you adjust certain settings to find the best combination of strategies. They found the best ways to train different sizes of models and achieved good efficiency in scaling them up. The study focuses on finding the best strategies for training large models on a specific supercomputer called Frontier, using AMD hardware.

Efficient Distributed Training of Large Language Models on Frontier

Large language models (LLMs) have become increasingly popular in recent years due to their success as foundation models and the benefits of fine-tuning them for various downstream applications. However, training LLMs with billions of parameters requires significant computational resources. To address this challenge, researchers from the University of Tennessee and Oak Ridge National Laboratory explored efficient distributed strategies such as tensor parallelism, pipeline parallelism, and sharded data parallelism to train a trillion-parameter model on the Frontier exascale supercomputer. In their paper “Optimizing Distributed Training on Frontier for Large Language Models”, they analyze these distributed training techniques individually to determine which ones to use and what associated parameters to select. They also perform hyperparameter tuning to understand the complex interplay between these techniques. Through these efforts, they identified optimal strategies for training models of three sizes: 22B, 175B, and 1T parameters.

Distribution Strategies

The authors focus on finding the best strategies to train large models on Frontier by combining tensor parallelism, pipeline parallelism, micro-batch size, and gradient accumulation steps. They examine each component in detail to optimize it for Frontier's infrastructure. Their objective is not to achieve the highest possible accuracy but to improve the performance characteristics of training on HPC systems. The paper outlines various distribution strategies and evaluates their cost for training large LLMs on Frontier. It provides an empirical analysis of multiple distribution strategies and identifies valuable observations for training a 22B model.
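
One reason pipeline parallelism and gradient accumulation steps interact is the pipeline "bubble": stages sit idle while the pipeline fills and drains. A common estimate for a simple non-interleaved schedule with p stages and m micro-batches per step is an idle fraction of (p - 1) / (m + p - 1). The sketch below assumes that textbook formula; it is not taken from the paper, and `bubble_fraction` is a hypothetical helper name.

```python
def bubble_fraction(pipeline_stages: int, micro_batches: int) -> float:
    """Idle-time fraction for a simple, non-interleaved pipeline schedule.

    Assumes every stage takes equal time per micro-batch and ignores
    communication cost; both are simplifications made for this sketch.
    """
    if pipeline_stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro-batches must be positive")
    p, m = pipeline_stages, micro_batches
    return (p - 1) / (m + p - 1)


# More micro-batches per step shrink the bubble for a fixed stage count.
for m in (16, 64, 256):
    print(f"stages=16, micro_batches={m}: bubble={bubble_fraction(16, m):.3f}")
```

Under these assumptions, raising the number of accumulated micro-batches reduces the idle fraction, but it also raises the global batch size, so the two cannot be tuned in isolation.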

Results

For the 175B and 1T parameter models, the authors achieved 100% weak scaling efficiency, and 89% and 87% strong scaling efficiency respectively. These results rely on optimized distribution strategies, found through hyperparameter search, that manage the GPU memory wall and the communication latency that arise during distributed training. The authors present a set of strategies based on experimental findings, report hyperparameter tuning results for the 175B model, and devise a training recipe for both the 175B and 1T models. Other researchers running distributed deep learning workloads on HPC systems such as the AMD-powered Frontier supercomputer can reuse this recipe.
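
For readers unfamiliar with the two metrics, the sketch below uses the standard textbook definitions: strong scaling keeps the total problem size fixed while adding GPUs, and weak scaling grows the problem in proportion to the GPU count. The timings are invented for illustration and are not measurements from the paper; the function names are hypothetical.

```python
def strong_scaling_efficiency(base_gpus: int, base_time: float,
                              gpus: int, time: float) -> float:
    """Fixed total work: ideal time falls as 1 / gpus."""
    return (base_gpus * base_time) / (gpus * time)


def weak_scaling_efficiency(base_time: float, time: float) -> float:
    """Work per GPU held constant: ideal time stays flat as GPUs are added."""
    return base_time / time


# Made-up step times in seconds, chosen only to show the arithmetic.
strong = strong_scaling_efficiency(base_gpus=1024, base_time=100.0,
                                   gpus=2048, time=62.5)
weak = weak_scaling_efficiency(base_time=100.0, time=100.0)
print(f"strong: {strong:.0%}, weak: {weak:.0%}")
```

In a data-parallel setting, weak scaling typically corresponds to growing the global batch size with the number of replicas, while strong scaling keeps the global batch size fixed; the summary above does not spell out the exact setup the authors used.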

Contributions

The contributions of this work include enabling state-of-the-art distributed training algorithms on AMD hardware using the ROCm software platform, which serves as a blueprint for efficient training of LLMs on non-NVIDIA, non-CUDA platforms like the AMD-powered Frontier supercomputer. Additionally, it presents an optimized distributed training strategy, found through hyperparameter search, that manages the GPU memory wall and the communication latency that arise during distributed training.
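
The "GPU memory wall" can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the commonly cited accounting of about 16 bytes per parameter for mixed-precision training with Adam (half-precision weights and gradients plus single-precision optimizer state) and 64 GB of memory per MI250X compute die. Both numbers are assumptions of this example rather than figures from the summary, and activation memory is ignored.

```python
BYTES_PER_PARAM = 16          # assumed: mixed-precision Adam model state
HBM_BYTES_PER_GCD = 64 * 10**9  # assumed: 64 GB per MI250X compute die (GCD)


def min_devices_for_model_state(n_params: int,
                                bytes_per_param: int = BYTES_PER_PARAM,
                                device_bytes: int = HBM_BYTES_PER_GCD) -> int:
    """Lower bound on devices needed just to hold weights, gradients,
    and optimizer state, ignoring activations and fragmentation."""
    total_bytes = n_params * bytes_per_param
    return -(-total_bytes // device_bytes)  # integer ceiling division


for name, n_params in (("22B", 22 * 10**9),
                       ("175B", 175 * 10**9),
                       ("1T", 10**12)):
    print(name, min_devices_for_model_state(n_params))
```

Even under these optimistic assumptions, a trillion-parameter model cannot fit on a single device, which is why the model state has to be split with tensor, pipeline, or sharded data parallelism.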

Created on 03 Jan. 2024
