Optimizing Distributed Training on Frontier for Large Language Models

AI-generated keywords: Large Language Models, Distributed Training, Hyperparameter Tuning, Frontier Supercomputer, AMD-Powered

AI-generated Key Points

  • Large language models (LLMs) have shown remarkable success as foundation models
  • Fine-tuning LLMs has proven beneficial for various downstream applications
  • Training LLMs with billions of parameters requires significant computational resources
  • Efficient distributed strategies such as tensor parallelism, pipeline parallelism, and sharded data parallelism are explored to train a trillion-parameter model on the Frontier exascale supercomputer
  • The authors analyze these distributed training techniques individually to determine which ones to use and what associated parameters to select
  • Hyperparameter tuning is performed to understand the complex interplay between these techniques
  • Optimal strategies for training models of different sizes (22B, 175B, and 1T parameters) are identified
  • Achieved 100% weak scaling efficiency for both the 175B and 1T models, with strong scaling efficiency of 89% (175B) and 87% (1T)
  • Strategies for distributed training of LLMs based on experimental findings and hyperparameter tuning are presented
  • Focus on finding the best strategies to train large models on Frontier by combining tensor parallelism, pipeline parallelism, micro-batch size, and gradient accumulation steps
  • Empirical analysis of multiple distribution strategies and valuable observations for training a 22B model are provided
  • Hyperparameter tuning results reported for a 175B model and a training recipe devised for both the 175B and 1T models
  • Enables state-of-the-art distributed training algorithms on AMD hardware using the ROCm software platform
  • Serves as a blueprint for efficient training of LLMs on non-NVIDIA and non-CUDA platforms like the AMD-powered Frontier supercomputer
  • Presents an optimized distributed training strategy, found through hyperparameter search, that effectively manages the GPU memory wall and communication latency
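The weak and strong scaling percentages in the key points can be read against the usual textbook definitions. The sketch below assumes those standard formulas; the summary does not describe the paper's exact measurement procedure, and the function names and sample timings are hypothetical.

```python
def strong_scaling_efficiency(base_devices: int, base_time: float,
                              devices: int, time: float) -> float:
    """Fixed total problem size: ideal time shrinks in proportion to device count.

    Efficiency is the ratio of device-time spent at the baseline to
    device-time spent at the larger scale.
    """
    return (base_time * base_devices) / (time * devices)


def weak_scaling_efficiency(base_time: float, time: float) -> float:
    """Fixed work per device: ideal time per step stays constant as devices grow."""
    return base_time / time


# Hypothetical timings, chosen only to illustrate the formulas.
print(strong_scaling_efficiency(base_devices=1, base_time=100.0, devices=4, time=25.0))
print(weak_scaling_efficiency(base_time=10.0, time=12.5))
```

Under these definitions, 100% weak scaling means the time per step did not grow when devices and batch were scaled together, while 89% strong scaling means a fixed workload used about 12% more device-time at the larger scale than at the baseline.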

Authors: Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing Cong, Feiyi Wang, Prasanna Balaprakash

License: CC BY 4.0

Abstract: Large language models (LLM) are showing tremendous success as foundation models, and many downstream applications benefit from fine-tuning. Prior works on loss scaling have demonstrated that the larger LLMs perform better than their smaller counterparts. However, training LLMs with billions of parameters requires considerable computational resources; to train a one trillion GPT-style model on 20 trillion tokens, we need to perform 120 million exaflops. Frontier is the world's first and fastest exascale supercomputer for open science and is equipped with 75264 MI250X GPUs. This work explores efficient distributed strategies such as tensor parallelism, pipeline parallelism, and sharded data parallelism to train a trillion-parameter model on the Frontier exascale supercomputer. We analyze these distributed training techniques and associated parameters individually to decide which techniques to use and what associated parameters to select for a particular technique. We perform hyperparameter tuning on these techniques to understand their complex interplay. Combined with these two tuning efforts, we have found optimal strategies to train three models of size 22B, 175B, and 1T parameters with $38.38\%$, $36.14\%$, and $31.96\%$ achieved throughput. For training the 175B parameter model and 1T model, we have achieved $100\%$ weak scaling efficiency and $89\%$ and $87\%$ strong scaling efficiency, respectively. Our work presents a set of strategies for distributed training of LLMs through experimental findings and hyperparameter tuning.
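The 120 million exaflops figure in the abstract is consistent with a common rule of thumb: training a dense transformer costs roughly 6 floating-point operations per parameter per token. That factor of 6 is an assumption here, not something the abstract states, and the function name is hypothetical. A minimal sketch of the arithmetic:

```python
EXA = 1e18  # one exaflop, read here as 10^18 floating-point operations


def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Estimate total training compute with the ~6 * N * D approximation.

    The default factor of 6 (forward plus backward pass) is an assumed
    rule of thumb, not a value taken from the paper.
    """
    return flops_per_param_token * n_params * n_tokens


# One trillion parameters trained on 20 trillion tokens, as in the abstract.
total = training_flops(n_params=1e12, n_tokens=20e12)
print(f"{total:.2e} operations = {total / EXA:.3g} exa-operations")
```

With these inputs the estimate is 1.2 x 10^26 operations, that is, 1.2 x 10^8 (120 million) units of 10^18 operations.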

Submitted to arXiv on 20 Dec. 2023


Results of the summarizing process for the arXiv paper: 2312.12705v1

Large language models (LLMs) have shown remarkable success as foundation models, and fine-tuning them has proven beneficial for various downstream applications. However, training LLMs with billions of parameters requires significant computational resources. To address this challenge, the authors explore efficient distributed strategies such as tensor parallelism, pipeline parallelism, and sharded data parallelism to train a trillion-parameter model on the Frontier exascale supercomputer.
The authors analyze these distributed training techniques individually to determine which ones to use and what associated parameters to select. They also perform hyperparameter tuning to understand the complex interplay between these techniques. Together, these two efforts identify optimal strategies for training models of three sizes: 22B, 175B, and 1T parameters. Both the 175B and 1T models reach 100% weak scaling efficiency; strong scaling efficiency is 89% for the 175B model and 87% for the 1T model. The result is a set of strategies for distributed training of LLMs grounded in experimental findings and hyperparameter tuning.
The focus is on finding the best strategies to train large models on Frontier by combining tensor parallelism, pipeline parallelism, micro-batch size, and gradient accumulation steps. The authors examine each component in detail to optimize it for Frontier's infrastructure. Their objective is not to achieve the highest possible accuracy but to improve the performance characteristics of training on HPC systems. The paper outlines various distribution strategies and evaluates their cost for training large LLMs on Frontier. It provides an empirical analysis of multiple distribution strategies and reports observations from training a 22B model. The authors also report hyperparameter tuning results for a 175B model and devise a training recipe for both the 175B and 1T models.
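The knobs named above (tensor parallelism, pipeline parallelism, micro-batch size, gradient accumulation steps) constrain each other. The sketch below assumes the Megatron-style convention in which the device count factors into tensor-parallel, pipeline-parallel, and data-parallel groups; the class name and the example values are hypothetical and are not the paper's tuned configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """One candidate point in the search space (illustrative names only)."""
    world_size: int          # total number of devices
    tensor_parallel: int     # devices that split each layer's matrices
    pipeline_parallel: int   # pipeline stages the layers are divided into
    micro_batch_size: int    # samples per device per forward/backward pass
    grad_accum_steps: int    # micro-batches accumulated before an optimizer step

    @property
    def data_parallel(self) -> int:
        """Number of model replicas left over after model parallelism."""
        model_parallel = self.tensor_parallel * self.pipeline_parallel
        if self.world_size % model_parallel != 0:
            raise ValueError("world_size must be divisible by TP * PP")
        return self.world_size // model_parallel

    @property
    def global_batch_size(self) -> int:
        """Samples consumed per optimizer step across all replicas."""
        return self.micro_batch_size * self.grad_accum_steps * self.data_parallel


# Hypothetical layout, chosen only to show how the quantities relate.
layout = ParallelLayout(world_size=1024, tensor_parallel=8, pipeline_parallel=16,
                        micro_batch_size=1, grad_accum_steps=64)
print(layout.data_parallel, layout.global_batch_size)
```

Because raising tensor or pipeline parallelism lowers the number of data-parallel replicas for a fixed device count, a change to one knob shifts the feasible range of the others, which is one reason a joint hyperparameter search is useful.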
The contributions of this work include enabling state-of-the-art distributed training algorithms on AMD hardware using the ROCm software platform. The work serves as a blueprint for efficient training of LLMs on non-NVIDIA and non-CUDA platforms like the AMD-powered Frontier supercomputer. Additionally, the research presents an optimized distributed training strategy, found through hyperparameter search, that effectively manages the GPU memory wall and communication latency.
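To make the GPU memory wall concrete, here is a rough sketch of how many devices are needed just to hold the model state. It assumes 16 bytes per parameter (a common accounting for mixed-precision Adam: fp16 weights and gradients plus fp32 master weights, momentum, and variance) and 64 GB per device (one MI250X compute die). Both numbers are assumptions rather than figures from this summary, and activations, buffers, and fragmentation are ignored, so real requirements are higher.

```python
def model_state_bytes(n_params: int, bytes_per_param: int = 16) -> int:
    """Bytes for weights, gradients, and optimizer state (assumed 16 B/param)."""
    return n_params * bytes_per_param


def min_devices(n_params: int, device_bytes: int = 64 * 10**9) -> int:
    """Lower bound on devices needed to hold the model state alone."""
    total = model_state_bytes(n_params)
    return -(-total // device_bytes)  # ceiling division on integers


# The three model sizes discussed in the paper.
for name, n in [("22B", 22 * 10**9), ("175B", 175 * 10**9), ("1T", 10**12)]:
    print(name, model_state_bytes(n) / 1e12, "TB ->", min_devices(n), "devices")
```

Under these assumptions even the smallest model does not fit on a single device, which is why the model must be split with tensor, pipeline, or sharded data parallelism before any throughput tuning can begin.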
Created on 03 Jan. 2024
