FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

AI-generated keywords: Decentralized Training, FusionLLM, Directed Acyclic Graph, Workload Estimator, AdaTopK Compressor

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • System created to address hardware scarcity in training large deep neural networks (DNNs), specifically large language models (LLMs)
  • Designed for decentralized training using geo-distributed GPUs across computing clusters or individual devices
  • Challenges in system design and efficiency include remote automatic differentiation (RAD), flexible model definitions, heterogeneous software support, low resource utilization due to heterogeneous hardware or the straggler problem, and slow network communication
  • DNN model represented as a directed acyclic graph of operators (OP-DAG), so users can customize any DNN without dealing with low-level operator implementation
  • DAG runtime executor implements RAD without requiring consistent low-level ML framework versions
  • Workload estimator and OP-Fence scheduler implemented to increase throughput by grouping devices with similar bandwidths and partitioning the DAG
  • AdaTopK compressor proposed to adaptively compress intermediate activations and gradients at the slowest communication links
  • Experiments training ResNet-101 and GPT-2 on three real-world testbeds, using 48 GPUs connected via 8 Mbps to 10 Gbps networks, showed FusionLLM achieving 1.45x to 9.39x speedups over baseline methods while ensuring convergence
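
To make the OP-DAG idea in the key points concrete, here is a minimal sketch of a model expressed as a graph of operators, ordered by data dependency and then cut into contiguous stages. It is not FusionLLM's implementation: the operator names, the per-operator costs, and the greedy `partition_into_stages` rule are all assumptions chosen for illustration, and the paper's OP-Fence scheduler may partition very differently.

```python
from collections import deque


def topological_order(nodes, edges):
    """Kahn's algorithm: order operators so every producer precedes its consumers."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in successors[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


def partition_into_stages(order, cost, num_stages):
    """Hypothetical greedy rule: close a stage once it reaches the average cost.

    May return fewer than num_stages stages when costs are very uneven.
    """
    target = sum(cost[op] for op in order) / num_stages
    stages, current, acc = [], [], 0.0
    for op in order:
        current.append(op)
        acc += cost[op]
        if acc >= target and len(stages) < num_stages - 1:
            stages.append(current)
            current, acc = [], 0.0
    if current:
        stages.append(current)
    return stages


# Toy transformer-like block with a residual connection (names are illustrative).
nodes = ["embed", "attn", "mlp", "add", "head"]
edges = [("embed", "attn"), ("embed", "add"), ("attn", "mlp"),
         ("mlp", "add"), ("add", "head")]
cost = {"embed": 1, "attn": 4, "mlp": 4, "add": 1, "head": 2}  # assumed workloads

order = topological_order(nodes, edges)
stages = partition_into_stages(order, cost, num_stages=2)
print(order)
print(stages)
```

Each stage could then be assigned to one device or device group. A real scheduler would also have to weigh the size of the tensors crossing each cut, which this sketch ignores.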

Authors: Zhenheng Tang, Xueze Kang, Yiming Yin, Xinglin Pan, Yuxin Wang, Xin He, Qiang Wang, Rongfei Zeng, Kaiyong Zhao, Shaohuai Shi, Amelie Chi Zhou, Bo Li, Bingsheng He, Xiaowen Chu

Abstract: To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, while each edge represents a data dependency between operators. Based on this design, 1) users can customize any DNN without caring about low-level operator implementation; 2) we enable task scheduling with more fine-grained sub-tasks, offering more optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected with 8 Mbps to 10 Gbps networks. Experimental results demonstrate that our system and method can achieve a 1.45x to 9.39x speedup compared to baseline methods while ensuring convergence.
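
The abstract says devices with similar bandwidths are clustered together, but the metadata does not say how. The sketch below shows one simple way such a grouping could work: sort devices by bandwidth and start a new group whenever the gap to the previous device exceeds a ratio. The function name `group_by_bandwidth`, the `max_ratio` threshold, the device names, and the use of a single bandwidth figure per device are all assumptions for illustration, not the OP-Fence algorithm.

```python
def group_by_bandwidth(bandwidth_mbps, max_ratio=4.0):
    """Group devices whose bandwidths are within max_ratio of their neighbor.

    bandwidth_mbps maps a device name to a positive bandwidth in Mbps.
    Devices are visited from fastest to slowest; a new group starts when
    the previous device is more than max_ratio times faster.
    """
    ordered = sorted(bandwidth_mbps.items(), key=lambda kv: kv[1], reverse=True)
    groups = []
    for name, bw in ordered:
        if groups and groups[-1][-1][1] / bw <= max_ratio:
            groups[-1].append((name, bw))
        else:
            groups.append([(name, bw)])
    return [[name for name, _ in group] for group in groups]


# Hypothetical devices spanning the 8 Mbps to 10 Gbps range from the abstract.
devices = {"a": 10000, "b": 8000, "c": 1000, "d": 800, "e": 8}
groups = group_by_bandwidth(devices)
print(groups)
```

Grouping this way keeps tightly coupled pipeline stages on well-connected devices, so the slow links appear only between groups rather than inside them.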

Submitted to arXiv on 16 Oct. 2024


Results of the summarizing process for the arXiv paper: 2410.12707v1


FusionLLM was created to address hardware scarcity in training large deep neural networks (DNNs), specifically large language models (LLMs). It is designed for decentralized training using geo-distributed GPUs across computing clusters or individual devices. This setting raises challenges in system design and efficiency: remote automatic differentiation (RAD), flexible model definitions and heterogeneous software support, low resource utilization due to heterogeneous hardware or the straggler problem, and slow network communication. To overcome these challenges, the system represents the DNN model as a directed acyclic graph of operators (OP-DAG). Each node represents an operator in the DNN, and each edge indicates a data dependency between operators. This design lets users customize any DNN without worrying about low-level operator implementation. It also enables task scheduling with more fine-grained sub-tasks, which leaves more room for optimization. In addition, a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions.

To improve efficiency, the authors implement a workload estimator and an OP-Fence scheduler that groups devices with similar bandwidths and partitions the DAG to increase throughput. They also propose an AdaTopK compressor that adaptively compresses intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of the system and its algorithms, ResNet-101 and GPT-2 were trained on three real-world testbeds using 48 GPUs connected via 8 Mbps to 10 Gbps networks. The reported speedups range from 1.45x to 9.39x over baseline methods while ensuring convergence.
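
As a rough illustration of compressing activations or gradients more aggressively on slower links, the sketch below applies plain top-k sparsification, keeping only the largest-magnitude entries, and picks the kept fraction from the link bandwidth. The rule in `ratio_for_link`, the `min_ratio` floor, and the sample values are assumptions; the metadata does not describe how AdaTopK actually chooses its compression ratios.

```python
import heapq


def topk_compress(values, ratio):
    """Keep the k largest-magnitude entries, where k = max(1, int(len * ratio))."""
    k = max(1, int(len(values) * ratio))
    indices = heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i]))
    indices.sort()
    return indices, [values[i] for i in indices], len(values)


def topk_decompress(indices, kept, length):
    """Rebuild a dense vector, filling dropped positions with zeros."""
    dense = [0.0] * length
    for i, v in zip(indices, kept):
        dense[i] = v
    return dense


def ratio_for_link(bandwidth_mbps, fastest_mbps, min_ratio=0.01):
    """Hypothetical rule: keep a fraction proportional to relative link speed."""
    return max(min_ratio, min(1.0, bandwidth_mbps / fastest_mbps))


# Toy activation vector sent over a 2.5 Gbps link when the fastest link is 10 Gbps.
activation = [0.1, -2.0, 0.03, 1.5, -0.2, 0.0, 0.7, -0.05]
ratio = ratio_for_link(2500, 10000)
indices, kept, length = topk_compress(activation, ratio)
restored = topk_decompress(indices, kept, length)
print(ratio, indices, kept)
print(restored)
```

Only the indices and kept values need to cross the network, so a smaller ratio means less traffic at the cost of a lossier reconstruction on the receiving side.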
Created on 17 Oct. 2024

