Photon: Federated LLM Pre-Training

AI-generated keywords: Photon Federated Learning Large Language Models Distributed Training Pre-training

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Photon is a system designed for federated end-to-end training of large language models (LLMs) in a distributed environment.
  • It leverages low-bandwidth methods like federated learning (FL) to enable collaborative training across weakly-connected GPUs, with a focus on pre-training.
  • Photon's key strength lies in its robustness to data heterogeneity and its ability to converge twice as fast as previous methods such as DiLoCo.
  • The system achieves enhanced efficiency through small client batch sizes and high learning rates enabled by federated averaging's robustness to hyperparameters.
  • Photon can train model sizes up to 7B in a federated fashion while achieving better perplexity than centralized pre-training methods, showcasing its value for efficient and scalable LLM training solutions.
  • It outperforms the wall-time of baseline distributed training methods by 35% by communicating 64x-512x less, and its training time decreases as more compute becomes available.
  • Photon introduces the first complete system for federated LLM training, utilizing cross-silo FL for global-scale training with minimal communication overheads, enabling the training of decoder-only LLMs from scratch while minimizing bandwidth demands.
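
The key points above mention federated averaging with small client batch sizes and high learning rates. The following is a minimal sketch of plain federated averaging on a toy one-parameter regression problem, not Photon's implementation: the function names (`local_sgd`, `fedavg`, `make_client`), the toy model, and every hyperparameter value are assumptions chosen for illustration. The two clients draw inputs from different ranges to mimic data heterogeneity.

```python
import random


def local_sgd(w, data, lr, batch_size, steps, rng):
    """Run a few SGD steps on one client's data for the toy model y = w * x."""
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # Gradient of the mean of 0.5 * (w * x - y)^2 with respect to w.
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w


def fedavg(clients, rounds, lr, batch_size, local_steps, seed=0):
    """Federated averaging: clients train locally, the server averages models."""
    rng = random.Random(seed)
    w_global = 0.0
    total = sum(len(d) for d in clients)
    for _ in range(rounds):
        # Each client starts from the current global model and trains locally.
        local_models = [
            local_sgd(w_global, d, lr, batch_size, local_steps, rng)
            for d in clients
        ]
        # One communication per round: average weighted by client dataset size.
        w_global = sum(len(d) * w for d, w in zip(clients, local_models)) / total
    return w_global


def make_client(x_low, x_high, n, rng, true_w=3.0):
    """Hypothetical client dataset: noise-free samples of y = true_w * x."""
    return [(x, true_w * x) for x in (rng.uniform(x_low, x_high) for _ in range(n))]


data_rng = random.Random(1)
# Heterogeneous clients: each one sees a different input range.
clients = [make_client(0.0, 1.0, 50, data_rng), make_client(1.0, 2.0, 50, data_rng)]
w = fedavg(clients, rounds=30, lr=0.2, batch_size=4, local_steps=5)
print(f"learned w = {w:.4f} (target 3.0)")
```

In this sketch the clients exchange a model only once per round, after `local_steps` local updates, which is where the communication savings of federated averaging over per-step synchronization come from.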

Authors: Lorenzo Sani, Alex Iacob, Zeyu Cao, Royson Lee, Bill Marino, Yan Gao, Dongqi Cai, Zexi Li, Wanru Zhao, Xinchi Qiu, Nicholas D. Lane

13 pages, 9 appendix pages, 10 figures, 3 algorithms, 8 tables

Abstract: Scaling large language models (LLMs) demands extensive data and computing resources, which are traditionally constrained to data centers by the high-bandwidth requirements of distributed training. Low-bandwidth methods like federated learning (FL) could enable collaborative training of larger models across weakly-connected GPUs if they can effectively be used for pre-training. To achieve this, we introduce Photon, the first complete system for federated end-to-end LLM training, leveraging cross-silo FL for global-scale training with minimal communication overheads. Using Photon, we train the first federated family of decoder-only LLMs from scratch. We show that: (1) Photon can train model sizes up to 7B in a federated fashion while reaching an even better perplexity than centralized pre-training; (2) Photon model training time decreases with available compute, achieving a similar compute-time trade-off to centralized; and (3) Photon outperforms the wall-time of baseline distributed training methods by 35% via communicating 64x-512x less. Our proposal is robust to data heterogeneity and converges twice as fast as previous methods like DiLoCo. This surprising data efficiency stems from a unique approach combining small client batch sizes with extremely high learning rates, enabled by federated averaging's robustness to hyperparameters. Photon thus represents the first economical system for global internet-wide LLM pre-training.
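
To give a feel for the scale of the "64x-512x less" communication figure, here is a back-of-the-envelope sketch. It rests on assumptions that the abstract does not state: that a baseline synchronizes a model-sized payload every optimizer step, that a federated run synchronizes once every `local_steps` steps, and that parameters are stored in 2 bytes each. The helper names and the 4096-step run length are hypothetical.

```python
def sync_rounds(total_steps, steps_per_sync):
    """Number of synchronizations needed to cover total_steps (ceiling division)."""
    return -(-total_steps // steps_per_sync)


def comm_reduction(total_steps, local_steps):
    """Ratio of per-step synchronizations to once-per-local_steps synchronizations."""
    return sync_rounds(total_steps, 1) / sync_rounds(total_steps, local_steps)


def payload_gb(n_params, bytes_per_param):
    """Size of one model-sized exchange in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumed run length of 4096 optimizer steps (illustrative only).
for local_steps in (64, 512):
    print(local_steps, comm_reduction(4096, local_steps))

# A 7B-parameter model at an assumed 2 bytes per parameter.
print(payload_gb(7e9, 2), "GB per exchange")
```

Under these assumptions the reduction factor equals the number of local steps between synchronizations, so fewer but equally large exchanges are what relieve weakly-connected links.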

Submitted to arXiv on 05 Nov. 2024

Results of the summarizing process for the arXiv paper: 2411.02908v1

This paper's license doesn't allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

Photon: A Revolutionary System for Federated End-to-End Training of Large Language Models

Photon is a system designed to address the challenges of training large language models (LLMs) in a distributed environment. Leveraging low-bandwidth methods like federated learning (FL), Photon enables collaborative training across weakly-connected GPUs, with a focus on pre-training. It is presented as the first complete system for federated end-to-end LLM training.

One key strength of Photon is its robustness to data heterogeneity and its ability to converge twice as fast as previous methods such as DiLoCo. This efficiency comes from a strategy that combines small client batch sizes with exceptionally high learning rates, made possible by the robustness of federated averaging to hyperparameters.

Photon can train model sizes up to 7B in a federated fashion while reaching even better perplexity than centralized pre-training, a milestone toward economical, global, internet-wide pre-training of LLMs. It also outperforms the wall-time of baseline distributed training methods by 35% by communicating 64x-512x less. Model training time decreases as available compute increases, giving a compute-time trade-off similar to that of centralized approaches.

Photon uses cross-silo FL for global-scale training with minimal communication overheads. This allows the training of the first federated family of decoder-only LLMs from scratch, showing that large models can be trained while keeping bandwidth demands low. Traditionally, the data and computing resources needed to scale LLMs have been confined to data centers by the high-bandwidth requirements of distributed training. By relying on low-bandwidth methods like federated learning, Photon instead enables collaborative training across weakly-connected GPUs, which makes it a practical option for researchers and practitioners seeking efficient and scalable LLM training in distributed environments.

Created on 03 Dec. 2024
