QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

AI-generated keywords: QMoE, Compression, Trillion-Parameter Models, Mixture-of-Experts, Inference Costs

AI-generated Key Points

  • Authors Elias Frantar and Dan Alistarh introduce the QMoE framework to address high inference costs of large language models (LLMs)
  • Mixture-of-Experts (MoE) architectures use sparse routing to deliver faster and more accurate models, but at the cost of massive parameter counts; QMoE targets the resulting memory problem
  • The framework offers a scalable algorithm that compresses trillion-parameter MoEs to less than 1 bit per parameter in a custom format co-designed with GPU decoding kernels
  • The SwitchTransformer-c2048 model can be compressed to less than 160GB (20x compression, 0.8 bits per parameter) with only minor accuracy loss; the compression takes less than a day on a single GPU
  • Enables execution of trillion-parameter models on affordable commodity hardware, such as a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference
  • Source code and compressed models for QMoE are available on GitHub at github.com/IST-DASLab/qmoe

Authors: Elias Frantar, Dan Alistarh

License: CC BY 4.0

Abstract: Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, like a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. The source code and compressed models are available at github.com/IST-DASLab/qmoe.
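
The figures quoted in the abstract can be cross-checked with a short back-of-the-envelope calculation. The sketch below is only an illustration and is not taken from the paper or its repository. It assumes 16 bits per uncompressed weight (which is what makes 1.6 trillion parameters come to 3.2TB), decimal units (1 GB = 10^9 bytes), and nominal per-GPU memory of 48GB for the A6000 and 24GB for the 3090. All names in the code are hypothetical.

```python
# Back-of-the-envelope check of the sizes quoted in the abstract.
# Assumptions (not stated in the source): 16-bit uncompressed weights,
# decimal gigabytes, and nominal GPU memory of 48GB (A6000) / 24GB (3090).

PARAMS = 1.6e12          # SwitchTransformer-c2048 parameter count (abstract)
BITS_UNCOMPRESSED = 16   # assumed storage width of the uncompressed weights
BITS_COMPRESSED = 0.8    # rounded bits per parameter after QMoE (abstract)


def size_in_gb(num_params, bits_per_param):
    """Return the storage needed for the weights in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


uncompressed_gb = size_in_gb(PARAMS, BITS_UNCOMPRESSED)
compressed_gb = size_in_gb(PARAMS, BITS_COMPRESSED)
compression_ratio = uncompressed_gb / compressed_gb

# Total nominal memory of the two server configurations named in the abstract.
server_memory_gb = {
    "4x NVIDIA A6000": 4 * 48,
    "8x NVIDIA 3090": 8 * 24,
}

if __name__ == "__main__":
    print(f"uncompressed: {uncompressed_gb:.0f} GB")
    print(f"compressed:   {compressed_gb:.0f} GB")
    print(f"ratio:        {compression_ratio:.0f}x")
    for name, capacity in server_memory_gb.items():
        print(f"{name}: {capacity} GB total, weights fit: {compressed_gb < capacity}")
```

With the rounded 0.8 bits per parameter the estimate lands at about 160GB, in line with the "less than 160GB" and 20x figures, and both server configurations offer 192GB of nominal memory in total. The calculation covers weights only; it ignores activations and any other runtime memory.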

Submitted to arXiv on 25 Oct. 2023


Results of the summarizing process for the arXiv paper: 2310.16795v1

In their paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models," Elias Frantar and Dan Alistarh address the memory cost of deploying very large Mixture-of-Experts (MoE) models by introducing a new compression and execution framework called QMoE. MoE architectures reduce the inference cost of large language models (LLMs) through sparse routing, which yields faster and more accurate models, but at the price of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters and requires 3.2TB of accelerator memory to run efficiently, which makes deployment costly and challenging.

QMoE offers a scalable algorithm that accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels. This enables end-to-end compressed inference with minor runtime overhead relative to uncompressed execution. Specifically, the SwitchTransformer-c2048 model can be compressed to less than 160GB (20x compression, 0.8 bits per parameter) with only minor accuracy loss, and the compression itself takes less than a day on a single GPU. As a result, a trillion-parameter model can run on affordable commodity hardware, such as a single server equipped with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. The source code and compressed models for QMoE are available on GitHub at github.com/IST-DASLab/qmoe.

The authors also note the significance of generative large language models for practical language and reasoning tasks, while highlighting their high inference costs as a barrier to widespread deployment. By combining MoE architectures with compression techniques such as those introduced in QMoE, practitioners can overcome the memory limitations and reliability roadblocks associated with trillion-parameter models, paving the way for more efficient and cost-effective use in real-world applications.
Created on 29 Mar. 2024
