EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

AI-generated keywords: EdgeMoE, LLMs, Memory, Compute-I/O Pipeline, Bitwidth Adaptation

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • EdgeMoE is an on-device inference engine for mixture-of-expert (MoE) Large Language Models (LLMs)
  • LLMs have exceptional capabilities in machine learning tasks
  • Transitioning LLMs from data centers to edge devices can enhance privacy and availability
  • Large parameter sizes of LLMs result in impractical runtime costs on edge devices
  • EdgeMoE strategically partitions the model across the storage hierarchy for memory and computational efficiency
  • Non-expert weights are stored in device memory, while expert weights are kept in external storage and fetched into memory only when activated
  • Expert weights are infrequently accessed due to sparse activation patterns
  • EdgeMoE uses expert-wise bitwidth adaptation to reduce expert weight size with an acceptable level of accuracy loss
  • Expert management predicts in advance which experts will be activated and preloads them into the compute-I/O pipeline
  • Empirical evaluations show that EdgeMoE achieves substantial memory savings and performance improvements compared to baseline solutions
  • EdgeMoE provides a specialized on-device inference engine tailored for MoE-based models
  • Strategic partitioning, bitwidth adaptation, and expert preloading improve efficiency and performance while reducing runtime costs
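
The on-demand expert loading described in the key points can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ExpertCache` class, the dictionary standing in for external storage, the least-recently-used eviction rule, and the hard-coded list of activated experts. The paper metadata does not say how EdgeMoE decides which experts stay in memory.

```python
from collections import OrderedDict


class ExpertCache:
    """Keeps at most `capacity` experts in memory; the rest stay in storage."""

    def __init__(self, storage, capacity):
        self.storage = storage         # stand-in for flash/disk: expert_id -> weights
        self.capacity = capacity       # memory budget, counted in experts
        self.resident = OrderedDict()  # experts currently in memory, oldest first
        self.loads = 0                 # number of simulated storage reads

    def get(self, expert_id):
        if expert_id in self.resident:
            # Hit: mark the expert as most recently used.
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        # Miss: fetch the expert only now that it has been activated.
        weights = self.storage[expert_id]
        self.loads += 1
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            # Assumed policy: evict the least recently used expert.
            self.resident.popitem(last=False)
        return weights


# Hypothetical toy model: 8 experts with 2 scalar "weights" each.
storage = {i: [float(i), float(i) + 0.5] for i in range(8)}
cache = ExpertCache(storage, capacity=2)

# Assumed router output; a real MoE layer computes this per token.
activated = [3, 3, 5, 3, 1, 5]
for expert_id in activated:
    cache.get(expert_id)

print("storage reads:", cache.loads)
print("experts in memory:", list(cache.resident))
```

Only the activated experts are ever read from storage, and memory use is bounded by `capacity` rather than by the total number of experts, which is the property the key points attribute to sparse activation.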

Authors: Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, Mengwei Xu

Abstract: Large Language Models (LLMs) such as GPTs and LLaMa have ushered in a revolution in machine intelligence, owing to their exceptional capabilities in a wide range of machine learning tasks. However, the transition of LLMs from data centers to edge devices presents a set of challenges and opportunities. While this shift can enhance privacy and availability, it is hampered by the enormous parameter sizes of these models, leading to impractical runtime costs. In light of these considerations, we introduce EdgeMoE, the first on-device inference engine tailored for mixture-of-expert (MoE) LLMs, a popular variant of sparse LLMs that exhibit nearly constant computational complexity as their parameter size scales. EdgeMoE achieves both memory and computational efficiency by strategically partitioning the model across the storage hierarchy. Specifically, non-expert weights are stored in the device's memory, while expert weights are kept in external storage and are fetched into memory only when they are activated. This design is underpinned by a crucial insight that expert weights, though voluminous, are infrequently accessed due to sparse activation patterns. To further mitigate the overhead associated with expert I/O swapping, EdgeMoE incorporates two innovative techniques: (1) Expert-wise bitwidth adaptation: This method reduces the size of expert weights with an acceptable level of accuracy loss. (2) Expert management: It predicts the experts that will be activated in advance and preloads them into the compute-I/O pipeline, thus further optimizing the process. In empirical evaluations conducted on well-established MoE LLMs and various edge devices, EdgeMoE demonstrates substantial memory savings and performance improvements when compared to competitive baseline solutions.
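
To make the idea of expert-wise bitwidth adaptation concrete, here is a minimal sketch that stores each expert at its own bitwidth. It assumes plain uniform symmetric quantization, and the expert names, weight values, and per-expert bitwidths are hypothetical. The abstract does not state which quantization scheme EdgeMoE uses or how it chooses a bitwidth for each expert.

```python
def quantize(weights, bits):
    """Uniform symmetric quantization to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def size_bytes(num_weights, bits):
    """Storage needed for the codes alone, ignoring the per-expert scale."""
    return num_weights * bits / 8


# Hypothetical experts and bitwidths, chosen only for illustration.
experts = {
    "expert_0": [0.12, -0.80, 0.33, 0.05],
    "expert_1": [1.50, -0.25, 0.75, -1.10],
}
bitwidths = {"expert_0": 8, "expert_1": 4}

for name, weights in experts.items():
    bits = bitwidths[name]
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(name, f"{bits}-bit", size_bytes(len(weights), bits), "bytes",
          "max error", round(max_err, 4))
```

A lower bitwidth shrinks the bytes that must be read from storage when the expert is activated, at the cost of a larger rounding error, which is the trade-off the abstract describes as an acceptable level of accuracy loss.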

Submitted to arXiv on 28 Aug. 2023

Results of the summarizing process for the arXiv paper: 2308.14352v1

EdgeMoE is an on-device inference engine designed for mixture-of-expert (MoE) Large Language Models (LLMs), which are known for their exceptional capabilities in various machine learning tasks. Moving LLMs from data centers to edge devices can enhance privacy and availability, but the large parameter sizes of these models lead to impractical runtime costs.

To address this, EdgeMoE partitions the model across the storage hierarchy to achieve memory and computational efficiency. Non-expert weights are stored in the device's memory, while expert weights are kept in external storage and fetched into memory only when they are activated. This design rests on the insight that expert weights, although voluminous, are infrequently accessed due to sparse activation patterns. EdgeMoE adds two techniques to mitigate the overhead of expert I/O swapping: expert-wise bitwidth adaptation, which reduces the size of expert weights with an acceptable level of accuracy loss, and expert management, which predicts in advance which experts will be activated and preloads them into the compute-I/O pipeline.

Empirical evaluations on well-established MoE LLMs and various edge devices show that EdgeMoE achieves substantial memory savings and performance improvements compared to competitive baseline solutions. Overall, EdgeMoE addresses the challenges of deploying LLMs on edge devices with a specialized on-device inference engine tailored for MoE-based models.
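
The benefit of predicting experts and preloading them into a compute-I/O pipeline can be seen in a simple latency model. This is a back-of-the-envelope sketch, not the paper's method: the function names, the per-layer timings, and the prediction outcomes are all hypothetical, and the model ignores the I/O bandwidth wasted on a wrongly preloaded expert.

```python
def sequential_latency(compute_ms, load_ms, num_layers):
    """Load each layer's expert on demand, then compute."""
    return num_layers * (load_ms + compute_ms)


def pipelined_latency(compute_ms, load_ms, predicted_ok):
    """Overlap the next layer's expert load with the current layer's compute.

    predicted_ok[i] is True when the expert for layer i was predicted
    correctly, so its load started while layer i-1 was computing.
    """
    total = 0.0
    for i, ok in enumerate(predicted_ok):
        if i > 0 and ok:
            # The load ran in the background during the previous compute;
            # only the part that did not fit is exposed as a stall.
            stall = max(0.0, load_ms - compute_ms)
        else:
            # First layer or misprediction: the load is fully exposed.
            stall = load_ms
        total += stall + compute_ms
    return total


# Hypothetical numbers: 4 ms compute and 6 ms expert load per layer.
compute_ms, load_ms = 4.0, 6.0
predicted_ok = [True, True, False, True]

baseline = sequential_latency(compute_ms, load_ms, len(predicted_ok))
pipelined = pipelined_latency(compute_ms, load_ms, predicted_ok)
print("on-demand loading:", baseline, "ms")
print("predict and preload:", pipelined, "ms")
```

In this model every correct prediction hides up to one layer's compute time worth of I/O, so the gain grows with prediction accuracy and shrinks when loads are much slower than compute. Smaller expert weights shorten `load_ms`, which is where bitwidth adaptation and preloading complement each other.
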
Created on 21 Oct. 2023
