FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

AI-generated keywords: Language Models, Efficiency, Throughput, Hardware Configurations, Processing Efficiency

AI-generated Key Points

  • Authors propose ways to improve the efficiency and throughput of serving large language models (LLMs), whose serving cost is high
  • Introduce use of multiple out-of-chassis remote CPUs for KV-cache and related computations, scaling up memory capacity and bandwidth to better utilize GPU resources
  • Develop sequence-level load-stabilizing schedule to minimize idling and optimize workload distribution between CPUs and GPUs
  • Adopt model-guided approach to orchestrate CPU-GPU interactions, with aggregated memory bandwidth as key metric in selecting CPUs for optimal performance
  • Contributions include novel decomposition of auto-regressive transformer models using near-memory processing over KV-cache with out-of-chassis CPUs for increased throughput
  • Sequence-level pipeline schedule balances workload variations in token generation using LLMs
  • Create performance model that provides optimal hardware configurations based on different model requirements
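The load-stabilizing idea in the points above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's scheduler: it assumes the KV-cache work in one decoding step is proportional to the total number of tokens currently cached, that every sequence generates the same number of tokens, and that a finished sequence is replaced immediately. The function names and the numbers (8 sequences, 8 tokens each) are hypothetical.

```python
def total_cached_tokens(start_steps, gen_len, step):
    """Total tokens held in the cache at `step` across all active sequences."""
    total = 0
    for start in start_steps:
        age = step - start
        if 0 <= age < gen_len:      # sequence is active at this step
            total += age + 1        # it has generated age + 1 tokens so far
    return total


def load_trace(start_steps, gen_len, horizon):
    """Per-step cache load for steps 0 .. horizon - 1."""
    return [total_cached_tokens(start_steps, gen_len, t) for t in range(horizon)]


N_SEQS, GEN_LEN, ROUNDS = 8, 8, 3   # illustrative values only
HORIZON = GEN_LEN * ROUNDS

# Schedule A: all sequences start together and are refilled together.
together = [r * GEN_LEN for r in range(ROUNDS) for _ in range(N_SEQS)]
# Schedule B: one sequence starts per step, so sequence ages are spread out.
staggered = [i + r * GEN_LEN for r in range(ROUNDS) for i in range(N_SEQS)]

trace_together = load_trace(together, GEN_LEN, HORIZON)
trace_staggered = load_trace(staggered, GEN_LEN, HORIZON)

print("together : min", min(trace_together), "max", max(trace_together))
print("staggered: steady-state values", set(trace_staggered[GEN_LEN - 1:]))
```

In this toy setting, starting all sequences together makes the per-step load swing between a low and a high value, while staggering the starts keeps it flat once the pipeline has filled, which is the kind of idling reduction the key points describe.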

Authors: Jiaao He, Jidong Zhai

15 pages, 15 figures
License: CC BY-NC-SA 4.0

Abstract: The cost of serving large language models (LLMs) is high, yet expensive and scarce GPUs are used inefficiently when generating tokens sequentially unless the batch of sequences is enlarged. However, the batch size is limited by constantly reused intermediate results, namely the KV-Cache, which occupies too much memory to fit more sequences into a GPU simultaneously. While the KV-Cache could be offloaded to host memory, the CPU-GPU bandwidth is an inevitable bottleneck. We find a way to decompose transformer models into two parts with different characteristics, one of which includes the memory-bound KV-Cache accesses. Our key insight is that the aggregated memory capacity, bandwidth, and computing power of CPUs across multiple nodes is an efficient option for processing this part. The performance improvement comes from reduced data transmission overhead and boosted GPU throughput for processing the other part of the model. Moreover, we address efficiency challenges brought by heterogeneity at both temporal and inter-device scopes using scheduling and performance modeling techniques. Evaluation results show that our system achieves 1.88x - 5.04x the throughput of vLLM when serving modern LLMs with the same GPU.
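To make the batch-size limit described in the abstract concrete, the following is a back-of-the-envelope sketch of how much memory the KV-Cache takes. It assumes standard multi-head attention, where each layer stores one key and one value vector of the hidden size per token, and 16-bit values. The model shape (32 layers, hidden size 4096), the 2048-token sequence length, and the 24 GiB of free GPU memory are illustrative assumptions, not figures from the paper.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(n_layers: int, hidden_size: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes one token adds to the cache: a key and a value vector per layer."""
    return 2 * n_layers * hidden_size * bytes_per_value


def max_batch_size(free_bytes: int, seq_len: int, n_layers: int,
                   hidden_size: int, bytes_per_value: int = 2) -> int:
    """Number of full-length sequences whose KV-Cache fits in `free_bytes`."""
    per_sequence = seq_len * kv_cache_bytes_per_token(
        n_layers, hidden_size, bytes_per_value)
    return free_bytes // per_sequence


# Hypothetical model and hardware numbers, chosen only for illustration.
per_token = kv_cache_bytes_per_token(n_layers=32, hidden_size=4096)
batch = max_batch_size(free_bytes=24 * GIB, seq_len=2048,
                       n_layers=32, hidden_size=4096)

print(f"KV-Cache per token: {per_token / 1024 ** 2:.2f} MiB")
print(f"Sequences that fit in 24 GiB: {batch}")
```

Under these assumptions each 2048-token sequence needs 1 GiB of cache, so only a few dozen sequences fit next to the model weights, which is why the cache size caps the batch size.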

Submitted to arXiv on 18 Mar. 2024


Results of the summarizing process for the arXiv paper: 2403.11421v1

The authors propose solutions to improve the efficiency and throughput of serving large language models (LLMs), whose serving cost is high. They introduce the use of multiple out-of-chassis remote CPUs for the KV-cache and related computations, scaling up memory capacity and bandwidth to better utilize GPU resources. Additionally, they develop a sequence-level load-stabilizing schedule to minimize idling and optimize workload distribution between CPUs and GPUs. A model-guided approach is adopted to orchestrate CPU-GPU interactions, with aggregated memory bandwidth identified as a key metric in selecting CPUs for optimal performance. The contributions include a novel decomposition of auto-regressive transformer models using near-memory processing over the KV-cache with out-of-chassis CPUs for increased throughput. A sequence-level pipeline schedule balances workload variations in token generation. The authors also create a performance model that provides optimal hardware configurations based on different model requirements. The paper covers background information on LLMs and hardware options, the proposed decomposition approach, system design details addressing heterogeneity challenges, implementation specifics, performance comparisons with other systems, analysis of experimental results, discussion of related work, and conclusions. Overall, the work targets serving modern LLMs with affordable GPU resources; the abstract reports 1.88x to 5.04x the throughput of vLLM on the same GPU.
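The role of aggregated memory bandwidth in choosing CPUs can be sketched with a simple bandwidth-bound estimate. This is not the paper's performance model, only a rough illustration under stated assumptions: the CPU part of each decoding step reads the whole KV-cache once, its time is set purely by memory bandwidth, the cache is spread evenly over identical CPUs, and the CPUs should finish no later than the GPU part so that the GPU does not idle. All names and numbers below are hypothetical.

```python
import math


def cpu_step_time(total_kv_bytes: float, n_cpus: int,
                  bandwidth_per_cpu: float) -> float:
    """Seconds to stream the whole KV-cache once over the aggregated bandwidth."""
    return total_kv_bytes / (n_cpus * bandwidth_per_cpu)


def min_cpus_to_match_gpu(total_kv_bytes: float, bandwidth_per_cpu: float,
                          gpu_step_time: float) -> int:
    """Smallest CPU count whose aggregated bandwidth keeps pace with the GPU."""
    required_bandwidth = total_kv_bytes / gpu_step_time
    return math.ceil(required_bandwidth / bandwidth_per_cpu)


# Illustrative inputs: 60 GB of cache, 100 GB/s per CPU, 80 ms GPU step.
TOTAL_KV_BYTES = 60e9
BANDWIDTH_PER_CPU = 100e9
GPU_STEP_TIME = 0.08

n_cpus = min_cpus_to_match_gpu(TOTAL_KV_BYTES, BANDWIDTH_PER_CPU, GPU_STEP_TIME)
step = cpu_step_time(TOTAL_KV_BYTES, n_cpus, BANDWIDTH_PER_CPU)

print(f"CPUs needed: {n_cpus}, CPU step time: {step * 1000:.1f} ms")
```

The design choice here is to size the CPU pool by bandwidth rather than by core count or FLOPs, since the KV-cache part is described as memory-bound; a fuller model would also have to account for network transfer and compute time.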
Created on 01 Jul. 2024

