Efficient Memory Management for Large Language Model Serving with PagedAttention

AI-generated keywords: Efficient Memory Management, Large Language Model Serving, PagedAttention, vLLM, Resource Utilization

AI-generated Key Points

  • Research paper presented at SOSP 2023 by Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica
  • Proposes PagedAttention algorithm for efficient serving of large language models (LLMs)
  • Algorithm inspired by virtual memory and paging techniques in operating systems
  • Implemented in vLLM system to minimize waste in key-value cache (KV cache) memory
  • Enables flexible sharing of KV cache within and across requests to reduce memory usage
  • Achieves near-zero waste in KV cache memory
  • Improves the throughput of popular LLMs by 2-4x at the same level of latency compared to state-of-the-art systems such as FasterTransformer and Orca
  • The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms
  • Source code for vLLM available on GitHub at https://github.com/vllm-project/vllm

Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

SOSP 2023
License: CC BY 4.0

Abstract: High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm.
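
The paging idea in the abstract can be illustrated with a small bookkeeping sketch. This is not vLLM's code or API: the class names (BlockAllocator, Sequence), the block size of 16 tokens, and the method names are all hypothetical, and the sketch tracks only which physical blocks a request owns, not the key/value tensors themselves. It assumes each request keeps a block table mapping logical blocks to physical ones, that blocks are allocated on demand, and that shared blocks are reference-counted and replaced with a private block on write (copy-on-write).

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; an illustrative value, not taken from the paper


class BlockAllocator:
    """Hypothetical pool of fixed-size physical KV blocks with reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        # Another sequence now points at the same physical block.
        self.ref_count[block] += 1

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    """One request: a block table mapping logical blocks to physical blocks."""

    allocator: BlockAllocator
    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # No block yet, or the last block is full: allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: drop our reference to the shared block and
                # take a private one (a real system would also copy the data).
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Share every existing block with the child instead of duplicating it.
        for block in self.block_table:
            self.allocator.share(block)
        return Sequence(self.allocator, list(self.block_table), self.num_tokens)

    def wasted_slots(self) -> int:
        # Only the unfilled tail of the last block is reserved but unused.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens
```

In this sketch a 20-token sequence occupies two blocks, and a forked copy shares both until it writes into the partially filled last block, at which point only that one block is replaced. The unused space per sequence is bounded by one block, which is one way to read the "near-zero waste" claim.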

Submitted to arXiv on 12 Sep. 2023


Results of the summarizing process for the arXiv paper: 2309.06180v1

Efficient Memory Management for Large Language Model Serving with PagedAttention is a research paper presented at SOSP 2023 by Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. The paper addresses the challenge of high-throughput serving of large language models (LLMs) by proposing PagedAttention, an attention algorithm inspired by the virtual memory and paging techniques used in operating systems. The algorithm is implemented in vLLM, an LLM serving system that minimizes waste in key-value cache (KV cache) memory and enables flexible sharing of the KV cache within and across requests to reduce memory usage.

Existing systems struggle because the KV cache for each request is large and grows and shrinks dynamically. When it is managed inefficiently, memory is lost to fragmentation and redundant duplication, which limits how many requests can be batched at a time. PagedAttention and vLLM address this problem: vLLM achieves near-zero waste in KV cache memory and improves the throughput of popular LLMs by 2-4x at the same level of latency compared with state-of-the-art systems such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. The source code for vLLM is publicly available on GitHub at https://github.com/vllm-project/vllm.
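
To make the link between memory waste and batch size concrete, the following back-of-the-envelope sketch compares two reservation policies. Everything in it is an assumption for illustration: the sequence lengths, the 2048-token maximum, the 16-token block size, and the function names are hypothetical, and the "contiguous" policy (reserving the maximum length up front for every request) is a simplified stand-in for the inefficient management the summary describes, not a measurement of any particular system.

```python
import math


def reserved_contiguous(lengths: list[int], max_len: int) -> int:
    """Token slots reserved when every request pre-reserves max_len slots."""
    return max_len * len(lengths)


def reserved_paged(lengths: list[int], block_size: int) -> int:
    """Token slots reserved when each request holds only whole blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in lengths)


def waste_fraction(reserved: int, used: int) -> float:
    """Share of reserved slots that hold no token."""
    return (reserved - used) / reserved


# Illustrative request lengths in tokens (made-up numbers).
lengths = [100, 700, 1500]
used = sum(lengths)

contiguous = reserved_contiguous(lengths, max_len=2048)
paged = reserved_paged(lengths, block_size=16)

print(f"contiguous: {contiguous} slots, waste {waste_fraction(contiguous, used):.1%}")
print(f"paged:      {paged} slots, waste {waste_fraction(paged, used):.1%}")
```

Under these made-up numbers the block-granular policy reserves well under half the slots of the contiguous one for the same three requests, so the same memory budget could hold more concurrent requests, which is the mechanism behind the throughput gain the paper reports.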
Created on 19 Feb. 2024

