Fairness in Serving Large Language Models
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- The paper focuses on ensuring fairness in serving high-demand LLM inference services like ChatGPT and BARD.
- Most major LLM inference services implement request rate limits to prevent one client from dominating the queue, but this can lead to under-utilization of resources and subpar client experiences.
- Serving LLMs introduces complexities due to unpredictable request lengths and unique batching characteristics on parallel accelerators.
- The paper introduces a novel concept of LLM serving fairness based on a cost function considering input and output tokens processed.
- The authors propose a scheduling algorithm called Virtual Token Counter (VTC) for efficient and fair client service.
- Research proves a 2x tight upper bound on service difference between backlogged clients while adhering to work-conserving principles.
- Extensive experiments show superior performance of VTC in ensuring fairness across various conditions, enhancing resource utilization, and improving client experiences.
Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica
Abstract: High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services have request rate limits, to ensure that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of the resources and poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to their unpredictable request lengths and their unique batching characteristics on parallel accelerators. This paper introduces the definition of LLM serving fairness based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism. We prove a 2x tight upper bound on the service difference between two backlogged clients, adhering to the requirement of work-conserving. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, especially in contrast to other baseline methods, which exhibit shortcomings under various conditions.
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
⚠The license of the paper does not allow us to build upon its content and the AI assistant only knows about the paper metadata rather than the full article.
Assess the quality of the AI-generated content by voting
Score: 0
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
Similar papers summarized with our AI tools
Navigate through even more similar papers through a
tree representationLook for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.