Table of Links
Abstract and 1 Introduction
2 Background and 2.1 Transformer-Based Large Language Models
2.2 LLM Serving & Autoregressive Generation
2.3 Batching Techniques for LLMs
3 Memory Challenges in LLM Serving
3.1 Memory Management in Existing Systems
4 Method and 4.1 PagedAttention
4.2 KV Cache Manager
4.3 Decoding with PagedAttention and vLLM
4.4 Application to Other Decoding Scenarios
4.5 Scheduling and Preemption
4.6 Distributed Execution
5 Implementation
6 Evaluation and 6.1 Experimental Setup
6.2 Basic Sampling
6.3 Parallel Sampling and Beam Search
6.4 Shared Prefix
6.5 Chatbot
7 Ablation Studies
8 Discussion
9 Related Work
10 Conclusion, Acknowledgement and References
6.5 Chatbot
A chatbot [8, 19, 35] is one of the most important applications of LLMs. To implement a chatbot, we let the model generate a response by concatenating the chat history and the user’s last query into a prompt. We synthesize the chat history and user query using the ShareGPT dataset. Due to the limited context length of the OPT-13B model, we truncate the prompt to the last 1024 tokens and let the model generate at most 1024 tokens. We do not store the KV cache between different conversation rounds, as doing so would occupy space needed by other requests between rounds.
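To make the setup concrete, here is a minimal sketch of this chatbot loop. The whitespace tokenizer and the `generate` stub are illustrative stand-ins, not vLLM’s actual API:

```python
# A minimal sketch of the chatbot loop described above. The whitespace
# "tokenizer" and stubbed generate() are placeholders, not vLLM's API.

MAX_PROMPT_TOKENS = 1024   # prompt budget under OPT-13B's context limit
MAX_OUTPUT_TOKENS = 1024   # per-request generation cap

def tokenize(text: str) -> list[str]:
    # Placeholder: a real stack would use the model's own tokenizer.
    return text.split()

def generate(prompt_tokens: list[str], max_new_tokens: int) -> str:
    # Placeholder for a call into the serving engine.
    return f"<response to {len(prompt_tokens)}-token prompt>"

def answer(chat_history: list[str], user_query: str) -> str:
    # Concatenate the chat history and the latest query into one prompt,
    # then keep only the last 1024 tokens to fit the context window.
    tokens = tokenize("\n".join(chat_history + [user_query]))
    prompt = tokens[-MAX_PROMPT_TOKENS:]
    # The KV cache is not kept across conversation rounds, so each round
    # submits a fresh request built from the truncated prompt.
    return generate(prompt, max_new_tokens=MAX_OUTPUT_TOKENS)
```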
Figure 17 shows that vLLM can sustain 2× higher request rates than the three Orca baselines. Since the ShareGPT dataset contains many long conversations, the input prompts for most requests contain 1024 tokens. Due to the buddy allocation algorithm, the Orca baselines reserve space for 1024 output tokens per request, regardless of how they predict the output lengths. For this reason, the three Orca baselines behave similarly. In contrast, vLLM can effectively handle long prompts, as PagedAttention resolves the problem of memory fragmentation and reservation.
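A back-of-the-envelope comparison illustrates the gap: buddy allocation reserves the full 1024 output slots per request up front, whereas paged allocation grows in fixed-size blocks, so waste is bounded by one partially filled block. The sketch below assumes a 16-token block size (the default reported in the paper’s ablation); the helper names are illustrative:

```python
# Illustrative comparison of KV-slot reservation for a single request:
# buddy allocation (Orca baselines) vs. paged allocation (vLLM),
# assuming a 16-token block size.

import math

MAX_OUTPUT_TOKENS = 1024
BLOCK_SIZE = 16  # tokens per KV block

def buddy_reserved(actual_output_len: int) -> int:
    # Orca-style baselines reserve the maximum regardless of actual length.
    return MAX_OUTPUT_TOKENS

def paged_reserved(actual_output_len: int) -> int:
    # vLLM allocates blocks on demand; waste is at most one block.
    return math.ceil(actual_output_len / BLOCK_SIZE) * BLOCK_SIZE

for out_len in (50, 200, 1024):
    print(out_len, buddy_reserved(out_len), paged_reserved(out_len))
# e.g. a 50-token response reserves 1024 slots under buddy allocation,
# but only 64 slots (4 blocks) under paged allocation.
```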
Authors:
(1) Woosuk Kwon, UC Berkeley with Equal contribution;
(2) Zhuohan Li, UC Berkeley with Equal contribution;
(3) Siyuan Zhuang, UC Berkeley;
(4) Ying Sheng, UC Berkeley and Stanford University;
(5) Lianmin Zheng, UC Berkeley;
(6) Cody Hao Yu, Independent Researcher;
(7) Joseph E. Gonzalez, UC Berkeley;
(8) Hao Zhang, UC San Diego;
(9) Ion Stoica, UC Berkeley.