PITTI - Article - vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

Artificial Intelligence,Research,Information Processing | Computing

Date : 2023-06-30

Project page abstract

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.

vLLM has been developed at UC Berkeley and deployed at Chatbot Arena and Vicuna Demo for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources.

Project page below links to GitHub repo, documentation and research paper.

Link

How hard does Art need to be ?