Accelerating LLM Inference with vLLM (and SGLang) - Ion Stoica

About the seminar: https://faster-llms.vercel.app
Speaker: Ion Stoica (Berkeley & Anyscale & Databricks)
Title: Accelerating LLM Inference with vLLM (and SGLang)
Abstract: Inference efficiency remains a critical challenge for deploying large language models (LLMs) at scale. In this talk, I will present the work on LLM inference that we have conducted at Berkeley over the past two years in the context of vLLM and SGLang, which are today the two most popular open-source inference engines. In particular, I will describe two of the key techniques they introduced, PagedAttention and RadixAttention, which are now used by the majority of LLM inference engines. Finally, I will discuss the new architecture of vLLM.
Recorded on Mar 4, 2025.
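For readers unfamiliar with PagedAttention, the core idea is to manage the KV cache the way an operating system manages virtual memory: each sequence's cache is split into fixed-size blocks that are allocated on demand and mapped through a per-sequence block table, instead of reserving one large contiguous region per request. The toy Python sketch below illustrates only that bookkeeping; the names (KVCacheManager, BLOCK_SIZE) and sizes are illustrative assumptions, not vLLM's actual API.

    # Toy sketch of PagedAttention-style block-table bookkeeping.
    # Illustrative only; not vLLM's real data structures.

    BLOCK_SIZE = 4  # tokens per physical KV-cache block (assumed value)

    class KVCacheManager:
        """Maps each sequence's logical token positions to physical
        blocks, allocating blocks on demand rather than up front."""

        def __init__(self, num_physical_blocks: int):
            self.free_blocks = list(range(num_physical_blocks))
            self.block_tables = {}  # seq_id -> list of physical block ids
            self.seq_lens = {}      # seq_id -> tokens cached so far

        def append_token(self, seq_id: int):
            """Reserve cache space for one new token of a sequence;
            return (physical block id, slot within that block)."""
            table = self.block_tables.setdefault(seq_id, [])
            n = self.seq_lens.get(seq_id, 0)
            if n % BLOCK_SIZE == 0:  # last block full: grab a new one
                if not self.free_blocks:
                    raise MemoryError("KV cache exhausted")
                table.append(self.free_blocks.pop())
            self.seq_lens[seq_id] = n + 1
            return table[-1], n % BLOCK_SIZE

        def free(self, seq_id: int):
            """Return all of a finished sequence's blocks to the pool."""
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))
            self.seq_lens.pop(seq_id, None)

    mgr = KVCacheManager(num_physical_blocks=8)
    for t in range(6):
        block, slot = mgr.append_token(seq_id=0)
        print(f"token {t} -> block {block}, slot {slot}")
    mgr.free(0)

Because blocks are small and allocated lazily, memory fragmentation is minimal and freed blocks can be reused immediately by other requests; RadixAttention builds on a related observation, sharing cached prefixes across requests via a radix tree rather than recomputing them.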