Accelerating LLM Inference with vLLM (and SGLang) - Ion Stoica

About the seminar: https://faster-llms.vercel.app
Speaker: Ion Stoica (Berkeley & Anyscale & Databricks)
Title: Accelerating LLM Inference with vLLM (and SGLang)
Abstract: Inference efficiency remains a critical challenge for deploying large language models (LLMs) at scale. In this talk, I will present the work on LLM inference that we have conducted at Berkeley over the past two years in the context of vLLM and SGLang, which are today the two most popular open-source inference engines. In particular, I will describe two of the key techniques they introduced, PagedAttention and RadixAttention, which are now used by the majority of LLM inference engines. Finally, I will discuss the new architecture of vLLM.
Recorded on Mar 4, 2025.
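For readers unfamiliar with PagedAttention, the core idea is to manage the KV cache the way an operating system manages virtual memory: each sequence's cache is split into fixed-size blocks that are allocated on demand and mapped through a per-sequence block table, instead of reserving one large contiguous region per request. The toy Python sketch below illustrates only that bookkeeping; the names (KVCacheManager, BLOCK_SIZE) and sizes are illustrative assumptions, not vLLM's actual API.

    # Toy sketch of PagedAttention-style block-table bookkeeping.
    # Illustrative only; not vLLM's real data structures.

    BLOCK_SIZE = 4  # tokens per physical KV-cache block (assumed value)

    class KVCacheManager:
        """Maps each sequence's logical token positions to physical
        blocks, allocating blocks on demand rather than up front."""

        def __init__(self, num_physical_blocks: int):
            self.free_blocks = list(range(num_physical_blocks))
            self.block_tables = {}  # seq_id -> list of physical block ids
            self.seq_lens = {}      # seq_id -> tokens cached so far

        def append_token(self, seq_id: int):
            """Reserve cache space for one new token of a sequence;
            return (physical block id, slot within that block)."""
            table = self.block_tables.setdefault(seq_id, [])
            n = self.seq_lens.get(seq_id, 0)
            if n % BLOCK_SIZE == 0:  # last block full: grab a new one
                if not self.free_blocks:
                    raise MemoryError("KV cache exhausted")
                table.append(self.free_blocks.pop())
            self.seq_lens[seq_id] = n + 1
            return table[-1], n % BLOCK_SIZE

        def free(self, seq_id: int):
            """Return all of a finished sequence's blocks to the pool."""
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))
            self.seq_lens.pop(seq_id, None)

    mgr = KVCacheManager(num_physical_blocks=8)
    for t in range(6):
        block, slot = mgr.append_token(seq_id=0)
        print(f"token {t} -> block {block}, slot {slot}")
    mgr.free(0)

Because blocks are small and allocated lazily, memory fragmentation is minimal and freed blocks can be reused immediately by other requests; RadixAttention builds on a related observation, sharing cached prefixes across requests via a radix tree rather than recomputing them.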