Challenges with Ultra-low Latency LLM Inference at Scale | Haytham Abuelfutuh
In this talk, we will discuss the challenges of running ultra-low latency Large Language Model (LLM) inference at scale. We will cover the unique challenges of LLM inference, such as large model sizes and KV caching.
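As background for the KV caching challenge mentioned above, here is a minimal illustrative sketch (the class, names, and shapes are assumptions for explanation, not the speaker's implementation): during decoding, the keys and values of all previous tokens are cached so each new token only needs its own projections, at the cost of memory that grows with sequence length.

```python
import numpy as np

class KVCache:
    """Stores past attention keys/values so each decode step only computes
    projections for the newest token instead of reprocessing the whole prefix."""

    def __init__(self, head_dim: int):
        self.keys = np.zeros((0, head_dim))    # (seq_len, head_dim)
        self.values = np.zeros((0, head_dim))  # grows with every generated token

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # k, v: (1, head_dim) projections for the newly generated token only.
        self.keys = np.concatenate([self.keys, k], axis=0)
        self.values = np.concatenate([self.values, v], axis=0)

    def attend(self, q: np.ndarray) -> np.ndarray:
        # q: (1, head_dim) query for the current token; attend over cached history.
        scores = q @ self.keys.T / np.sqrt(self.keys.shape[1])  # (1, seq_len)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values                             # (1, head_dim)
```

The cache avoids quadratic recomputation per step, but its memory footprint scales with concurrent requests and context lengths, which is one reason serving large models at low latency is hard.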
We will also discuss the challenges of scaling LLM inference to handle large volumes of requests, including hardware requirements, efficient scale-up, and new routing architectures; a sketch of the routing idea follows below.
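The following is a hypothetical sketch of cache-aware request routing, one plausible example of what such routing architectures address (the replica names, thresholds, and hashing scheme are assumptions, not the talk's actual design): requests sharing a prompt prefix are steered to the same replica so its KV cache can be reused, spilling over to the least-loaded replica when that one is hot.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]   # hypothetical serving fleet
load = {r: 0 for r in REPLICAS}                      # in-flight request count per replica

def route(prompt: str, prefix_chars: int = 64, slack: int = 4) -> str:
    # Hash the prompt prefix so requests with shared prefixes land on the same replica.
    prefix = prompt[:prefix_chars]
    digest = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
    preferred = REPLICAS[digest % len(REPLICAS)]
    # Fall back to the least-loaded replica if the preferred one is overloaded.
    target = preferred if load[preferred] <= min(load.values()) + slack else min(load, key=load.get)
    load[target] += 1
    return target
```

The trade-off illustrated here is typical: affinity routing improves cache hit rates and latency, while the load-based fallback protects tail latency when traffic is skewed toward a few popular prefixes.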
Finally, we will present some of our recent work on addressing these challenges, including our development of inference infrastructure at Union.
Upcoming Events for 2025:
AI & Data - June 25, 2025
Networking - August 13, 2025
Product - October 22, 2025
Learn more about the @Scale conference here: https://atscaleconference.com/
@Scale is a technical conference series for engineers who build or maintain systems designed for scale. New for 2025, in-person and virtual attendance options will be available at all four of our programs, which bring together complementary themes to create event communities and spark cross-discipline collaboration.