Speculative Decoding and Efficient LLM Inference with Chris Lott - 717

Today, we're joined by Chris Lott, senior director of engineering at Qualcomm AI Research, to discuss accelerating large language model inference. We explore the challenges presented by LLM encoding and decoding (aka generation), and how these phases interact with hardware constraints such as FLOPS, memory footprint, and memory bandwidth to limit key inference metrics such as time-to-first-token, tokens per second, and tokens per joule. We then dig into a variety of techniques that can be used to accelerate inference, such as KV compression, quantization, pruning, speculative decoding, and leveraging small language models (SLMs). We also discuss future directions for enabling on-device agentic experiences, such as parallel generation and software tools like Qualcomm AI Orchestrator. (A minimal code sketch of speculative decoding appears at the end of these notes.)

🎧 / 🎥 Listen or watch the full episode on our page: https://twimlai.com/go/717

🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1

🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/

📖 CHAPTERS
===============================
00:00 - Introduction
03:54 - LLMs on the edge
05:47 - The relationship of databases and models in personalization
07:11 - Latency
11:18 - Device constraints
16:42 - Encoding vs. decoding and LLM metrics
19:14 - Optimizing LLMs for edge deployment
25:36 - SLMs
32:39 - KV caches
39:36 - KV compression and model architectures
47:16 - Hybrid AI
50:58 - Speculative decoding
1:06:01 - Self-speculative decoding
1:08:55 - Reasoning models
1:12:02 - Inference scaling
1:14:19 - Future directions

🔗 LINKS & RESOURCES
===============================
Why Qualcomm AI Orchestrator is the key to next generation AI experiences - https://www.qualcomm.com/news/onq/2024/10/why-qualcomm-ai-orchestrator-is-key-to-next-gen-ai-experiences
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement - https://arxiv.org/abs/2402.14160
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
On Speculative Decoding for Multimodal Large Language Models - https://arxiv.org/abs/2404.08856

📸 Camera: https://amzn.to/3TQ3zsg
🎙️ Microphone: https://amzn.to/3t5zXeV
🚦 Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5
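
🧪 BONUS: SPECULATIVE DECODING SKETCH
===============================
For readers who want a concrete picture of the technique in the episode title, below is a minimal, illustrative Python sketch of greedy speculative decoding: a cheap draft model proposes k tokens, the expensive target model verifies them (in a real system all k positions are scored in a single target forward pass, which is what recovers the memory-bandwidth cost of token-by-token decoding), and the longest agreeing prefix is accepted. The draft_next/target_next callables and the toy demo models are hypothetical stand-ins for this sketch, not Qualcomm's implementation or any specific library API.

from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model (hypothetical stand-in)
    target_next: Callable[[List[int]], int],  # expensive target model (hypothetical stand-in)
    k: int = 4,                               # draft tokens proposed per verification step
    max_new_tokens: int = 32,
) -> List[int]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1) Draft k candidate tokens autoregressively with the cheap model.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2) Verify with the target model. In a real system all k positions are
        #    scored in ONE target forward pass; here we simulate it per position.
        accepted: List[int] = []
        correction = None
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(expected)
            else:
                correction = expected  # first mismatch: keep the target's token
                break

        tokens.extend(accepted)
        generated += len(accepted)
        if correction is not None:
            tokens.append(correction)
        else:
            # All k drafts accepted: the same verification pass yields a bonus token.
            tokens.append(target_next(tokens))
        generated += 1

    return tokens[: len(prompt) + max_new_tokens]

# Toy demo: the target model cycles 0..9; the draft agrees except when the
# context length is 3 mod 5, so most drafted tokens are accepted "for free".
target = lambda seq: (seq[-1] + 1) % 10
draft = lambda seq: (seq[-1] + 2) % 10 if len(seq) % 5 == 3 else (seq[-1] + 1) % 10
print(speculative_decode([0], draft, target, k=4, max_new_tokens=12))

Because every verification step emits at least one target-approved token (a correction or a bonus token), the output matches what the target model alone would produce; the speedup comes entirely from how many draft tokens are accepted per target pass.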