In the enterprise AI landscape, balancing speed, cost, and performance is critical. This talk explores the techniques behind Command A's efficient inference pipeline, designed to deliver high-quality results at low cost. We’ll dig into interleaved sliding window attention, which improves both quality and speed, and discuss further optimizations such as speculative decoding, sharing key insights from its training process. Join us to learn how Command A is redefining cost-effective AI for enterprise applications.
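As a taste of the first technique, here is a minimal sketch of interleaved sliding-window attention expressed as per-layer masks. The 3:1 sliding-to-full layer pattern and the window size are illustrative assumptions for this sketch, not Command A's published configuration.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Standard causal mask: position i may attend to every j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask restricted to the most recent `window` positions."""
    mask = causal_mask(seq_len)
    for i in range(seq_len):
        mask[i, : max(0, i - window + 1)] = False  # drop keys outside the window
    return mask

def layer_masks(num_layers: int, seq_len: int, window: int, period: int = 4):
    """Interleave: every `period`-th layer keeps full causal attention;
    the rest use the cheaper sliding-window mask (assumed 3:1 here)."""
    return [
        causal_mask(seq_len) if (layer + 1) % period == 0
        else sliding_window_mask(seq_len, window)
        for layer in range(num_layers)
    ]

masks = layer_masks(num_layers=8, seq_len=16, window=4)
# Far fewer query-key pairs than 8 fully causal layers would attend to.
print(sum(int(m.sum()) for m in masks))
```

Because most layers only attend over a short local window, both the attention compute and the KV cache shrink, while the periodic full-attention layers preserve long-range information flow.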
Chapters
0:00 – Introduction to Command A Inference Optimization
0:55 – Sparse Attention Architecture & Sliding Window
2:21 – Speculative Decoding Overview
4:32 – Using Medusa for Parallel Token Prediction
6:29 – Evaluation and Training with W&B
7:54 – Synthetic vs. Original Data in Speculative Training
9:00 – Final Gains and Performance Tradeoffs
11:44 – Guided Decoding with Speculative Inference
14:29 – Dynamic Guided Decoding and FSM Integration
19:03 – Combining Guided Decoding with Speculative Tokens
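For the speculative decoding chapters (2:21 onward), here is a minimal greedy-verification sketch of the core loop: a cheap draft model proposes k tokens, the large target model scores them in one forward pass, and the longest agreeing prefix is kept. The `draft_next` and `target_greedy` callables are hypothetical stand-ins for real model calls, and production systems typically use probabilistic acceptance rather than this exact-match rule.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],           # cheap draft model: next token id
    target_greedy: Callable[[List[int]], List[int]],  # target model: greedy next token at each position
    k: int = 4,
) -> List[int]:
    # 1) Draft k candidate tokens autoregressively with the cheap model.
    ctx, proposal = list(prefix), []
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) A single target forward pass scores prefix + proposal at once.
    preds = target_greedy(prefix + proposal)

    # 3) Keep drafted tokens while they agree with the target; on the
    #    first mismatch, substitute the target's token, so every step
    #    still emits at least one target-quality token.
    out, pos = [], len(prefix) - 1  # preds[pos] predicts the token after index pos
    for tok in proposal:
        if preds[pos] == tok:
            out.append(tok)
            pos += 1
        else:
            out.append(preds[pos])
            break
    else:
        out.append(preds[pos])  # bonus token when every draft is accepted
    return prefix + out

# Toy demo: both "models" continue an arithmetic sequence, so all
# drafts are accepted and one target pass yields k + 1 new tokens.
draft = lambda toks: toks[-1] + 1
target = lambda toks: [t + 1 for t in toks]
print(speculative_step([1, 2, 3], draft, target, k=4))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The speedup comes from step 2: verifying k drafted tokens costs one target forward pass instead of k sequential ones, and acceptance rate determines how much of that saving is realized.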
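For the guided-decoding chapters (11:44 onward), here is a toy illustration of the core idea: a finite-state machine masks the model's logits so only tokens that keep the output inside a target language can be chosen. The vocabulary, FSM, and logits below are all hypothetical; combining this with speculation, as discussed at 19:03, additionally means checking each drafted token against the FSM before the target model verifies it.

```python
# Vocabulary for the toy example: 0 -> "a", 1 -> "b", 2 -> <eos>
FSM = {
    # state -> {allowed token id: next state}; the language is "a+ b <eos>"
    "start":  {0: "seen_a"},             # must begin with "a"
    "seen_a": {0: "seen_a", 1: "done"},  # more "a"s, or exactly one "b"
    "done":   {2: "accept"},             # then end-of-sequence only
}

def constrained_greedy(logits, state):
    """Greedy pick restricted to tokens the FSM allows in this state."""
    allowed = FSM[state]
    tok = max(allowed, key=lambda t: logits[t])
    return tok, allowed[tok]

state, out = "start", []
for step_logits in [[0.1, 2.0, 0.3], [0.5, 1.5, 0.0], [9.0, 0.0, 0.2]]:
    tok, state = constrained_greedy(step_logits, state)
    out.append(tok)
print(out, state)  # [0, 1, 2] accept -> "ab<eos>"; unconstrained greedy would start with "b"
```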