Hardware-aware Algorithms for Sequence Modeling - Tri Dao | Stanford MLSys #87

Episode 87 of the Stanford MLSys Seminar Series!

Hardware-aware Algorithms for Sequence Modeling
Speaker: Tri Dao

Abstract: Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. In the first half, we describe attention approximation algorithms using sparsity and low-rank structures, as well as algorithms (e.g. FlashAttention) to achieve fast and memory-efficient exact attention. By making attention algorithms IO-aware (accounting for reads and writes between levels of GPU memory), one can speed up attention by 4-8x, enabling 4-16x longer context in Transformers and yielding higher-quality models. We will also describe optimizations for long-context LLM inference, leading to 2-8x faster end-to-end inference time.

In the second half, we describe recent progress on subquadratic-time architectures such as RNNs, gated convolutions, and structured state space models (SSMs). We identify that a key weakness of such models is their inability to perform content-based reasoning, and propose a selection mechanism to address this shortcoming. Though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture (Mamba) without attention or even MLP blocks. Mamba matches or exceeds the performance of strong modern Transformers on language modeling.

Bio: Tri Dao is an incoming Assistant Professor at Princeton University and is currently Chief Scientist of Together AI. He completed his PhD in Computer Science at Stanford, co-advised by Christopher Ré and Stefano Ermon. He works at the intersection of machine learning and systems, and his research interests include sequence models with long-range memory and structured matrices for compact deep learning models. His work received the ICML 2022 Outstanding Paper Runner-Up Award.

--
Stanford MLSys Seminar hosts: Avanika Narayan, Benjamin Spector, Michael Zhang
Twitter:
https://twitter.com/Avanika15
https://twitter.com/bfspector
https://twitter.com/mzhangio

--
Check out our website for the schedule: http://mlsys.stanford.edu
Join our mailing list to get weekly updates: https://groups.google.com/forum/#!forum/stanford-mlsys-seminars/join

#machinelearning #ai #artificialintelligence #systems #mlsys #computerscience #stanford
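As a rough illustration of the two regimes the abstract contrasts, the sketch below compares exact softmax attention, whose (N x N) score matrix makes time and memory quadratic in sequence length N, with a toy input-dependent ("selective") state-space recurrence that runs in linear time with a fixed-size state. This is a minimal sketch under assumed shapes; the parameter names (`W_B`, `W_C`), the tanh selection rule, and the diagonal dynamics are illustrative choices, not the actual FlashAttention or Mamba implementations.

```python
# Illustrative sketch only: quadratic exact attention vs. a toy linear-time
# selective state-space recurrence. Names and shapes are assumptions.
import numpy as np

def exact_attention(q, k, v):
    """Standard softmax attention: materializes an (N, N) score matrix,
    so time and memory grow quadratically with sequence length N."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (N, N): quadratic in N
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                              # (N, d)

def toy_selective_scan(x, A, W_B, W_C):
    """Toy selective SSM in recurrent mode: B_t and C_t depend on the current
    input (the 'selection' idea), so the model can choose what to write into
    and read from its state. Cost is O(N) time with an O(n * d) state."""
    N, d = x.shape
    n = A.shape[0]                                  # state dimension
    h = np.zeros((n, d))                            # one state per channel
    y = np.zeros_like(x)
    for t in range(N):
        B_t = np.tanh(W_B @ x[t])                   # input-dependent B_t, shape (n,)
        C_t = np.tanh(W_C @ x[t])                   # input-dependent C_t, shape (n,)
        h = A[:, None] * h + np.outer(B_t, x[t])    # diagonal state update
        y[t] = C_t @ h                              # readout, shape (d,)
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d, n = 512, 64, 16
    q, k, v = rng.standard_normal((3, N, d))
    x = rng.standard_normal((N, d))
    A = np.exp(-np.abs(rng.standard_normal(n)) * 0.1)   # decay factors in (0, 1)
    W_B, W_C = rng.standard_normal((2, n, d)) / np.sqrt(d)
    print(exact_attention(q, k, v).shape)                # (512, 64)
    print(toy_selective_scan(x, A, W_B, W_C).shape)      # (512, 64)
```

The real kernels discussed in the talk go further: FlashAttention tiles the attention computation to minimize reads and writes between GPU memory levels, and Mamba's selective scan is parallelized with a hardware-aware algorithm rather than the sequential Python loop shown here.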