Hardware-aware Algorithms for Sequence Modeling - Tri Dao | Stanford MLSys #87

Episode 87 of the Stanford MLSys Seminar Series!

Hardware-aware Algorithms for Sequence Modeling
Speaker: Tri Dao

Abstract: Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. In the first half, we describe attention approximation algorithms using sparsity and low-rank structures, as well as algorithms (e.g. FlashAttention) to achieve fast and memory-efficient exact attention. By making attention algorithms IO-aware (accounting for reads and writes between levels of GPU memory), one can speed up attention by 4-8x, enabling 4-16x longer context in Transformers and yielding higher-quality models. We will also describe optimizations for long-context LLM inference, leading to 2-8x faster end-to-end inference time.

In the second half, we describe recent progress on subquadratic-time architectures such as RNNs, gated convolutions, and structured state space models (SSMs). We identify that a key weakness of such models is their inability to perform content-based reasoning, and propose a selection mechanism to address this shortcoming. Though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture (Mamba) without attention or even MLP blocks. Mamba matches or exceeds the performance of strong modern Transformers on language modeling.

Bio: Tri Dao is an incoming Assistant Professor at Princeton University and is currently Chief Scientist of Together AI. He completed his PhD in Computer Science at Stanford, co-advised by Christopher Ré and Stefano Ermon. He works at the intersection of machine learning and systems, and his research interests include sequence models with long-range memory and structured matrices for compact deep learning models. His work received the ICML 2022 Outstanding Paper Runner-Up Award.

--
Stanford MLSys Seminar hosts: Avanika Narayan, Benjamin Spector, Michael Zhang
Twitter:
https://twitter.com/Avanika15
https://twitter.com/bfspector
https://twitter.com/mzhangio

--
Check out our website for the schedule: http://mlsys.stanford.edu
Join our mailing list to get weekly updates: https://groups.google.com/forum/#!forum/stanford-mlsys-seminars/join

#machinelearning #ai #artificialintelligence #systems #mlsys #computerscience #stanford
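As a rough illustration of the two regimes the abstract contrasts, the sketch below compares exact softmax attention, whose (N x N) score matrix makes time and memory quadratic in sequence length N, with a toy input-dependent ("selective") state-space recurrence that runs in linear time with a fixed-size state. This is a minimal sketch under assumed shapes; the parameter names (`W_B`, `W_C`), the tanh selection rule, and the diagonal dynamics are illustrative choices, not the actual FlashAttention or Mamba implementations.

```python
# Illustrative sketch only: quadratic exact attention vs. a toy linear-time
# selective state-space recurrence. Names and shapes are assumptions.
import numpy as np

def exact_attention(q, k, v):
    """Standard softmax attention: materializes an (N, N) score matrix,
    so time and memory grow quadratically with sequence length N."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (N, N): quadratic in N
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                              # (N, d)

def toy_selective_scan(x, A, W_B, W_C):
    """Toy selective SSM in recurrent mode: B_t and C_t depend on the current
    input (the 'selection' idea), so the model can choose what to write into
    and read from its state. Cost is O(N) time with an O(n * d) state."""
    N, d = x.shape
    n = A.shape[0]                                  # state dimension
    h = np.zeros((n, d))                            # one state per channel
    y = np.zeros_like(x)
    for t in range(N):
        B_t = np.tanh(W_B @ x[t])                   # input-dependent B_t, shape (n,)
        C_t = np.tanh(W_C @ x[t])                   # input-dependent C_t, shape (n,)
        h = A[:, None] * h + np.outer(B_t, x[t])    # diagonal state update
        y[t] = C_t @ h                              # readout, shape (d,)
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d, n = 512, 64, 16
    q, k, v = rng.standard_normal((3, N, d))
    x = rng.standard_normal((N, d))
    A = np.exp(-np.abs(rng.standard_normal(n)) * 0.1)   # decay factors in (0, 1)
    W_B, W_C = rng.standard_normal((2, n, d)) / np.sqrt(d)
    print(exact_attention(q, k, v).shape)                # (512, 64)
    print(toy_selective_scan(x, A, W_B, W_C).shape)      # (512, 64)
```

The real kernels discussed in the talk go further: FlashAttention tiles the attention computation to minimize reads and writes between GPU memory levels, and Mamba's selective scan is parallelized with a hardware-aware algorithm rather than the sequential Python loop shown here.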