Distributed ML Talk @ UC Berkeley

Here's a talk I gave to the Machine Learning @ Berkeley club! We discuss various parallelism strategies used in industry when training large ML models at scale. The animations are from my previous YouTube video about parallelism strategies: https://www.youtube.com/watch?v=xkH8shGffRU

Papers & Resources Mentioned in Talk:
Breadth-First Pipeline Parallelism: https://arxiv.org/abs/2211.05953
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs: https://arxiv.org/abs/2402.15627
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism: https://arxiv.org/abs/1909.08053
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism: https://arxiv.org/abs/1811.06965
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: https://arxiv.org/abs/1910.02054
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel: https://arxiv.org/abs/2304.11277
NeRF-XL: Scaling NeRFs with Multiple GPUs: https://research.nvidia.com/labs/toronto-ai/nerfxl/
Sasha Rush's LLM Training Puzzles: https://github.com/srush/LLM-Training-Puzzles
Diffusion Models Are Real-Time Game Engines: https://gamengen.github.io/

*Please comment down below if I missed any!*

Timestamps:
0:00 - Introduction
0:39 - Scaling Dimensions
2:22 - About Me
3:19 - The GPU & Brief History Overview
6:22 - Matrix Multiplication
8:37 - Motivation for Parallelism
9:55 - Review of Basic Training Loop
11:05 - Data Parallelism
12:52 - NCCL
15:48 - Pipeline Parallelism
19:04 - Tensor Parallelism
20:46 - Back to DDP
22:13 - Adam Optimizer Review
23:11 - FSDP
26:11 - DeepSpeed
27:39 - Next Steps
29:30 - Galvatron Paper
30:24 - More Papers
32:40 - Orthogonal Optimizations
36:40 - How to Stay in Touch
36:55 - Questions
51:47 - Thank You!