Scaling Ray to 10K NPUs: Huawei's Hyperscale Journey | Ray Summit 2024

Huawei's project to integrate 10,000 Ascend NPUs into a single Ray cluster pushes the boundaries of distributed computing. In this technical deep dive, Boyuan Chen, Chong Yin Tan, and Xiaoshuang Liu from Huawei share their experiences and innovations in building a hyperscale Ray-NPU infrastructure.

The presenters detail the challenges of migrating existing business cases to Ray and of adding support for Huawei Ascend NPUs. They introduce a novel full-stack Ray observability engine, which was crucial for debugging and optimizing their massive cluster. The talk covers key achievements, including seamless scheduling of NPU and GPU tasks within the same cluster and the migration of a hyperscale inference pipeline to Ray.

Attendees will learn about Huawei's strategies for maximizing resource utilization and improving stability in large-scale deployments, with insights for organizations looking to scale their own AI infrastructure.

Interested in more?
- Watch the full Day 1 Keynote: https://youtu.be/jwZHJthQvXo
- Watch the full Day 2 Keynote: https://youtu.be/Lury2ad6KG8
- Check out the Ray Summit Breakout sessions: https://youtube.com/playlist?list=PLzTswPQNepXntmT8jr9WaNfqQ60QwW7-U&si=qPw-_SxT9lVmbRGE

🔗 Connect with us:
- Subscribe to our YouTube channel: https://www.youtube.com/@anyscale
- Twitter: https://x.com/anyscalecompute
- LinkedIn: https://linkedin.com/company/joinanyscale/
- Website: https://www.anyscale.com