Huawei's ambitious project of integrating 10,000 Ascend NPUs into a Ray cluster pushes the boundaries of distributed computing. In this technical deep dive, Boyuan Chen, Chong Yin Tan, and Xiaoshuang Liu from Huawei share their experiences and innovations in creating a hyperscale Ray-NPU infrastructure.
The presenters detail the challenges of migrating existing business cases to Ray and adding support for Huawei Ascend NPUs. They introduce a novel full-stack Ray observability engine, which proved crucial for debugging and optimizing a cluster of this size. The talk covers key achievements, including scheduling NPU and GPU tasks within the same cluster and migrating a hyperscale inference pipeline to Ray. Attendees will learn about Huawei's strategies for maximizing resource utilization and improving stability in large-scale deployments, with lessons for organizations scaling their own AI infrastructure.
--
Interested in more?
- Watch the full Day 1 Keynote:
https://youtu.be/jwZHJthQvXo
- Watch the full Day 2 Keynote:
https://youtu.be/Lury2ad6KG8
- Check out the Ray Summit Breakout sessions:
https://youtube.com/playlist?list=PLzTswPQNepXntmT8jr9WaNfqQ60QwW7-U&si=qPw-_SxT9lVmbRGE
--
🔗 Connect with us:
- Subscribe to our YouTube channel: https://www.youtube.com/@anyscale
- Twitter: https://x.com/anyscalecompute
- LinkedIn: https://linkedin.com/company/joinanyscale/
- Website: https://www.anyscale.com