Experience Operating Large GPU Clusters at Organizational Scale

Speakers: Vikas Mehta, Bugra Gedik, Mohamed Fawzy, & Vipin Sirohi from NVIDIA 

We outline NVIDIA's experience managing a large-scale internal GPU compute platform that spans multiple heterogeneous clusters. The platform supports thousands of users and hundreds of project accounts, handling a diverse mix of training and batch inference workloads across various research fields. We focus on three key challenges: researcher productivity, resource utilization, and operational efficiency.

To improve researcher productivity, we emphasize fair scheduling and workload resilience. For resource utilization, we discuss strategies for maintaining high occupancy. On the operational efficiency front, we highlight our scheduler simulation capabilities, which enable safe testing of changes without affecting production workloads. The presentation concludes with key lessons learned and our vision for future improvements.

Upcoming Events for 2025:
AI & Data - June 25, 2025
Networking - August 13, 2025
Product - October 22, 2025

Learn more about the @Scale conference here: https://atscaleconference.com/

@Scale is a technical conference series for engineers who build or maintain systems designed for scale. New for 2025, in-person and virtual attendance options will be available at all four of our programs, which will bring together complementary themes to create event communities to spark cross-discipline collaboration.
