Name: Building Resilience for Large-Scale AI Training: GPU Management, Failure Detection, and Beyond - Ganeshkumar Ashokavardhanan, Microsoft & Ace Eldeib, Cohere
Start: 2024-11-13T17:25:00-0700
End: 2024-11-13T18:00:00-0700

In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

Wednesday November 13, 2024 5:25pm - 6:00pm MST

Salt Palace | Level 1 | 155 E

As AI training scales to thousands of GPUs across hundreds of machines, hardware failure becomes an expensive risk. From GPU faults to network performance degradation, undetected problems can sabotage training jobs, inflating costs, and slowing development. This talk dives into failure and orchestration challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues, and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate GPU failure fallout. Drawing on experience from cloud providers and training large language models, we will share best practices for efficient identification, remediation, and prevention of GPU failures.

Speakers

Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft

Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine... Read More →

Ace Eldeib

Staff Software Engineer, Cohere

Ace is a Staff Software Engineer at Cohere working on training and serving infrastructure for large language models. Prior to that, he worked on Azure Kubernetes service and ran self-managed Kubernetes for other Azure services.

Upload Kubecon Resilience for AI Training pdf

Wednesday November 13, 2024 5:25pm - 6:00pm MST
Salt Palace | Level 1 | 155 E

AI + ML

Content Experience Level Intermediate

KubeCon + CloudNativeCon North America 2024

Ganeshkumar Ashokavardhanan

Ace Eldeib

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!