Name: Detecting and Overcoming GPU Failures During ML Training - Sarah Belghiti, Wayve & Ganeshkumar Ashokavardhanan, Microsoft
Start: 2024-11-13T17:25:00-0700
End: 2024-11-13T18:00:00-0700

In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

Wednesday November 13, 2024 5:25pm - 6:00pm MST

Salt Palace | Level 1 | 155 EF

Scaling ML training demands powerful GPU infrastructure, and as model sizes and training scale increases, GPU failures become an expensive risk. From outright hardware faults to subtle performance degradation, undetected GPU problems can sabotage training jobs, inflating costs and slowing development. This talk dives into GPU failure challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues, and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate GPU failure fallout. Drawing on cloud provider and autonomous vehicle company experience, we will share best practices for efficient identification, remediation, and prevention of GPU failures. We will also explore cutting-edge ideas like CRIU and task pre-emption for GPU workloads.

Speakers

Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft

Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine... Read More →

Sarah Belghiti

ML Platform Engineer, Wayve

Sarah Belghiti is an ML Platform Engineer at Wayve, a leading developer of embodied intelligence for autonomous vehicles. She works on the infrastructure, scheduling and monitoring of ML workloads. With GPUs becoming an increasingly scarce resource, her focus has been on building... Read More →

Wednesday November 13, 2024 5:25pm - 6:00pm MST
Salt Palace | Level 1 | 155 EF

AI + ML

Content Experience Level Intermediate

KubeCon + CloudNativeCon North America 2024

Ganeshkumar Ashokavardhanan

Sarah Belghiti

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!