Name: Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes - Arpit Singh & Abhijit Paithankar, NVIDIA
Start: 2024-11-15T14:55:00-0700
End: 2024-11-15T15:30:00-0700

In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

Friday November 15, 2024 2:55pm - 3:30pm MST

Salt Palace | Level 2 | 255 E

In K8s based ML platforms, job failures from hardware errors such as GPU malfunctions, network disruptions, ECC errors, and OOM events pose significant challenges. These failures cause resource underutilization, wasted engineering time, and high operational costs, often requiring users to resubmit jobs. Current AI/ML frameworks lack adequate fault tolerance strategies, typically requiring manual intervention and causing delays before jobs can resume. This talk explores fault tolerance strategies including naive job restarts on failure, job restarts with hot spares, and job restarts by replacing faulty nodes. We discuss how to achieve fault propagation by leveraging node and pod conditions and address gaps in fault discovery and error propagation in the existing Kubernetes ecosystem. Our talk will also include ways to enhance components like the node-problem-detector and introduce new elements to close the gaps in fault detection , propagation reaction and remediation.

Speakers

Abhijit Paithankar

Tech Lead and Engineering Manager, NVIDIA

Abhijit Paithankar is the AI and HPC Systems Tech Lead and Engineering Manager at NVIDIA, focusing on advanced computing technologies. Previously, he co-founded Crave.IO and served as CTO, and held key roles at Nutanix and VMware, developing critical hypervisor and storage solutions... Read More →

Arpit Singh (SW-CLOUD) US

Senior Software Engineer, Nvidia

Arpit Singh specializes in AI infrastructure at Nvidia, enhancing deep learning applications. Besides being a Kubernetes contributor, Arpit has 10+ years of experience spanning Nvidia, Nutanix and Cisco. He holds multiple patents (2 granted, 4+ pending) and has dual master's degr... Read More →

Fault Tolerance AI workloads pdf

Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 2 | 255 E

AI + ML

Content Experience Level Beginner

KubeCon + CloudNativeCon North America 2024

Abhijit Paithankar

Arpit Singh (SW-CLOUD) US

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!