The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.
Please note: This schedule is displayed in Mountain Standard Time (UTC -7). The schedule is subject to change and session seating is available on a first-come, first-served basis.
Training large language models (LLMs) on GPUs within Kubernetes environments involves significant configuration and complexity, often leading to unique failure scenarios. This presentation covers the lessons learned from training DBRX, a state-of-the-art LLM we developed on a 400-node cluster with a primary workload utilizing 3,072 GPUs, along with the tooling needed to measure and maintain a healthy fleet of nodes and the underlying interconnect fabric. This will include:

* How we implemented GPU health detection leveraging Prometheus and DCGM Exporter
* How we monitor GPU Direct Remote Direct Memory Access (GDRDMA), and the challenges of monitoring components that bypass the CPU
* A discussion of failure scenarios encountered during training, and how they were addressed

Databricks Mosaic AI Training leverages GPU clusters across many cloud providers to maximize availability; we will also discuss the variations we see between providers and how we engineered around them.
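As a rough illustration of the kind of GPU health detection described above, the sketch below classifies per-GPU samples using metric names that NVIDIA's DCGM Exporter exposes to Prometheus (`DCGM_FI_DEV_XID_ERRORS`, `DCGM_FI_DEV_GPU_TEMP`, `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL`). The thresholds and classification logic are hypothetical, not the speakers' actual implementation:

```python
# Hypothetical sketch: classify GPU health from DCGM-Exporter-style metrics.
# Metric names mirror those the DCGM Exporter publishes for Prometheus to
# scrape; the thresholds and policy below are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class GpuSample:
    node: str
    gpu: int
    xid_errors: float     # DCGM_FI_DEV_XID_ERRORS (driver-reported XID events)
    temp_c: float         # DCGM_FI_DEV_GPU_TEMP (degrees Celsius)
    ecc_dbe_total: float  # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL (double-bit ECC errors)


def classify(sample: GpuSample, max_temp_c: float = 85.0) -> str:
    """Return 'healthy', 'degraded', or 'unhealthy' for one GPU sample."""
    if sample.xid_errors > 0 or sample.ecc_dbe_total > 0:
        # XID or uncorrectable ECC errors: candidate for cordoning the node.
        return "unhealthy"
    if sample.temp_c > max_temp_c:
        # Thermal-throttling risk: flag for closer monitoring.
        return "degraded"
    return "healthy"


def unhealthy_nodes(samples: list[GpuSample]) -> set[str]:
    """Nodes with at least one unhealthy GPU, e.g. to cordon in Kubernetes."""
    return {s.node for s in samples if classify(s) == "unhealthy"}
```

In practice a check like this would run against query results from the Prometheus HTTP API, and the "cordon" step would be a Kubernetes node cordon/drain rather than a returned set of names.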