Training large language models (LLMs) on GPUs within Kubernetes environments involves significant configuration and complexity, often leading to unique failure scenarios. This presentation covers lessons learned from training DBRX, a state-of-the-art LLM we developed on a 400-node cluster whose primary workload used 3072 GPUs, along with the tooling needed to measure and maintain a healthy fleet of nodes and the underlying interconnect fabric.
This will include:
- How we implemented GPU health detection leveraging Prometheus and DCGM Exporter (a brief sketch of this kind of check appears after this abstract)
- How we monitor GPUDirect Remote Direct Memory Access (GDRDMA) and the challenges of monitoring components that bypass the CPU (see the second sketch below)
- Discussion of failure scenarios encountered during training and how they were addressed

Databricks Mosaic AI Training leverages GPU clusters across many cloud providers to maximize availability; we will also discuss the variations we see between providers and how we engineered around them.
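
To give a flavor of the health-detection approach mentioned above, the sketch below polls a Prometheus server for dcgm-exporter metrics and flags nodes reporting recent XID errors or high GPU temperatures. It is a minimal illustration, not our production rules: the Prometheus address, the `Hostname` label, and the thresholds are assumptions, and the metric names follow the standard dcgm-exporter defaults.

```python
# Minimal sketch: flag potentially unhealthy GPU nodes from dcgm-exporter metrics.
# Assumptions: Prometheus is reachable at PROM_URL and scrapes dcgm-exporter;
# metric and label names follow the dcgm-exporter defaults (DCGM_FI_DEV_*, Hostname).
import requests

PROM_URL = "http://prometheus.monitoring:9090"  # hypothetical in-cluster address


def instant_query(promql: str) -> list[dict]:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


def unhealthy_gpu_nodes() -> set[str]:
    """Return node names whose GPUs report an XID error or run unusually hot."""
    suspect = set()
    # A nonzero last-seen XID error is treated as grounds for closer inspection.
    for sample in instant_query("DCGM_FI_DEV_XID_ERRORS > 0"):
        suspect.add(sample["metric"].get("Hostname", "unknown"))
    # GPUs hot enough to throttle slow the whole synchronous training job.
    for sample in instant_query("DCGM_FI_DEV_GPU_TEMP > 85"):
        suspect.add(sample["metric"].get("Hostname", "unknown"))
    return suspect


if __name__ == "__main__":
    for node in sorted(unhealthy_gpu_nodes()):
        print(f"cordon candidate: {node}")
```

In practice, output like this would feed a cordon/drain workflow so suspect nodes are removed from the pool before the next training run is scheduled onto them.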
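The GDRDMA monitoring challenge comes from the data path bypassing the CPU: RDMA traffic never touches the kernel network stack, so ordinary interface statistics miss it. One place it remains visible is the NIC's own hardware counters exposed through sysfs. The sketch below samples those counters; the paths and the 4-byte counter units assume a standard Linux InfiniBand (or RoCE) deployment and are illustrative rather than a description of our exact tooling.

```python
# Minimal sketch: sample per-NIC RDMA traffic counters directly from sysfs.
# Because GDRDMA traffic bypasses the CPU and the kernel network stack, it does
# not appear in normal interface statistics; the InfiniBand port counters below
# are one place it can still be observed. Assumes standard Linux IB/RoCE sysfs paths.
from pathlib import Path
import time

IB_ROOT = Path("/sys/class/infiniband")


def read_port_counter(device: str, port: str, counter: str) -> int:
    """Read a single hardware counter for one port of one RDMA device."""
    return int((IB_ROOT / device / "ports" / port / "counters" / counter).read_text())


def _snapshot_rcv_bytes() -> dict[str, int]:
    """Collect received-byte counts for every device/port pair."""
    snapshot = {}
    for dev in IB_ROOT.iterdir():
        for port in (dev / "ports").iterdir():
            # port_rcv_data is reported in 4-byte units on InfiniBand.
            raw = read_port_counter(dev.name, port.name, "port_rcv_data")
            snapshot[f"{dev.name}/{port.name}"] = raw * 4
    return snapshot


def sample_throughput(interval_s: float = 5.0) -> dict[str, float]:
    """Estimate per-link RDMA receive throughput in bytes/sec over an interval."""
    before = _snapshot_rcv_bytes()
    time.sleep(interval_s)
    after = _snapshot_rcv_bytes()
    return {link: (after[link] - before[link]) / interval_s for link in after}


if __name__ == "__main__":
    for link, bps in sample_throughput().items():
        print(f"{link}: {bps / 1e9:.2f} GB/s received")
```

A collector along these lines, exported to Prometheus alongside the DCGM metrics, is the kind of tooling that lets per-link fabric behavior be correlated with training throughput.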