Name: Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh & Rupeng Liu, Google
Start: 2024-11-15T16:55:00-0700
End: 2024-11-15T17:30:00-0700

Friday November 15, 2024 4:55pm - 5:30pm MST

Salt Palace | Level 2 | 255 E

Large language models (LLMs) have shown remarkable capabilities in a range of tasks, from text generation to code writing. Serving them, however, presents significant challenges. LLMs are computationally intensive and often require specialized hardware such as TPUs or GPUs to achieve reasonable response times, and in some cases their size can exceed the resources of a single machine. Specifically, models such as Gemini, Claude, and GPT-4 are too large to fit on a single GPU or TPU device, or even on a single multi-accelerator machine. This necessitates what we refer to as multi-node server deployment, where a single model server “backend” runs as a distributed process across multiple nodes to harness enough accelerator memory to fit and run the model. This talk presents LeaderWorkerSet, a new Kubernetes API that enables multi-node model inference. We demonstrate its capabilities by orchestrating state-of-the-art model servers such as vLLM and JetStream on both GPUs and TPUs.

Speakers

Abdullah Gharaibeh

Staff Software Engineer, Google

Abdullah is a staff software engineer at Google and co-chair of SIG Scheduling and Working Group Batch. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.

Rupeng Liu

Software Engineer, Google

Rupeng Liu is a software engineer on Google's Kubernetes inference team.

Slides: LeaderWorkerSet for distributed inference (PDF)

AI + ML

Content Experience Level: Intermediate

KubeCon + CloudNativeCon North America 2024
