Loading…
Attending this event?
In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Friday November 15, 2024 4:55pm - 5:30pm MST
Large Language Models have shown remarkable capabilities in various tasks, from text generation to code writing. However, the inference process for these models presents significant challenges. LLMs are computationally intensive, often requiring specialized hardware like TPUs or GPUs to achieve reasonable response times. In some cases their substantial size can strain the resources of a single machine. Specifically, models such as Gemini, Claude, and GPT4 are too large to fit on any single GPU or TPU device, let alone on any single multi-accelerator machine, necessitating what we refer to as multi-node server deployment where a single model server “backend” runs as a distributed process on multiple nodes to harness enough accelerator memory to fit and run the model. This talk presents LeaderWorkerSet, a new k8s API that enables multi-node model inference. We demonstrate its capabilities by orchestrating state of the art model servers such as vLLM and JetStream on both GPUs and TPUs.
Speakers
avatar for Abdullah Gharaibeh

Abdullah Gharaibeh

Staff Software Engineer, Google
Abdullah is a staff software engineer at Google and sig-scheduling and working group batch co-chair. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.
avatar for Rupeng Liu

Rupeng Liu

Software engineer, Google
Rupeng Liu, a software engineer from the Google's Kubernetes inference team
Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link