KubeCon + CloudNativeCon North America 2024 | In-person | November 12-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Thursday November 14, 2024 3:25pm - 4:00pm MST
As the popularity of Large Language Models (LLMs) grows, LLM serving systems face challenges in efficiently utilizing GPUs on Kubernetes. In many cases, dedicating an entire GPU to a small or unpopular model is wasteful; however, understanding the relationship between request load and resource requirements has been difficult. This talk studies the GPU compute and memory requirements of LLM inference servers, such as vLLM, revealing an analytical relationship between key configuration parameters and performance metrics such as throughput and latency. This novel understanding makes it possible to decide, at deployment time, an optimal GPU fraction based on the model's characteristics and estimated load. We will demo an open-source controller capable of intercepting inference runtime deployments on Kubernetes to automatically replace requests for whole GPUs with fractional requests using MIG (Multi-Instance GPU) slices, increasing density, and hence LLM sustainability, without sacrificing SLOs.
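To make the rewriting step concrete: a minimal sketch, in Python, of the kind of mutation such a controller might apply to a pod spec, swapping a whole-GPU request for a MIG-slice request. The resource names follow the NVIDIA device plugin convention (e.g. `nvidia.com/mig-2g.10gb`), but the slice-selection thresholds and function names here are invented for illustration and are not taken from the talk's controller.

```python
# Hypothetical sketch: rewrite whole-GPU resource requests in a Kubernetes
# pod spec into fractional MIG-slice requests. Resource names follow the
# NVIDIA device plugin convention; the load thresholds are illustrative only.

WHOLE_GPU = "nvidia.com/gpu"

def pick_mig_slice(estimated_load: float) -> str:
    """Map an estimated load (fraction of a full GPU) to a MIG profile.
    Thresholds are assumptions, not the talk's actual policy."""
    if estimated_load <= 0.15:
        return "nvidia.com/mig-1g.5gb"
    if estimated_load <= 0.30:
        return "nvidia.com/mig-2g.10gb"
    return "nvidia.com/mig-3g.20gb"

def rewrite_gpu_requests(pod_spec: dict, estimated_load: float) -> dict:
    """Replace whole-GPU requests/limits in every container with a MIG slice."""
    slice_name = pick_mig_slice(estimated_load)
    for container in pod_spec.get("containers", []):
        resources = container.setdefault("resources", {})
        for section in ("requests", "limits"):
            entries = resources.get(section, {})
            if WHOLE_GPU in entries:
                count = entries.pop(WHOLE_GPU)
                entries[slice_name] = count
    return pod_spec
```

In practice this rewrite would run inside a mutating admission webhook intercepting Deployment creation, so the workload lands on a MIG slice without any change to the user's manifest.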
Speakers
Olivier Tardieu

Principal Research Scientist, Manager, IBM
Dr. Olivier Tardieu is a Principal Research Scientist and Manager at IBM T.J. Watson, NY, USA. He joined IBM Research in 2007. His current research focuses on cloud-related technologies, including Serverless Computing and Kubernetes, as well as their application to Machine Learning…
Yue Zhu

Research Scientist, IBM Research
Dr. Yue Zhu is a Research Scientist at IBM Research specializing in foundation model systems and distributed storage systems. Yue obtained a Ph.D. in Computer Science from Florida State University in 2021 and has consistently contributed to sustainability for foundation models and…
Salt Palace | Level 2 | 255 EF
  Emerging + Advanced
