KubeCon + CloudNativeCon North America 2024 | In-person | November 12-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Thursday November 14, 2024 3:25pm - 4:00pm MST
As the popularity of Large Language Models (LLMs) grows, LLM serving systems face challenges in efficiently utilizing GPUs on Kubernetes. In many cases, dedicating an entire GPU to a small or unpopular model is wasteful; however, understanding the relationship between request load and resource requirements has been difficult. This talk studies the GPU compute and memory requirements of LLM inference servers, such as vLLM, revealing an analytical relationship between key configuration parameters and performance metrics such as throughput and latency. This novel understanding makes it possible to decide, at deployment time, an optimal GPU fraction based on the model's characteristics and estimated load. We will demo an open-source controller capable of intercepting inference runtime deployments on Kubernetes to automatically replace requests for whole GPUs with fractional requests using MIG (Multi-Instance GPU) slices, increasing density, and hence LLM sustainability, without sacrificing SLOs.
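To make the rewriting step concrete: a minimal sketch, in Python, of the kind of mutation such a controller might apply to a pod spec, swapping a whole-GPU request for a MIG-slice request. The resource names follow the NVIDIA device plugin convention (e.g. `nvidia.com/mig-2g.10gb`), but the slice-selection thresholds and function names here are invented for illustration and are not taken from the talk's controller.

```python
# Hypothetical sketch: rewrite whole-GPU resource requests in a Kubernetes
# pod spec into fractional MIG-slice requests. Resource names follow the
# NVIDIA device plugin convention; the load thresholds are illustrative only.

WHOLE_GPU = "nvidia.com/gpu"

def pick_mig_slice(estimated_load: float) -> str:
    """Map an estimated load (fraction of a full GPU) to a MIG profile.
    Thresholds are assumptions, not the talk's actual policy."""
    if estimated_load <= 0.15:
        return "nvidia.com/mig-1g.5gb"
    if estimated_load <= 0.30:
        return "nvidia.com/mig-2g.10gb"
    return "nvidia.com/mig-3g.20gb"

def rewrite_gpu_requests(pod_spec: dict, estimated_load: float) -> dict:
    """Replace whole-GPU requests/limits in every container with a MIG slice."""
    slice_name = pick_mig_slice(estimated_load)
    for container in pod_spec.get("containers", []):
        resources = container.setdefault("resources", {})
        for section in ("requests", "limits"):
            entries = resources.get(section, {})
            if WHOLE_GPU in entries:
                count = entries.pop(WHOLE_GPU)
                entries[slice_name] = count
    return pod_spec
```

In practice this rewrite would run inside a mutating admission webhook intercepting Deployment creation, so the workload lands on a MIG slice without any change to the user's manifest.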
Speakers
Olivier Tardieu

Principal Research Scientist, Manager, IBM
Dr. Olivier Tardieu is a Principal Research Scientist and Manager at IBM T.J. Watson, NY, USA. He joined IBM Research in 2007. His current research focuses on cloud-related technologies, including Serverless Computing and Kubernetes, as well as their application to Machine Learning…
Yue Zhu

Research Scientist, IBM Research
Dr. Yue Zhu is a Research Scientist at IBM Research specializing in foundation model systems and distributed storage systems. Yue obtained a Ph.D. in Computer Science from Florida State University in 2021 and has consistently contributed to sustainability for foundation models and…
Salt Palace | Level 2 | 255 EF
  Emerging + Advanced
