Name: Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes - David Gray, Red Hat
Start: 2024-11-13T15:25:00-0700
End: 2024-11-13T16:00:00-0700

In-person, November 12-15

Wednesday November 13, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 2 | 255 E

As generative AI language models improve, they are increasingly being integrated into business-critical applications. However, large language model (LLM) inference is a compute-intensive workload that often requires expensive GPU hardware. Making efficient use of these hardware resources in the public or private cloud is critical for managing costs and power usage. This talk introduces the KServe platform for deploying LLMs on Kubernetes and provides an overview of LLM inference performance concepts. Attendees will learn techniques to improve load balancing and autoscaling for LLM inference, such as leveraging KServe, Knative, and GPU operator features. Sharing test results, we will analyze the impact of these optimizations on key performance metrics, such as latency per token and tokens per second. This talk equips participants with strategies to maximize the efficiency of LLM inference deployments on Kubernetes, ultimately reducing costs and improving resource utilization.

Speakers

David Gray

Senior Software Engineer, Red Hat

David Gray is a Senior Software Engineer on the Performance and Scale team at Red Hat. His role involves analyzing and improving AI inference workloads on Kubernetes platforms. David is actively engaged in performance experimentation and analysis of running large language models in...

David Gray KCNA24 LLM Inference load balancing and autoscaling pdf

AI + ML

Content Experience Level Any

KubeCon + CloudNativeCon North America 2024
