
The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
AI + ML
Wednesday, November 13
 

11:15am MST

Advanced Model Serving Techniques with Ray on Kubernetes - Andrew Sy Kim, Google & Kai-Hsun Chen, Anyscale
Wednesday November 13, 2024 11:15am - 11:50am MST
With the proliferation of Large Language Models, Ray, a distributed open-source framework for scaling AI/ML, has developed many advanced techniques for serving LLMs in a distributed environment. In this session, Andrew Sy Kim and Kai-Hsun Chen will provide an in-depth exploration of advanced model serving techniques using Ray, covering model composition, model multiplexing and fractional GPU scheduling. Additionally, they will discuss ongoing initiatives in Ray focused on GPU-native communication, which, when combined with Kubernetes DRA, offers a scalable approach to tensor parallelism, a technique used to fit large models across multiple GPUs. Finally, they will present a live demo, demonstrating how KubeRay enables the practical application of these techniques to real-world LLM deployments on Kubernetes. The demo will showcase Ray’s powerful capabilities to scale, compose and orchestrate popular open-source models across a diverse set of hardware accelerators and failure domains.
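To make the techniques above concrete, here is a minimal Python sketch of Ray Serve model composition with fractional GPU scheduling; the deployment names, the toy model logic, and the 0.5-GPU split are illustrative assumptions, not code from the session.

# Sketch: Ray Serve model composition plus fractional GPU scheduling.
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment(ray_actor_options={"num_gpus": 0.5})  # two replicas can share one GPU
class Summarizer:
    def __call__(self, text: str) -> str:
        return text[:100]  # stand-in for real model inference

@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class Translator:
    def __call__(self, text: str) -> str:
        return text.upper()  # stand-in for real model inference

@serve.deployment
class Pipeline:
    # Composes the two models behind a single endpoint.
    def __init__(self, summarizer: DeploymentHandle, translator: DeploymentHandle):
        self.summarizer = summarizer
        self.translator = translator

    async def __call__(self, text: str) -> str:
        summary = await self.summarizer.remote(text)
        return await self.translator.remote(summary)

app = Pipeline.bind(Summarizer.bind(), Translator.bind())
serve.run(app)  # deploy to a running Ray cluster, e.g. one managed by KubeRay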
Speakers

Andrew Sy Kim

Software Engineer, Google
Andrew Sy Kim is a software engineer at Google working on Kubernetes and GKE.
avatar for Kai-Hsun Chen

Kai-Hsun Chen

Software Engineer, Anyscale
Kai-Hsun Chen is a software engineer on the Ray Core team at Anyscale and the primary maintainer of KubeRay. He is also an open-source enthusiast, as well as a committer and PMC member of Apache Submarine.
Salt Palace | Level 2 | 255 EF
  AI + ML

12:10pm MST

AI and ML: Let’s Talk About the Boring (yet Critical!) Operational Side - Rob Koch, Slalom Build & Milad Vafaeifard, Epam
Wednesday November 13, 2024 12:10pm - 12:45pm MST
As AI and ML become increasingly prevalent, it’s worth looking harder at the operational side of running these applications. We need a lot of compute and access to GPU workloads. These applications must also be reliable, provide rock-solid separation between datasets and training processes, offer great observability in case things go wrong, and remain simple to operate. Let's build our ML applications on top of a service mesh instead of spending resources reimplementing the wheel – or, worse, the flat tire. Join us for a lively, informative, and entertaining look at how a service mesh can solve real-world issues with ML applications while making it simpler and faster to actually get things done in the world of ML. Rob Koch, Principal at Slalom Build, will demonstrate how you can use Linkerd together with multiple clusters to develop, debug, and deploy an ML application in Kubernetes (including IPv6 and GPUs), with special attention to multitenancy and scaling.
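As a rough sketch of the mesh-first approach the abstract argues for, the following Python snippet opts a hypothetical ml-training namespace into Linkerd's automatic proxy injection via the Kubernetes client; the namespace name is an assumption, while the annotation is Linkerd's standard one.

# Sketch: opt an ML namespace into Linkerd sidecar injection.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()
patch = {"metadata": {"annotations": {"linkerd.io/inject": "enabled"}}}
v1.patch_namespace(name="ml-training", body=patch)  # "ml-training" is a placeholder
# New pods in the namespace now get the Linkerd proxy, providing mTLS,
# retries, and golden metrics without changing application code.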
Speakers

Rob Koch

Principal, Slalom Build
A tech enthusiast who thrives on steering projects from their initial spark to successful fruition, Rob Koch is Principal at Slalom Build, AWS Hero, and Co-chair of the CNCF Deaf and Hard of Hearing Working Group. His expertise in architecting event-driven systems is firmly rooted...

Milad Vafaeifard

Lead Software Engineer, Epam
Milad Vafaeifard, a Lead Software Engineer at EPAM Systems, has 9+ years of web design and development expertise. Deaf but undeterred, he is the creative force behind Sign Language Tech and an active contributor to a YouTube channel focused on tech content for the signing tech community...
Salt Palace | Level 2 | 255 EF
  AI + ML
  • Content Experience Level Any

12:10pm MST

Operationalizing High-Performance GPU Clusters in Kubernetes: A Case Study of Databricks' DBRX - Will Gleich & Wai Wu, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Training large language models (LLMs) on GPUs within Kubernetes environments involves significant configuration and complexity, often leading to unique failure scenarios. This presentation will cover the lessons learned from training DBRX, a state-of-the-art LLM that we developed on a 400-node cluster with a primary workload utilizing 3072 GPUs, and the tooling needed to measure and maintain a healthy fleet of nodes and the underlying interconnect fabric. This will include:
• How we implemented GPU health detection leveraging Prometheus and the DCGM Exporter
• How we monitor GPUDirect Remote Direct Memory Access (GDRDMA) and the challenges of monitoring components that bypass the CPU
• A discussion of failure scenarios during training, and how they were addressed
Databricks Mosaic AI Training leverages GPU clusters across many cloud providers to maximize availability; we will also discuss the variations we see and how we had to engineer around them.
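For a flavor of what DCGM-based health detection can look like, here is a hedged Python sketch that queries a Prometheus server for the DCGM exporter's XID-error metric; the Prometheus URL and label names depend on your scrape configuration and are assumptions here.

# Sketch: flag GPUs reporting XID errors via DCGM exporter metrics in Prometheus.
import requests

PROM_URL = "http://prometheus.monitoring:9090/api/v1/query"  # assumed address

def unhealthy_gpus() -> list[dict]:
    # DCGM_FI_DEV_XID_ERRORS is exported by NVIDIA's DCGM exporter; a non-zero
    # value usually indicates a driver or hardware fault worth draining.
    resp = requests.get(PROM_URL, params={"query": "DCGM_FI_DEV_XID_ERRORS > 0"})
    resp.raise_for_status()
    return [
        {"node": r["metric"].get("kubernetes_node", "?"),  # label names vary by relabeling
         "gpu": r["metric"].get("gpu", "?"),
         "xid": r["value"][1]}
        for r in resp.json()["data"]["result"]
    ]

for gpu in unhealthy_gpus():
    print(f"node={gpu['node']} gpu={gpu['gpu']} xid={gpu['xid']}")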
Speakers

Wai Wu

Databricks

Will Gleich

Sr. DevOps Engineer, Databricks
Will Gleich is a Sr. DevOps engineer at Databricks specializing in MLOps and Site Reliability Engineering.
Salt Palace | Level 1 | Hall DE
  AI + ML

2:30pm MST

Architecting the Future of AI: From Cloud-Native Orchestration to Advanced LLMOps - Ion Stoica, Anyscale
Wednesday November 13, 2024 2:30pm - 3:05pm MST
With the groundbreaking release of ChatGPT, large language models (LLMs) have taken the world by storm: they have enabled new applications, exacerbated the GPU shortage, and raised new questions about the veracity of their answers. This talk delves into an AI stack encompassing cloud-native orchestration, distributed computing, and advanced LLMOps. Key topics include:
• Kubernetes: the foundational technology that seamlessly manages AI workloads across diverse cloud environments.
• Ray: the versatile, open-source framework that streamlines the development and scaling of distributed applications.
• vLLM: the cutting-edge, high-performance, and memory-efficient inference and serving engine designed specifically for large language models.
Attendees will gain insights into the architecture and integration of these powerful tools, driving innovation and efficiency in the deployment of AI solutions.
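As a taste of the vLLM layer of this stack, a minimal offline-inference sketch using vLLM's Python API; the model id is a placeholder.

# Sketch: vLLM offline batch inference.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any Hugging Face model id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain Kubernetes in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)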
Speakers

Ion Stoica

Co-founder, Executive Chairman & President, Anyscale
Ion Stoica is a Professor in the EECS Department at the University of California at Berkeley, and the Director of SkyLab. He is currently doing research on cloud computing and AI systems. Past work includes Ray, Apache Spark, Apache Mesos, Tachyon, Chord DHT, and Dynamic Packet State...
Salt Palace | Level 2 | 255 EF
  AI + ML
  • Content Experience Level Any

2:30pm MST

Optimizing LLM Performance in Kubernetes with OpenTelemetry - Ashok Chandrasekar, Google & Liudmila Molkova, Microsoft
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Large language models are surging in popularity, and their deployments on Kubernetes have steadily increased. LLM applications bring new usage patterns in which the industry has little expertise, and the lack of observability in these deployments makes performance issues difficult to debug. We will present an end-to-end walkthrough of how you can leverage client and server LLM observability using OpenTelemetry, based on the recent efforts in the Kubernetes and OpenTelemetry communities to standardize these signals across LLM clients and model servers. We will also demonstrate how to troubleshoot a real-world performance issue in your LLM deployment and how to optimize your LLM server setup for better performance on Kubernetes. We'll show how to use Kubernetes autoscaling based on custom model server metrics and demonstrate how they offer a superior alternative to GPU utilization metrics for such deployments.
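As an illustration of the client-side instrumentation discussed here, a small Python sketch that wraps an LLM call in an OpenTelemetry span. The gen_ai.* attribute names follow the still-evolving GenAI semantic conventions, and the model name and token count are stand-ins.

# Sketch: tracing an LLM client call with OpenTelemetry.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-client")

def generate(prompt: str) -> str:
    with tracer.start_as_current_span("chat meta-llama-3-8b") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "meta-llama-3-8b")  # placeholder
        response = "..."  # call the model server here
        span.set_attribute("gen_ai.usage.output_tokens", 42)  # report real counts
        return response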
Speakers

Liudmila Molkova

Principal Software Engineer, Microsoft
Liudmila Molkova is a Principal Software Engineer at Microsoft working on observability and Azure client libraries. She is a co-author of distributed tracing implementations across the .NET ecosystem including HTTP client instrumentation and Azure Functions. Liudmila is an active...

Ashok Chandrasekar

Senior Software Engineer, Google
Ashok Chandrasekar is a Senior Software Engineer at Google working on AI/ML experience for Google Kubernetes Engine. Previously he was a Staff Engineer at VMware where he led the cluster lifecycle management area for Tanzu Mission Control. He has 7 years of Kubernetes experience working...
Salt Palace | Level 1 | Hall DE
  AI + ML

3:25pm MST

A Tale of 2 Drivers: GPU Configuration on the Fly Using DRA - Alay Patel & Varun Ramachandra Sekar US, Nvidia
Wednesday November 13, 2024 3:25pm - 4:00pm MST
NVIDIA’s GeForce NOW is a cloud gaming service that allows users to stream video games from NVIDIA's servers to a wide range of devices, including PCs, Macs, Android devices, iOS devices, and smart TVs. Under the hood, it is powered by Kubernetes running KubeVirt VMs. For a seamless user experience, GeForce NOW dynamically switches GPU drivers to accommodate either passing through an entire GPU or slicing it into multiple virtual GPUs, all while keeping utilization close to 100% across the datacenter. This poses significant challenges when using the traditional device plugin API provided by Kubernetes. In this talk, we explore GeForce NOW’s journey to transition away from the traditional device plugin API in favor of Dynamic Resource Allocation (DRA). We'll share valuable insights for anyone looking to perform a similar migration of their own. Join us to learn about the challenges, solutions, and best practices to help optimize your GPU-accelerated workloads in the cloud.
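For orientation, a sketch of the kind of DRA objects such a migration introduces, built as plain Python dicts and emitted as YAML for kubectl apply. DRA is alpha and its schema moves between releases; the group/version and field names below follow the Kubernetes 1.31-era v1alpha3 API, and the device class name is an assumption about NVIDIA's DRA driver.

# Sketch: a ResourceClaimTemplate and a pod that consumes the claim.
import yaml

claim_template = {
    "apiVersion": "resource.k8s.io/v1alpha3",  # version varies by release
    "kind": "ResourceClaimTemplate",
    "metadata": {"name": "single-gpu"},
    "spec": {
        "spec": {
            "devices": {
                "requests": [
                    {"name": "gpu", "deviceClassName": "gpu.nvidia.com"}  # assumed class
                ]
            }
        }
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "game-vm"},
    "spec": {
        "resourceClaims": [
            {"name": "gpu", "resourceClaimTemplateName": "single-gpu"}
        ],
        "containers": [{
            "name": "workload",
            "image": "registry.example.com/game:latest",  # placeholder image
            "resources": {"claims": [{"name": "gpu"}]},
        }],
    },
}

print(yaml.safe_dump_all([claim_template, pod]))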
Speakers

Alay Patel

Senior Software Engineer, Nvidia
Alay is a Senior Software Engineer at Nvidia, where he works on the cloud gaming service, exposing infrastructure for GPU workloads. He is passionate about open source, with a focus on Kubernetes and platform engineering.

Varun Ramachandra Sekar US

Senior Software Engineer, Nvidia
Developer by day, dog whisperer by night.
Salt Palace | Level 2 | 255 EF
  AI + ML

3:25pm MST

Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes - David Gray, Red Hat
Wednesday November 13, 2024 3:25pm - 4:00pm MST
As generative AI language models improve, they are increasingly being integrated into business-critical applications. However, large language model (LLM) inference is a compute-intensive workload that often requires expensive GPU hardware. Making efficient use of these hardware resources in the public or private cloud is critical for managing costs and power usage. This talk introduces the KServe platform for deploying LLMs on Kubernetes and provides an overview of LLM inference performance concepts. Attendees will learn techniques to improve load balancing and autoscaling for LLM inference, such as leveraging KServe, Knative, and GPU operator features. Sharing test results, we will analyze the impact of these optimizations on key performance metrics, such as latency per token and tokens per second. This talk equips participants with strategies to maximize the efficiency of LLM inference deployments on Kubernetes, ultimately reducing costs and improving resource utilization.
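To make the autoscaling idea concrete, a hedged sketch of a KServe InferenceService that scales on request concurrency (a model-server-level signal) rather than GPU utilization; the runtime, model URI, and Knative annotations are illustrative choices, not the speakers' exact setup.

# Sketch: KServe InferenceService autoscaled on concurrency, created via the
# Kubernetes custom objects API.
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {
        "name": "llm-demo",
        "annotations": {
            # Knative KPA: scale to keep ~4 in-flight requests per replica
            "autoscaling.knative.dev/metric": "concurrency",
            "autoscaling.knative.dev/target": "4",
        },
    },
    "spec": {
        "predictor": {
            "minReplicas": 1,
            "maxReplicas": 8,
            "model": {
                "modelFormat": {"name": "huggingface"},
                "storageUri": "hf://meta-llama/Llama-3.1-8B-Instruct",  # placeholder
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            },
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1", namespace="default",
    plural="inferenceservices", body=inference_service,
)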
Speakers

David Gray

Senior Software Engineer, Red Hat
David Gray is a Senior Software Engineer on the Performance and Scale team at Red Hat. His role involves analyzing and improving AI inference workloads on Kubernetes platforms. David is actively engaged in performance experimentation and analysis of running large language models in...
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

4:30pm MST

Making Kubernetes Simpler for Accelerated Workloads - Susan Wu, Google; Lucy Sweet, Uber; Mitch McKenzie, Weave; Aditya Shanker, Crusoe
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Kubernetes and the open-source ecosystem for AI frameworks have been great for LLM innovation, empowering developers to build applications that use natural language as the interface to data. Yet many developers and cluster operators struggle to put these frameworks into production use. In this session, hear from several platform engineers responsible for designing core infrastructure supporting accelerated workloads, services, and large language model training and inference pipelines. You can expect to come away with guidance, hear about pitfalls to watch out for, and learn how they successfully abstracted the infrastructure complexity to improve their research users' experience and velocity. Panelists include:
• Lucy Sweet, Senior Software Engineer (Infrastructure), Uber
• Mitch McKenzie, Site Reliability Engineer - Machine Learning Operations, Weave
• Susan Wu, Outbound Product Manager, Google
Speakers

Susan Wu

Outbound Product Manager, Google
Susan is an Outbound Product Manager for Google Cloud, focusing on GKE Networking and Network Security. She previously led product and technical marketing roles at VMware, Sun/Oracle, Canonical, Docker, Citrix and Midokura (part of Sony Group). She is a frequent speaker at conferences...

Lucy Sweet

Senior Software Engineer, Uber

Lucy is a Senior Software Engineer at Uber Denmark who works on software infrastructure.
Salt Palace | Level 2 | 255 EF
  AI + ML

4:30pm MST

Platform Performance Optimization for AI - a Resource Management Perspective - Antti Kervinen, Intel & Dixita Narang, Google
Wednesday November 13, 2024 4:30pm - 5:05pm MST
How much can node resource management affect AI workload performance? What options are there? What is the trade-off between total throughput and low latencies? In this talk, we take a systematic approach to Platform Performance Optimization, walking the whole path from goal setting through data gathering, analysis, and visualization to conclusions. At each stop along the path, we share our practical experiences from a case of LLM inference optimization. You will find many considerations, findings, and practical tricks to take away: for instance, how to instrument PyTorch without touching the source or a container image, how to change what we are measuring without new expensive benchmark reruns, and how much more we can learn from visualizations compared to numeric averages and percentiles. Finally, we share real results from our case: resource management increased total token throughput per worker node by more than 3.5x over the baseline.
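One way to instrument PyTorch without touching application source or the container image, sketched in Python: ship this file as sitecustomize.py (for example from a ConfigMap) and point PYTHONPATH at its directory so the interpreter imports it automatically at startup. This specific mechanism is our illustration, not necessarily the speakers' tooling.

# sitecustomize.py - sketch of zero-source-change PyTorch instrumentation.
import time
import torch

_orig_call = torch.nn.Module.__call__

def _timed_call(self, *args, **kwargs):
    start = time.perf_counter()
    try:
        return _orig_call(self, *args, **kwargs)
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Fires for every module, including submodules; a real setup would
        # filter and export a metric instead of printing.
        print(f"{type(self).__name__} forward took {elapsed_ms:.2f} ms")

torch.nn.Module.__call__ = _timed_call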
Speakers

Antti Kervinen

Cloud Orchestration Software Engineer, Intel
Antti Kervinen is a Cloud Orchestration Software Engineer working at Intel, whose interest in Linux and distributed systems has led him from academic research on concurrency to the world of Kubernetes. When unplugged, Antti spends his time outdoors discovering the wonders of nature.

Dixita Narang

Software Engineer, Google
Dixita Narang is a Software Engineer at Google on the Kubernetes Node team. With a primary focus on resource management within Kubernetes, Dixita is deeply involved in the development and advancement of the Memory QoS feature, which is currently in the alpha stage. She is a new contributor...
Salt Palace | Level 1 | Hall DE
  AI + ML

5:25pm MST

Detecting and Overcoming GPU Failures During ML Training - Sarah Belghiti, Wayve & Ganeshkumar Ashokavardhanan, Microsoft
Wednesday November 13, 2024 5:25pm - 6:00pm MST
Scaling ML training demands powerful GPU infrastructure, and as model sizes and training scale increase, GPU failures become an expensive risk. From outright hardware faults to subtle performance degradation, undetected GPU problems can sabotage training jobs, inflating costs and slowing development. This talk dives into GPU failure challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues, and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate GPU failure fallout. Drawing on cloud provider and autonomous vehicle company experience, we will share best practices for efficient identification, remediation, and prevention of GPU failures. We will also explore cutting-edge ideas like CRIU and task pre-emption for GPU workloads.
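As a minimal illustration of proactive GPU health checking, a Python probe that shells out to nvidia-smi and flags hot GPUs; the threshold is an assumption, and a production setup would more likely rely on NVIDIA DCGM health checks as the talk discusses.

# Sketch: lightweight node-level GPU health probe via nvidia-smi.
import subprocess

def gpu_health() -> list[str]:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    problems = []
    for line in out.strip().splitlines():
        index, temp, util, mem = (f.strip() for f in line.split(","))
        if int(temp) > 85:  # assumed threshold; tune per hardware
            problems.append(f"GPU {index}: temperature {temp}C (util {util}%, mem {mem} MiB)")
    return problems

if __name__ == "__main__":
    for p in gpu_health():
        print(p)  # a health agent would cordon the node or raise an alert here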
Speakers

Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this Kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine...

Sarah Belghiti

ML Platform Engineer, Wayve
Sarah Belghiti is an ML Platform Engineer at Wayve, a leading developer of embodied intelligence for autonomous vehicles. She works on the infrastructure, scheduling and monitoring of ML workloads. With GPUs becoming an increasingly scarce resource, her focus has been on building...
Salt Palace | Level 1 | 155 EF
  AI + ML

5:25pm MST

Production AI at Scale: Cloudera’s Journey in Building a Robust Inference Platform - Zoram Thanga & Peter Ableda, Cloudera
Wednesday November 13, 2024 5:25pm - 6:00pm MST
In this session, we talk about Cloudera AI Inference Service, a secure, large-scale platform for generative AI and predictive inference workloads, built using state-of-the-art Kubernetes, CNCF, and Apache open source projects. We take the audience through our journey in building this platform and share the experiences we gained along the way. The platform is built with openness, security, scalability, performance, and standards compliance as guiding principles. We demonstrate that it is possible to be open and secure at the same time, and that organizations can incorporate production-grade AI inferencing into their Big Data environments. This session will cover the architecture of the platform and explain how we handle performance, scaling, authentication, fine-grained authorization, and audit logging, all of which are critical considerations for production inferencing.
Speakers

Peter Ableda

Director, Product Management, Cloudera
Peter Ableda is the Director of Product Management for Cloudera’s AI product suite, bringing over a decade of experience in data management and advanced analytics. Holding a Master of Science degree in Computer Science from the Budapest University of Technology, Peter has dedicated...

Zoram Thanga

Principal Engineer, Cloudera
Zoram is a Principal Engineer, Enterprise AI Platform, at Cloudera. He has been working in the software industry for over 23 years and has been involved in building clustering software, containers, file systems, analytical query engines, and ML/AI platforms. He is a committer in the...
Salt Palace | Level 1 | Hall DE
  AI + ML
 
