
The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
AI + ML
Wednesday, November 13
 

11:15am MST

Advanced Model Serving Techniques with Ray on Kubernetes - Andrew Sy Kim, Google & Kai-Hsun Chen, Anyscale
Wednesday November 13, 2024 11:15am - 11:50am MST
With the proliferation of large language models, Ray, an open-source distributed framework for scaling AI/ML, has developed many advanced techniques for serving LLMs in a distributed environment. In this session, Andrew Sy Kim and Kai-Hsun Chen will provide an in-depth exploration of advanced model serving techniques using Ray, covering model composition, model multiplexing and fractional GPU scheduling. Additionally, they will discuss ongoing initiatives in Ray focused on GPU-native communication, which, when combined with Kubernetes DRA, offers a scalable approach to tensor parallelism, a technique used to fit large models across multiple GPUs. Finally, they will present a live demo showing how KubeRay enables the practical application of these techniques to real-world LLM deployments on Kubernetes. The demo will showcase Ray’s powerful capabilities to scale, compose and orchestrate popular open-source models across a diverse set of hardware accelerators and failure domains.
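For readers unfamiliar with the techniques named above, here is a minimal Ray Serve sketch of all three: composition chains two deployments, multiplexing swaps per-tenant weights inside one replica, and fractional GPU scheduling packs replicas onto shared GPUs. The loader and the placeholder models are our own illustration, not code from the talk.

```python
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment(ray_actor_options={"num_gpus": 0.5})  # fractional GPU scheduling
class Summarizer:
    @serve.multiplexed(max_num_models_per_replica=3)    # model multiplexing
    async def get_model(self, model_id: str):
        # Placeholder: a real loader would pull fine-tuned weights from storage.
        return lambda text: f"[{model_id}] summary of: {text[:40]}"

    async def __call__(self, text: str) -> str:
        model = await self.get_model(serve.get_multiplexed_model_id())
        return model(text)

@serve.deployment
class Translator:
    async def __call__(self, text: str) -> str:
        return text  # placeholder translation step

@serve.deployment
class Pipeline:
    # Model composition: translate first, then summarize.
    def __init__(self, translator: DeploymentHandle, summarizer: DeploymentHandle):
        self._translate, self._summarize = translator, summarizer

    async def __call__(self, text: str) -> str:
        return await self._summarize.remote(await self._translate.remote(text))

app = Pipeline.bind(Translator.bind(), Summarizer.bind())
# serve.run(app); clients pick a model via the "serve_multiplexed_model_id" header.
```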
Speakers
Andrew Sy Kim
Software Engineer, Google
Andrew Sy Kim is a software engineer at Google working on Kubernetes and GKE.
Kai-Hsun Chen
Software Engineer, Anyscale
Kai-Hsun Chen is a software engineer on the Ray Core team at Anyscale and the primary maintainer of KubeRay. He is also an open-source enthusiast, as well as a committer and PMC member of Apache Submarine.
Salt Palace | Level 2 | 255 EF
  AI + ML

12:10pm MST

AI and ML: Let’s Talk About the Boring (yet Critical!) Operational Side - Rob Koch, Slalom Build & Milad Vafaeifard, Epam
Wednesday November 13, 2024 12:10pm - 12:45pm MST
As AI and ML become increasingly prevalent, it’s worth looking harder at the operational side of running these applications. We need a lot of compute and access to GPU workloads. We also need to be reliable while providing rock-solid separation between datasets and training processes, we need great observability in case things go wrong, and the whole thing must be simple to operate. Let's build our ML applications on top of a service mesh instead of spending resources reimplementing the wheel – or, worse, the flat tire. Join us for a lively, informative, and entertaining look at how a service mesh can solve real-world issues with ML applications while making it simpler and faster to actually get things done in the world of ML. Rob Koch, Principal at Slalom Build, will demonstrate how you can use Linkerd together with multiple clusters to develop, debug, and deploy an ML application in Kubernetes (including IPv6 and GPUs), with special attention to multitenancy and scaling.
Speakers
Rob Koch
Principal, Slalom Build
A tech enthusiast who thrives on steering projects from their initial spark to successful fruition, Rob Koch is a Principal at Slalom Build, an AWS Hero, and Co-chair of the CNCF Deaf and Hard of Hearing Working Group. His expertise in architecting event-driven systems is firmly rooted...
Milad Vafaeifard
Lead Software Engineer, Epam
Milad Vafaeifard, a Lead Software Engineer at EPAM Systems, has 9+ years of web design and development expertise. Deaf but undeterred, he is the creative force behind Sign Language Tech and an active contributor to a YouTube channel focused on tech content for the signing tech community...
Salt Palace | Level 2 | 255 EF
  AI + ML
  • Content Experience Level Any

12:10pm MST

Operationalizing High-Performance GPU Clusters in Kubernetes: A Case Study of Databricks' DBRX - Will Gleich & Wai Wu, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Training large language models (LLMs) on GPUs within Kubernetes environments involves significant configuration and complexity, often leading to unique failure scenarios. This presentation will cover the lessons learned from training DBRX, a state-of-the-art LLM, that we developed on a 400-node cluster with a primary workload utilizing 3072 GPUs and the tooling needed to measure and maintain a healthy fleet of nodes and underlying interconnect fabric. This will include: * How we implemented GPU health detection leveraging Prometheus and DCGM Exporter * How we monitor GPU Direct Remote Direct Memory Access (GDRDMA) and the challenges of monitoring components that bypass CPU * Discussion of failure scenarios during training, and how they were addressed Databricks Mosaic AI Training leverages GPU clusters across many cloud providers to maximize availability; we will also discuss the variations we see and how we had to engineer around them.
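As a rough illustration of the first bullet, GPU health detection with dcgm-exporter often reduces to a Prometheus query over DCGM fields such as DCGM_FI_DEV_XID_ERRORS. A minimal sketch follows; the Prometheus URL is an assumption, and the talk's actual checks are certainly richer.

```python
import requests

PROM_URL = "http://prometheus.example:9090"  # assumed in-cluster Prometheus endpoint

def gpus_with_xid_errors() -> list[dict]:
    """Return GPUs that have reported XID errors, as scraped by dcgm-exporter."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": "DCGM_FI_DEV_XID_ERRORS > 0"},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {
            "node": r["metric"].get("Hostname", r["metric"].get("instance")),
            "gpu": r["metric"].get("gpu"),
            "xid_count": r["value"][1],
        }
        for r in resp.json()["data"]["result"]
    ]

for bad in gpus_with_xid_errors():
    print(f"unhealthy GPU {bad['gpu']} on {bad['node']}: {bad['xid_count']} XID errors")
```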
Speakers
Wai Wu
Databricks
Will Gleich
Sr. DevOps Engineer, Databricks
Will Gleich is a Sr. DevOps engineer at Databricks specializing in MLOps and Site Reliability Engineering.
Salt Palace | Level 1 | Hall DE
  AI + ML

2:30pm MST

Architecting the Future of AI: From Cloud-Native Orchestration to Advanced LLMOps - Ion Stoica, Anyscale
Wednesday November 13, 2024 2:30pm - 3:05pm MST
With the groundbreaking release of ChatGPT, large language models (LLMs) have taken the world by storm: they have enabled new applications, exacerbated the GPU shortage, and raised new questions about the veracity of their answers. This talk delves into an AI stack encompassing cloud-native orchestration, distributed computing, and advanced LLMOps. Key topics include:
- Kubernetes: the foundational technology that seamlessly manages AI workloads across diverse cloud environments.
- Ray: the versatile, open-source framework that streamlines the development and scaling of distributed applications.
- vLLM: the cutting-edge, high-performance, memory-efficient inference and serving engine designed specifically for large language models.
Attendees will gain insights into the architecture and integration of these powerful tools, driving innovation and efficiency in the deployment of AI solutions.
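To make the last bullet concrete, vLLM's offline Python API boils serving down to a few lines. The model ID below is illustrative; tensor_parallel_size shards the weights across GPUs when the model does not fit on one.

```python
from vllm import LLM, SamplingParams

# Illustrative model; any Hugging Face causal LM identifier works here.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1",
          tensor_parallel_size=2)  # shard weights across 2 GPUs
params = SamplingParams(temperature=0.7, max_tokens=128)

for output in llm.generate(["Explain Kubernetes in one sentence."], params):
    print(output.outputs[0].text)
```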
Speakers
Ion Stoica
Co-founder, Executive Chairman & President, Anyscale
Ion Stoica is a Professor in the EECS Department at the University of California, Berkeley, and the Director of SkyLab. He is currently doing research on cloud computing and AI systems. Past work includes Ray, Apache Spark, Apache Mesos, Tachyon, Chord DHT, and Dynamic Packet State...
Salt Palace | Level 2 | 255 EF
  AI + ML
  • Content Experience Level Any

2:30pm MST

Optimizing LLM Performance in Kubernetes with OpenTelemetry - Ashok Chandrasekar, Google & Liudmila Molkova, Microsoft
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Large language models are increasing in popularity, and their deployments on Kubernetes are growing steadily. LLM applications bring new usage patterns that the industry has little expertise with, and a lack of observability in these deployments makes it difficult to debug performance issues. We will present an end-to-end walkthrough of how you can leverage client- and server-side LLM observability using OpenTelemetry, based on recent efforts in the Kubernetes and OpenTelemetry communities to standardize these signals across LLM clients and model servers. We will also demonstrate how to troubleshoot a real-world performance issue in your LLM deployment and how to optimize your LLM server setup for better performance on Kubernetes. We'll show how to use Kubernetes autoscaling based on custom model server metrics and demonstrate how they offer a superior alternative to GPU utilization metrics for such deployments.
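The client-side half of that walkthrough looks roughly like the sketch below: wrap each model call in a span and attach token counts as attributes. The gen_ai.* attribute names follow the still-evolving OpenTelemetry GenAI semantic conventions, and call_model_server is a hypothetical stand-in for a real client.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-client")

def call_model_server(prompt: str) -> str:
    return "stubbed completion"  # hypothetical stand-in for an HTTP/gRPC client

def chat(prompt: str) -> str:
    with tracer.start_as_current_span("chat my-model") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "my-model")
        completion = call_model_server(prompt)
        # Real instrumentations report exact token usage from the server response.
        span.set_attribute("gen_ai.usage.input_tokens", len(prompt.split()))
        span.set_attribute("gen_ai.usage.output_tokens", len(completion.split()))
        return completion

print(chat("What is KubeCon?"))
```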
Speakers
Liudmila Molkova
Principal Software Engineer, Microsoft
Liudmila Molkova is a Principal Software Engineer at Microsoft working on observability and Azure client libraries. She is a co-author of distributed tracing implementations across the .NET ecosystem, including HTTP client instrumentation and Azure Functions. Liudmila is an active...
Ashok Chandrasekar
Senior Software Engineer, Google
Ashok Chandrasekar is a Senior Software Engineer at Google working on the AI/ML experience for Google Kubernetes Engine. Previously he was a Staff Engineer at VMware, where he led the cluster lifecycle management area for Tanzu Mission Control. He has 7 years of Kubernetes experience working...
Salt Palace | Level 1 | Hall DE
  AI + ML

3:25pm MST

A Tale of 2 Drivers: GPU Configuration on the Fly Using DRA - Alay Patel & Varun Ramachandra Sekar, NVIDIA
Wednesday November 13, 2024 3:25pm - 4:00pm MST
NVIDIA’s GeForce NOW is a cloud gaming service that allows users to stream video games from NVIDIA's servers to a wide range of devices, including PCs, Macs, Android devices, iOS devices, and smart TVs. Under the hood, it is powered by Kubernetes running KubeVirt VMs. For a seamless user experience, GeForce NOW dynamically switches GPU drivers to accommodate either passing through an entire GPU or slicing it into multiple virtual GPUs, all while keeping utilization close to 100% across the datacenter. This poses significant challenges when using the traditional device plugin API provided by Kubernetes. In this talk, we explore GeForce NOW’s journey to transition away from the traditional device plugin API in favor of Dynamic Resource Allocation (DRA). We'll share valuable insights for anyone looking to perform a similar migration of their own. Join us to learn about the challenges, solutions, and best practices to help optimize your GPU-accelerated workloads in the cloud.
Speakers
Alay Patel
Senior Software Engineer, NVIDIA
Alay is a Senior Software Engineer at NVIDIA, where he works on the cloud gaming service, exposing infrastructure for GPU workloads. He is passionate about open source with a focus on Kubernetes and platform engineering.
Varun Ramachandra Sekar
Senior Software Engineer, NVIDIA
Developer by day, dog whisperer by night.
Salt Palace | Level 2 | 255 EF
  AI + ML

3:25pm MST

Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes - David Gray, Red Hat
Wednesday November 13, 2024 3:25pm - 4:00pm MST
As generative AI language models improve, they are increasingly being integrated into business-critical applications. However, large language model (LLM) inference is a compute-intensive workload that often requires expensive GPU hardware. Making efficient use of these hardware resources in the public or private cloud is critical for managing costs and power usage. This talk introduces the KServe platform for deploying LLMs on Kubernetes and provides an overview of LLM inference performance concepts. Attendees will learn techniques to improve load balancing and autoscaling for LLM inference, such as leveraging KServe, Knative, and GPU operator features. Sharing test results, we will analyze the impact of these optimizations on key performance metrics, such as latency per token and tokens per second. This talk equips participants with strategies to maximize the efficiency of LLM inference deployments on Kubernetes, ultimately reducing costs and improving resource utilization.
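For orientation, deploying an LLM behind KServe is a single custom resource; the sketch below builds one with the Kubernetes Python client. Field values (model URI, replica bounds, concurrency target) are illustrative assumptions, and scaleMetric/scaleTarget are the knobs that Knative-backed autoscaling keys off instead of GPU utilization.

```python
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llm-demo", "namespace": "default"},
    "spec": {"predictor": {
        "minReplicas": 1,
        "maxReplicas": 4,
        "scaleMetric": "concurrency",  # scale on in-flight requests, not GPU util
        "scaleTarget": 8,
        "model": {
            "modelFormat": {"name": "huggingface"},
            "storageUri": "hf://meta-llama/Llama-3.1-8B-Instruct",  # illustrative
            "resources": {"limits": {"nvidia.com/gpu": "1"}},
        },
    }},
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace="default", plural="inferenceservices",
    body=inference_service)
```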
Speakers
David Gray
Senior Software Engineer, Red Hat
David Gray is a Senior Software Engineer on the Performance and Scale team at Red Hat. His role involves analyzing and improving AI inference workloads on Kubernetes platforms. David is actively engaged in performance experimentation and analysis of running large language models in...
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

4:30pm MST

Making Kubernetes Simpler for Accelerated Workloads - Susan Wu, Google; Lucy Sweet, Uber; Mitch McKenzie, Weave; Aditya Shanker, Crusoe
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Kubernetes and the open-source ecosystem for AI frameworks have been great for LLM innovation, empowering developers to build applications that use natural language as the interface to data. Yet many developers and cluster operators struggle to put these frameworks into production use. In this session, hear from several platform engineers responsible for designing core infrastructure supporting accelerated workloads, services, and large language model training and inference pipelines. You can expect to come away with guidance, hear about pitfalls to watch out for, and learn how they successfully abstracted the infrastructure complexity to improve their research users' experience and velocity. Panelists include: Lucy Sweet, Senior Software Engineer (Infrastructure), Uber; Mitch McKenzie, Site Reliability Engineer - Machine Learning Operations, Weave; and Susan Wu, Outbound Product Manager, Google.
Speakers
Susan Wu
Outbound Product Manager, Google
Susan is an Outbound Product Manager for Google Cloud, focusing on GKE Networking and Network Security. She previously led product and technical marketing roles at VMware, Sun/Oracle, Canonical, Docker, Citrix and Midokura (part of Sony Group). She is a frequent speaker at conferences...
Lucy Sweet
Senior Software Engineer, Uber
Lucy is a Senior Software Engineer at Uber Denmark who works on software infrastructure.
Salt Palace | Level 2 | 255 EF
  AI + ML

4:30pm MST

Platform Performance Optimization for AI - a Resource Management Perspective - Antti Kervinen, Intel & Dixita Narang, Google
Wednesday November 13, 2024 4:30pm - 5:05pm MST
How much can node resource management affect AI workload performance? What options are there? What is the trade-off between total throughput and low latency? In this talk we take a systematic approach to platform performance optimization, walking the whole path from goal setting through gathering data, analysis, visualizations, and conclusions. At each stop along the path we share our practical experiences from a case of LLM inference optimization. You will find many considerations, findings, and practical tricks to take away: for instance, how to instrument PyTorch without touching the source or a container image, how to change what we are measuring without new expensive benchmark reruns, and how much more we can learn from visualizations compared to numeric averages and percentiles. Finally we share real results from our case: how resource management increased total token throughput per worker node by more than 3.5x from the baseline.
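One way to instrument PyTorch "without touching the source or a container image", as teased above, is a sitecustomize.py injected via a ConfigMap mount and PYTHONPATH; Python imports it automatically at startup. This is our guess at the general trick, not necessarily the speakers' exact approach.

```python
# sitecustomize.py -- mount into the pod (e.g. from a ConfigMap) and add its
# directory to PYTHONPATH; no change to the training script or the image.
import time
import torch

_orig_call = torch.nn.Module.__call__

def _timed_call(self, *args, **kwargs):
    start = time.perf_counter()
    out = _orig_call(self, *args, **kwargs)
    dur_ms = (time.perf_counter() - start) * 1000
    # In practice you would aggregate or export these instead of printing.
    print(f"forward {type(self).__name__}: {dur_ms:.2f} ms")
    return out

torch.nn.Module.__call__ = _timed_call  # monkey-patch every module's forward path
```

(Host-side timings like this undercount asynchronous CUDA work; forcing synchronization gives truer numbers at a performance cost.)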
Speakers
Antti Kervinen
Cloud Orchestration Software Engineer, Intel
Antti Kervinen is a Cloud Orchestration Software Engineer at Intel whose interest in Linux and distributed systems has led him from academic research on concurrency to the world of Kubernetes. When unplugged, Antti spends his time outdoors discovering the wonders of nature.
Dixita Narang
Software Engineer, Google
Dixita Narang is a Software Engineer at Google on the Kubernetes Node team. With a primary focus on resource management within Kubernetes, Dixita is deeply involved in the development and advancement of the Memory QoS feature, which is currently in the alpha stage. She is a new contributor...
Salt Palace | Level 1 | Hall DE
  AI + ML

5:25pm MST

Detecting and Overcoming GPU Failures During ML Training - Sarah Belghiti, Wayve & Ganeshkumar Ashokavardhanan, Microsoft
Wednesday November 13, 2024 5:25pm - 6:00pm MST
Scaling ML training demands powerful GPU infrastructure, and as model sizes and training scale increase, GPU failures become an expensive risk. From outright hardware faults to subtle performance degradation, undetected GPU problems can sabotage training jobs, inflating costs and slowing development. This talk dives into GPU failure challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues, and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate GPU failure fallout. Drawing on cloud provider and autonomous vehicle company experience, we will share best practices for efficient identification, remediation, and prevention of GPU failures. We will also explore cutting-edge ideas like CRIU and task preemption for GPU workloads.
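A taste of what a proactive GPU health check can look like, using the NVML Python bindings (DCGM offers richer checks; the conditions flagged here are our own simplified choices):

```python
import pynvml as nvml  # pip install nvidia-ml-py

nvml.nvmlInit()
try:
    for i in range(nvml.nvmlDeviceGetCount()):
        handle = nvml.nvmlDeviceGetHandleByIndex(i)
        try:
            ecc = nvml.nvmlDeviceGetTotalEccErrors(
                handle,
                nvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                nvml.NVML_VOLATILE_ECC)
        except nvml.NVMLError:
            ecc = 0  # ECC reporting not supported on this GPU
        throttle = nvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
        hw_slowdown = bool(throttle & nvml.nvmlClocksThrottleReasonHwSlowdown)
        if ecc > 0 or hw_slowdown:
            # A real system would cordon the node or fail a health probe here.
            print(f"GPU {i}: ecc_uncorrected={ecc} throttle_reasons={throttle:#x}")
finally:
    nvml.nvmlShutdown()
```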
Speakers
Ganeshkumar Ashokavardhanan
Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this Kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine...
Sarah Belghiti
ML Platform Engineer, Wayve
Sarah Belghiti is an ML Platform Engineer at Wayve, a leading developer of embodied intelligence for autonomous vehicles. She works on the infrastructure, scheduling and monitoring of ML workloads. With GPUs becoming an increasingly scarce resource, her focus has been on building...
Salt Palace | Level 1 | 155 EF
  AI + ML

5:25pm MST

Production AI at Scale: Cloudera’s Journey in Building a Robust Inference Platform - Zoram Thanga & Peter Ableda, Cloudera
Wednesday November 13, 2024 5:25pm - 6:00pm MST
In this session, we talk about the Cloudera AI Inference Service, a secure, large-scale platform for generative AI and predictive inference workloads, built using state-of-the-art Kubernetes, CNCF, and Apache open source projects. We take the audience through our journey in building this platform and share the experiences we gained along the way. The platform is built with openness, security, scalability, performance, and standards compliance as guiding principles. We demonstrate that it is possible to be open and secure at the same time, and that organizations can incorporate production-grade AI inferencing into their Big Data environments. This session will cover the architecture of the platform and explain how we handle performance, scaling, authentication, fine-grained authorization, and audit logging, all of which are critical considerations for production inferencing.
Speakers
Peter Ableda
Director, Product Management, Cloudera
Peter Ableda is the Director of Product Management for Cloudera’s AI product suite, bringing over a decade of experience in data management and advanced analytics. Holding a Master of Science degree in Computer Science from the Budapest University of Technology, Peter has dedicated...
Zoram Thanga
Principal Engineer, Cloudera
Zoram is a Principal Engineer on the Enterprise AI Platform at Cloudera. He has been working in the software industry for over 23 years, and has been involved in building clustering software, containers, file systems, analytical query engines, and ML/AI platforms. He is a committer in the...
Salt Palace | Level 1 | Hall DE
  AI + ML
 
Thursday, November 14
 

11:55am MST

Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet - Andrey Velichkevich, Apple & Yuki Iwai, CyberAgent, Inc.
Thursday November 14, 2024 11:55am - 12:30pm MST
Running model training on Kubernetes is challenging due to the complexity of AI/ML models, large training datasets, and various distributed strategies like data and model parallelism. It is crucial to configure failure handling, success criteria, and gang-scheduling for large-scale distributed training to ensure fault tolerance and elasticity. This talk will introduce the new Kubeflow TrainJob API, which democratizes distributed training and LLM fine-tuning on Kubernetes. The speakers will demonstrate how TrainJob integrates with Kubernetes JobSet to ensure scalable and efficient AI model training with a simplified Python experience for data scientists. Additionally, they will explain the innovative concept of reusable and extendable training runtimes within TrainJob. The speakers will highlight how these capabilities empower data scientists to rapidly iterate on their ML development, making Kubernetes more accessible and beneficial for the entire ML ecosystem.
Speakers
Andrey Velichkevich
Senior Software Engineer, Apple
Andrey Velichkevich is a Senior Software Engineer at Apple and a key contributor to the Kubeflow open-source project. He is a member of the Kubeflow Steering Committee and a co-chair of the Kubeflow AutoML and Training WG. Additionally, Andrey is an active member of the CNCF WG AI. He...
Yuki Iwai
Software Engineer, CyberAgent, Inc.
Yuki is a Software Engineer at CyberAgent, Inc. He works on the internal platform for machine-learning applications and high-performance computing. He is currently a Technical Lead for Kubeflow WG AutoML / Training. He is also a Kubernetes WG Batch active member and a Kubernetes...
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

2:30pm MST

Unlocking Potential of Large Models in Production - Yuan Tang, Red Hat & Adam Tetelman, NVIDIA
Thursday November 14, 2024 2:30pm - 3:05pm MST
The recent paradigm shift from traditional ML to GenAI and LLMs has brought with it a new set of non-trivial LLMOps challenges around deployment, scaling, and operations that make building an inference platform to meet all business requirements an unsolved problem. This talk highlights these new challenges along with best practices and solutions for building out large, scalable, and reliable inference platforms on top of cloud native technologies such as Kubernetes, Kubeflow, KServe, and Knative. Which tools help effectively benchmark and assess the quality of an LLM? What type of storage and caching solutions enable quick auto-scaling and model downloads? How can you ensure your model is optimized for the specialized accelerators running in your cluster? How can A/B testing or rolling upgrades be accomplished with limited compute? What exactly do you monitor in an LLM? In this session we will use KServe as a case study to answer these questions and more.
Speakers
Yuan Tang
Principal Software Engineer, Red Hat
Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he led AI infrastructure and platform teams at various companies. He holds leadership positions in open source projects, including Argo, Kubeflow, and Kubernetes. He's also a maintainer and...
Adam Tetelman
Principal Product Architect, NVIDIA
Adam Tetelman is a principal architect at NVIDIA leading cloud native initiatives and CNCF engagements across the company, building inference platforms for NVIDIA AI Enterprise and DGX Cloud. He has degrees in computational robotics, computer & systems engineering, and cognitive science...
Salt Palace | Level 1 | Hall DE
  AI + ML

3:25pm MST

Unlocking the Future of GPU Scheduling in Kubernetes with Reinforcement Learning - Nikunj Goyal, Adobe Systems & Aditi Gupta, Disney Plus Hotstar
Thursday November 14, 2024 3:25pm - 4:00pm MST
Scaling up multi-GPU setups with Kubernetes for large-scale ML projects is a hot topic in both the AI and cloud communities. While Kubernetes can provide computing power by scheduling GPU nodes, issues like resource fragmentation and low utilization hurt performance and drive up costs. Why reinforcement learning (RL) in particular, one might ask? Unlike other algorithms, RL shines in its ability to continuously adapt to changing environments and efficiently handle complex, multi-dimensional objectives, making it particularly suitable for the dynamic and heterogeneous nature of Kubernetes clusters. In this talk, we explore the current landscape of GPU scheduling and some state-of-the-art RL algorithms proposed for scheduling, and dive deep into their current impact on Kubernetes and the possible use of RLHF. We hope the audience gains insight into these new ways of scheduling GPUs on Kubernetes.
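To ground the idea, here is a toy tabular Q-learning scheduler that learns to reduce fragmentation when placing single-GPU pods. Real proposals use far richer state and reward signals, but the feedback loop is the same; the state encoding and reward below are entirely our own invention for illustration.

```python
import random
from collections import defaultdict

NODES, GPUS_PER_NODE = 4, 8
q = defaultdict(float)  # Q[(state, action)] -> value

def place(free: tuple, node: int):
    if free[node] == 0:
        return free, -1.0  # infeasible placement is penalized
    nxt = list(free)
    nxt[node] -= 1
    # Reward consolidation: every partially-filled node adds fragmentation.
    reward = -sum(1 for f in nxt if 0 < f < GPUS_PER_NODE)
    return tuple(nxt), reward

for _ in range(5000):  # training episodes
    free = tuple([GPUS_PER_NODE] * NODES)
    for _ in range(16):  # schedule 16 single-GPU pods per episode
        state = free
        if random.random() < 0.1:  # epsilon-greedy exploration
            action = random.randrange(NODES)
        else:
            action = max(range(NODES), key=lambda a: q[(state, a)])
        free, reward = place(free, action)
        best_next = max(q[(free, a)] for a in range(NODES))
        q[(state, action)] += 0.1 * (reward + 0.9 * best_next - q[(state, action)])

start = tuple([GPUS_PER_NODE] * NODES)
print("first placement goes to node", max(range(NODES), key=lambda a: q[(start, a)]))
```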
Speakers
Aditi Gupta
Software Developer, Disney+ Hotstar
I'm Aditi Gupta, a Software Development Engineer at Disney+ Hotstar. A graduate of Asia's largest tech college for women, Indira Gandhi Delhi Technical University, I've been deeply immersed in cloud-native technologies and AI/ML advancements. Skilled in containerisation, micro-service...
Nikunj Goyal
Developer, Adobe Systems
Hi, I am Nikunj Goyal, a developer at Adobe and a Maths major from IIT Roorkee. I have been working with AI and machine learning for some time, mainly with generative AI and graph-based methods. I am a core part of the text-to-vector generation team at my org and previously worked...
Salt Palace | Level 1 | Hall DE
  AI + ML

4:30pm MST

Which GPU Sharing Strategy Is Right for You? a Comprehensive Benchmark Study Using DRA - Kevin Klues & Yuan Chen, NVIDIA
Thursday November 14, 2024 4:30pm - 5:05pm MST
Dynamic Resource Allocation (DRA) is one of the most anticipated features to ever make its way into Kubernetes. It promises to revolutionize the way hardware devices are consumed and shared between workloads. In particular, DRA unlocks the ability to manage heterogeneous GPUs in a unified and configurable manner without the need for awkward solutions shoehorned on top of the existing device plugin API. In this talk, we use DRA to benchmark various GPU sharing strategies including Multi-Instance GPUs, Multi-Process Service (MPS), and CUDA Time-Slicing. As part of this, we provide guidance on the class of applications that can benefit from each strategy as well as how to combine different strategies in order to achieve optimal performance. The talk concludes with a discussion of potential challenges, future enhancements, and a live demo showcasing the use of each GPU sharing strategy with real-world applications.
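While the talk drives these strategies through DRA, the underlying knobs are easy to poke at directly. For example, MPS partitions SM capacity via an environment variable, so a crude comparison harness can simply sweep it. This assumes an MPS control daemon is already running on the node, and bench.py is a hypothetical workload of your own.

```python
import os
import subprocess

# Sweep MPS active-thread percentages to compare GPU sharing ratios.
for pct in (100, 50, 25):
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(pct))
    print(f"--- running benchmark at {pct}% of SM capacity ---")
    subprocess.run(["python", "bench.py"], env=env, check=True)
```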
Speakers
Kevin Klues
Distinguished Engineer, NVIDIA
Kevin Klues is a distinguished engineer on the NVIDIA Cloud Native team. Kevin has been involved in the design and implementation of a number of Kubernetes technologies, including the Topology Manager, the Kubernetes stack for Multi-Instance GPUs, and Dynamic Resource Allocation (DRA...
Yuan Chen
Principal Software Engineer, NVIDIA
Yuan Chen is a Principal Software Engineer at NVIDIA, working on building NVIDIA GPU Cloud for AI. He served as a Staff Software Engineer at Apple from 2019 to 2024, where he contributed to the development of Apple's Kubernetes infrastructure. Yuan has been an active code contributor...
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

5:25pm MST

Managing and Distributing AI Models Using OCI Standards and Harbor - Steven Zou & Steven Ren, VMware by Broadcom
Thursday November 14, 2024 5:25pm - 6:00pm MST
Just as container images are vital to cloud-native technology, AI models are crucial to AI technology. Effectively, conveniently, and safely managing, maintaining, and distributing AI models is critical for supporting workflows like AI model training, inference, and application deployment. This presentation explores AI model management based on OCI standards and the Harbor project: standardizing AI model structures and characteristics using OCI specifications and extension mechanisms, such as OCI references that link datasets and dependencies. When large models require efficient loading or privacy considerations come into play, replicating models or proxying upstream repositories like Hugging Face becomes essential. Enhancing model distribution security through signing, vulnerability scanning, and policy-based governance is often necessary. Additionally, introducing acceleration mechanisms such as P2P can significantly improve the efficiency of loading large models.
Speakers
Steven Ren
Senior Manager, Broadcom
Steven Zou
Staff II Engineer, VMware by Broadcom
Steven Zou is a senior engineer with years of experience in cloud computing and cloud-native technology. He is currently working as a Staff II engineer at VMware, focusing on cloud-native and Kubernetes-related platform services. In addition, he is a core maintainer of the CNCF open-source...
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

5:25pm MST

Navigating Failures in Pods with Devices: Challenges and Solutions - Sergey Kanzhelev, Google & Mrunal Patel, Red Hat
Thursday November 14, 2024 5:25pm - 6:00pm MST
Pods no longer run with just CPU and memory. We provision GPUs and network cards, and request special placement of those devices and allocated memory. And the more efficient or effective you want your setup to be, the more complicated those device requirements become, and the more likely you are to hit an edge case Kubernetes has not accounted for yet. Come to this talk to learn from Node maintainers about some of those shortcomings in Kubernetes. If you are only starting with AI/ML and devices, you will be interested to learn what to expect. If you have lots of experience, you may still learn new things. With the increased focus on AI/ML workloads, highlighting those scenarios is important. As Kubernetes plans to fix these problems, you can give feedback on what would work best for you.
Speakers
Sergey Kanzhelev
Staff Software Engineer, Google
Sergey Kanzhelev is a seasoned open source and cloud native maintainer working actively on Kubernetes. Sergey is serving as co-chair of SIG Node. He is also one of the founders of OpenTelemetry. He works on the engineering aspects of software and their practical application. He is contributing...
Mrunal Patel
Distinguished Engineer, Red Hat
Mrunal Patel is a Senior Principal Software Engineer at Red Hat working on containers for OpenShift. He is a maintainer of runc/libcontainer and the OCI runtime specification. He started the CRI-O runtime. He is a SIG Node chair and tech lead.
Salt Palace | Level 2 | 250
  AI + ML
  • Content Experience Level Any
 
Friday, November 15
 

11:00am MST

Better Together! GPU, TPU and NIC Topological Alignment with DRA - John Belamaric, Google & Patrick Ohly, Intel
Friday November 15, 2024 11:00am - 11:35am MST
AI/ML workloads on Kubernetes demand ultra-high performance. If your training or multi-GPU inference job spans nodes, your GPUs will use the network, talking through a NIC over local PCIe. But not all NICs are equal! To get the best performance, you need a NIC which is as "close" to the GPU as possible. Unfortunately, the Kubernetes extended resources API does not have enough information and does not give you control over which specific devices are assigned. Dynamic Resource Allocation, the successor API, gives you this power. Come to this session to learn about DRA, how it is improving overall device support in K8s, and how to use it to allocate multiple GPUs, NICs, and TPUs to get the maximum performance out of your infrastructure.
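The kind of request this enables looks roughly like the ResourceClaim below, expressed as a Python dict. The shape follows the resource.k8s.io v1beta1 API, which has shifted across Kubernetes releases; the device class names and the PCIe-root attribute are illustrative assumptions, not a documented driver contract.

```python
# A claim asking for one GPU and one NIC, constrained to share a PCIe root,
# which is one way "closeness" between devices can be expressed under DRA.
resource_claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "gpu-with-nearby-nic"},
    "spec": {"devices": {
        "requests": [
            {"name": "gpu", "deviceClassName": "gpu.example.com"},   # assumed class
            {"name": "nic", "deviceClassName": "nic.example.com"},   # assumed class
        ],
        "constraints": [{
            "requests": ["gpu", "nic"],
            # matchAttribute forces both devices to report the same value for
            # a vendor-published attribute, e.g. their PCIe root complex.
            "matchAttribute": "example.com/pcieRoot",
        }],
    }},
}
```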
Speakers
Patrick Ohly
Principal Engineer, Intel
Patrick Ohly is a software engineer at Intel GmbH, Germany. In the past he has worked on performance analysis software for HPC clusters ("Intel Trace Analyzer and Collector") and cluster technology in general (PTP and hardware time stamping). Since January 2009 he has worked for Intel...
John Belamaric
Senior Staff Software Engineer, Google
John is a Sr Staff SWE, co-chair of K8s SIG Architecture and of K8s WG Device Management, helping lead efforts to improve how GPUs, TPUs, NICs and other devices are selected, shared, and configured in Kubernetes. He is also co-founder of Nephio, an LF project for K8s-based automation...
Salt Palace | Level 2 | 250
  AI + ML

11:55am MST

Building Massive-Scale Generative AI Services with Kubernetes and Open Source - John McBride, OpenSauced
Friday November 15, 2024 11:55am - 12:30pm MST
At OpenSauced, we power over 40,000 generative AI inferences every day, all through our in-house platform on top of Kubernetes. The cost of doing this kind of at-scale AI inference with a third-party provider API would be astronomical. Thankfully, using Kubernetes, the public cloud, and open-source technologies, we've been able to scale with relatively low costs and a lean stack. In this talk, John will walk through the journey of building a production-grade generative AI system using open source technologies, open large language models, and Kubernetes. We'll also explore why we chose to build on top of Kubernetes for our AI workloads rather than use a third-party provider, and how we're running and managing our AI/ML clusters today. Additionally, we'll dive into the techniques we used to groom our Retrieval-Augmented Generation pipelines for efficiency on top of Kubernetes, plus other practical tips for deploying your own AI services at scale.
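For readers new to RAG, the pipeline being groomed here is, at its core, embed-retrieve-generate. A self-contained toy sketch: hash-seeded vectors stand in for a real embedding model, and generate() for the LLM call; none of this is OpenSauced's code.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: deterministic random unit vector per text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def generate(prompt: str) -> str:
    return f"stubbed completion for: {prompt[:60]}..."  # stand-in for the LLM

docs = [
    "Kubernetes schedules pods onto nodes.",
    "OpenSauced analyzes open source contributions.",
]
index = np.stack([embed(d) for d in docs])  # the "vector store"

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = index @ embed(query)  # cosine similarity, since vectors are unit-norm
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

print(answer("What does OpenSauced do?"))
```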
Speakers
John McBride
Sr. Software Engineer, OpenSauced
John is a Sr. Software Engineer at OpenSauced, where he also serves as Head of Infrastructure and AI engineer. He is the maintainer of spf13/cobra, the Go CLI bootstrapping library used throughout the CNCF landscape. In the past, he has worked on open source Kubernetes platforms...
Salt Palace | Level 2 | 250
  AI + ML
  • Content Experience Level Any

11:55am MST

Improving Service Availability: Scaling Ahead with Machine Learning for HPA Optimization - Avni Sharma & Estela Ramirez, Intuit
Friday November 15, 2024 11:55am - 12:30pm MST
In this talk, we will explore employing machine learning (ML) algorithms to enhance Kubernetes autoscaling capabilities beyond the traditional, reactive horizontal pod autoscaler (HPA). Attendees will learn how to leverage recommendation algorithms to predict future load and usage patterns, allowing for smarter, proactive scaling decisions. This approach not only ensures high availability and responsiveness of applications but also offers a pathway to substantial cost optimizations by preventing over-provisioning and minimizing resource wastage.
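A minimal sketch of the "scale ahead" idea: forecast load, then raise the HPA floor before the load arrives. A linear trend stands in for the recommendation models the talk describes, and the HPA name, namespace, and per-pod capacity are assumptions.

```python
import numpy as np
from kubernetes import client, config

def forecast_rps(history: list[float], steps_ahead: int = 5) -> float:
    # Placeholder predictor: fit a linear trend; a production system would
    # use learned recommendation/forecasting algorithms instead.
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, 1)
    return slope * (len(history) - 1 + steps_ahead) + intercept

config.load_kube_config()
rps_history = [120, 140, 170, 210, 260]  # e.g. sampled from Prometheus
predicted = forecast_rps(rps_history)
replicas = max(1, int(np.ceil(predicted / 50)))  # assume one pod handles 50 rps

# Proactively raise the floor so the reactive HPA never starts from too low.
client.AutoscalingV1Api().patch_namespaced_horizontal_pod_autoscaler(
    name="web-hpa", namespace="default",
    body={"spec": {"minReplicas": replicas}})
```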
Speakers
Avni Sharma
Product Manager, Intuit
Avni is a Product Manager at Intuit, working on Intuit’s Modern SaaS Kubernetes platform. She also worked on ArgoCD as a PM. Avni is passionate about developer tooling and strives to make developers' lives easier by delivering delightful experiences. She is also an Open Source...
Estela Ramirez
Software Engineer, Intuit
Estela is a Software Engineer at Intuit focusing on the Intuit Kubernetes Developer Platform. She works on abstracting autoscaling for developers.
Salt Palace | Level 1 | Hall DE
  AI + ML

2:00pm MST

Bloomberg’s Journey to Improve Resource Utilization in a Multi-Cluster Platform - Yao Weng & Leon Zhou, Bloomberg
Friday November 15, 2024 2:00pm - 2:35pm MST
Bloomberg provides an on-premises Data Science Platform (DSP) using cloud-native software to support internal AI model training. It runs on Kubernetes clusters spanning multiple data centers and featuring a diverse range of GPU types. However, managing such a large-scale and heterogeneous GPU environment poses many challenges, such as improving resource utilization, reducing operational costs, and scheduling workloads across different GPU types. In collaboration with the Karmada community, Bloomberg's DSP team has aimed to tackle these challenges by addressing multi-cluster batch job management problems. This talk will delve into the approaches the team has adopted (a small illustration follows the list), including:
- Intelligently scheduling GPU workloads across multiple clusters
- Using Karmada's resource interpreter to support Kubernetes Custom Resource Definitions (CRDs) on top of a multi-cluster architecture
- Building a highly available Karmada control plane
- Establishing a consistent training job submission interface
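For context, the basic Karmada mechanism for spreading work is a PropagationPolicy like the one sketched below; the cluster names are illustrative, and Bloomberg's actual policies and scheduling logic are certainly more involved.

```python
# Propagate all batch Jobs to two member clusters via Karmada.
propagation_policy = {
    "apiVersion": "policy.karmada.io/v1alpha1",
    "kind": "PropagationPolicy",
    "metadata": {"name": "training-jobs", "namespace": "default"},
    "spec": {
        "resourceSelectors": [{"apiVersion": "batch/v1", "kind": "Job"}],
        "placement": {
            "clusterAffinity": {"clusterNames": ["dc1-gpu", "dc2-gpu"]},
        },
    },
}
# Apply against the Karmada control plane with the Kubernetes custom-objects
# client, e.g. CustomObjectsApi().create_namespaced_custom_object(...).
```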
Speakers
Leon Zhou
Software Engineer, Bloomberg
Leon Zhou is a software engineer on the Data Science Platform engineering team at Bloomberg. With prior NLP experience, he is now building ML platforms to facilitate machine learning development. He is interested in ML infrastructure to enable large-scale training and complex pipelines...
Yao Weng
Senior Software Engineer, Bloomberg
Yao Weng is a Senior Software Engineer on Bloomberg’s Data Science Platform engineering team. She has contributed extensively to optimizing the company’s Kubernetes environment for high performance compute, model inference, and workflow orchestration. Yao Weng obtained her Ph.D...
Salt Palace | Level 2 | 250
  AI + ML

2:00pm MST

From Vectors to Pods: Integrating AI with Cloud Native - Rajas Kakodkar, Broadcom; Kevin Klues, NVIDIA; Joseph Sandoval, Adobe; Ricardo Rocha, CERN; Cathy Zhang, Intel
Friday November 15, 2024 2:00pm - 2:35pm MST
The rise of AI is challenging long-standing assumptions about running cloud native workloads. AI demands hardware accelerators, vast data, efficient scheduling and exceptional scalability. Although Kubernetes remains the de facto choice, feedback from end users and collaboration with researchers and academia are essential to drive innovation, address gaps and integrate AI in cloud native. This panel features end users, AI infra researchers and leads of the CNCF AI and Kubernetes device management working groups, focused on:
- Expanding beyond LLMs to explore AI for cloud native workload management, memory usage and debugging
- Challenges with scheduling and scaling of AI workloads from the end user perspective
- OSS projects and innovation in AI and cloud native in the CNCF landscape
- Improving resource utilisation and performance of AI workloads
The next decade of Kubernetes will be shaped by AI. We don't yet know what this will look like; come join us to discover it together.
Speakers
Ricardo Rocha
Lead Platforms Infrastructure, CERN
Ricardo leads the Platform Infrastructure team at CERN with a strong focus on cloud native deployments and machine learning. He has led for several years the internal effort to transition services and workloads to use cloud native technologies, as well as dissemination and training...
Kevin Klues
Distinguished Engineer, NVIDIA
Kevin Klues is a distinguished engineer on the NVIDIA Cloud Native team. Kevin has been involved in the design and implementation of a number of Kubernetes technologies, including the Topology Manager, the Kubernetes stack for Multi-Instance GPUs, and Dynamic Resource Allocation (DRA...
Joseph Sandoval
Principal Product Manager, Adobe Inc.
Joseph Sandoval, a seasoned tech expert with 25 years in various roles running distributed systems and infrastructure platforms, thrives on empowering developers to scale their applications. An advocate for open source software, he harnesses its transformative power to champion change...
Cathy Zhang
Senior Principal Engineer, Intel
As a member of the CNCF TOC, Cathy has been sponsoring and guiding projects' applications for graduation/incubation and reviewing/approving new sandbox projects. She has been a committee member for several KubeCons. Cathy is currently a Senior Principal Engineer at Intel, leading...
Rajas Kakodkar
Senior Member of Technical Staff | Tech Lead, CNCF TAG Runtime, Broadcom
Rajas is a senior member of technical staff at Broadcom and a tech lead of the CNCF Technical Advisory Group, Runtime. He is actively involved in the AI working group in the CNCF. He is a Kubernetes contributor and has been a maintainer of the Kube Proxy Next Gen project. He has also...
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

2:55pm MST

Cloud-Native AI: Wasm in Portable, Secure AI/ML Workloads - Miley Fu, Second State
Friday November 15, 2024 2:55pm - 3:30pm MST
In this talk, we present Wasm as a pioneering solution for running AI/ML workloads in cloud-native environments. Our focus is on demonstrating how Wasm (on the server) facilitates the execution of AI models, such as Llama 3, Grok by X, and Mixtral, across diverse cloud and edge platforms without sacrificing performance. We will discuss the advantages of using Rust and WebAssembly in AI/ML workloads, highlighting aspects like portability, speed, and security. Real-world examples will illustrate the deployment of AI inference models using a Wasm runtime in Kubernetes environments, showcasing seamless orchestration and execution across varied devices. This session is aimed at cloud-native practitioners and AI/ML enthusiasts eager to explore innovative approaches in AI deployment.
Speakers
Miley Fu
DevRel, WasmEdge
Miley is a Developer Advocate with a passion for empowering developers to build and contribute to open source. With over 5 years of experience working on the WasmEdge runtime in the CNCF sandbox as a founding member, she has talked at KubeCon, KCD Shenzhen, CloudDay Italy, DevRelCon, Open Source...
Salt Palace | Level 2 | 250
  AI + ML

2:55pm MST

Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes - Arpit Singh & Abhijit Paithankar, NVIDIA
Friday November 15, 2024 2:55pm - 3:30pm MST
In Kubernetes-based ML platforms, job failures from hardware errors such as GPU malfunctions, network disruptions, ECC errors, and OOM events pose significant challenges. These failures cause resource underutilization, wasted engineering time, and high operational costs, often requiring users to resubmit jobs. Current AI/ML frameworks lack adequate fault tolerance strategies, typically requiring manual intervention and causing delays before jobs can resume. This talk explores fault tolerance strategies including naive job restarts on failure, job restarts with hot spares, and job restarts by replacing faulty nodes. We discuss how to achieve fault propagation by leveraging node and pod conditions, and address gaps in fault discovery and error propagation in the existing Kubernetes ecosystem. Our talk will also include ways to enhance components like the node-problem-detector and introduce new elements to close the gaps in fault detection, propagation, reaction, and remediation.
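As a concrete example of the "fault propagation via node conditions" pattern: a controller can watch for a condition published by a node-problem-detector plugin and taint the node so new work avoids it. The "GpuUnhealthy" condition type and the taint key below are assumptions, not standard names.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        # "GpuUnhealthy" is an assumed custom condition, e.g. set by a
        # node-problem-detector plugin watching XID/ECC events.
        if cond.type == "GpuUnhealthy" and cond.status == "True":
            # NOTE: this replaces the node's taint list wholesale; a real
            # controller would merge with existing taints.
            v1.patch_node(node.metadata.name, {"spec": {"taints": [{
                "key": "example.com/gpu-unhealthy",
                "value": "true",
                "effect": "NoSchedule",
            }]}})
            print(f"tainted {node.metadata.name}: {cond.reason}")
```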
Speakers
Abhijit Paithankar
Tech Lead and Engineering Manager, NVIDIA
Abhijit Paithankar is the AI and HPC Systems Tech Lead and Engineering Manager at NVIDIA, focusing on advanced computing technologies. Previously, he co-founded Crave.IO and served as CTO, and held key roles at Nutanix and VMware, developing critical hypervisor and storage solutions...
Arpit Singh
Senior Software Engineer, NVIDIA
Arpit Singh specializes in AI infrastructure at NVIDIA, enhancing deep learning applications. Besides being a Kubernetes contributor, Arpit has 10+ years of experience spanning NVIDIA, Nutanix and Cisco. He holds multiple patents (2 granted, 4+ pending) and has dual master's degr...
Salt Palace | Level 1 | Hall DE
  AI + ML

4:00pm MST

Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines on K8s - Meenakshi Kaushik & Shiva Krishna Merla, NVIDIA
Friday November 15, 2024 4:00pm - 4:35pm MST
In this session, we'll cover best practices for deploying, scaling, and managing LLM inference pipelines on Kubernetes (K8s). We'll explore common patterns like inference, retrieval-augmented generation (RAG), and fine-tuning. Key challenges addressed include:
1. Minimizing initial inference latency with model caching
2. Optimizing GPU usage with efficient scheduling, multi-GPU/node handling, and auto-quantization
3. Enhancing security and management with RBAC, monitoring, auto-scaling, and support for air-gapped clusters
We'll also demonstrate building customizable pipelines for inference, RAG, and fine-tuning, and managing them post-deployment. Solutions include (1) a lightweight standalone tool built using the operator pattern and (2) KServe, a robust open-source AI inference platform. This session will equip you to effectively manage LLM inference pipelines on K8s, improving performance, efficiency, and security.
Speakers
Meenakshi Kaushik
Product Manager, NVIDIA
Meenakshi Kaushik leads product management for the NIM Operator and KServe. Meenakshi is interested in the AI and ML space and is excited to see how the technology can enhance human well-being and productivity.
Shiva Krishna Merla
Senior Software Engineer, NVIDIA
Shiva Krishna Merla is a senior software engineer on the NVIDIA Cloud Native team, where he works on GPU cloud infrastructure, orchestration and monitoring. He is focused on enabling GPU-accelerated DL and AI workloads in container orchestration systems such as Kubernetes and OpenShift...
Salt Palace | Level 2 | 250
  AI + ML

4:00pm MST

Divide and Conquer: Master GPU Partitioning and Visualize Savings with OpenCost - Kaysie Yu & Ally Ford, Microsoft
Friday November 15, 2024 4:00pm - 4:35pm MST
Kubernetes is the ideal platform for running AI and ML workloads, such as LLMs. GPU nodes are often used for their parallel processing capabilities and higher performance benefits; however, they are known to be costly. Many factors impact the cost of running AI/ML workloads, such as GPU utilization, GPU VM size, and idle time. These costs are often ignored and considered inherent in running GPU workloads, but when workloads run at scale and are left unoptimized, costs quickly spin out of control. In this talk, we leverage the NVIDIA DCGM exporter with Prometheus for GPU metrics monitoring, alongside OpenCost to measure the Kubernetes spend of our GPU workloads. We will provide an overview of OpenCost, highlighting its role in bridging the gap between developer and platform teams through visibility and accountability of spend. We will demonstrate how to use the NVIDIA GPU Operator and how techniques such as partitioning can lead to significant cost savings.
Speakers
Ally Ford
Product Manager, Microsoft
Ally is a Product Manager on the Azure Kubernetes Service (AKS) team at Microsoft Azure. She spends her days collaborating with customers to design features that improve the end-to-end operator experience for both Linux and Windows users. Formerly she was a UX designer and project...
Kaysie Yu
Product Manager, Microsoft
Kaysie Yu is a Product Manager on the Azure Kubernetes Service team at Microsoft. She works on cost management and optimization and is passionate about the convergence of FinOps and GreenOps, advocating for best practices that help organizations achieve cost efficiency while contributing...
Salt Palace | Level 1 | Hall DE
  AI + ML

4:55pm MST

Best of Both Worlds: Integrating Slurm with Kubernetes in a Kubernetes Native Way - Eduardo Arango Gutierrez, NVIDIA & Angel Beltre, Sandia National Laboratories
Friday November 15, 2024 4:55pm - 5:30pm MST
It's not always clear which container orchestration system is best suited for a given use case. Slurm, for example, is often preferred over Kubernetes when running large-scale distributed workloads. As a result, organizations are often faced with a hard choice: do they deploy Slurm or Kubernetes to service the rising demands of their AI/ML workloads? In this talk, we introduce K-Foundry, an open-source custom controller for KCP that translates Kubernetes jobs to Slurm jobs and exposes Slurm nodes and cluster info as Kubernetes Custom Resource Definitions (CRDs). This integration combines Slurm's robust job scheduling with Kubernetes' dynamic orchestration and API-driven ecosystem, easing the administration of both clusters through a common API. This session will end with a live demo, where attendees will see how this integration bridges the gap between cloud and HPC, facilitating resource management and optimizing performance for large-scale AI and LLM tasks.
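The heart of such a controller is a translation from Job spec to batch script. Below is a deliberately simplified, self-contained sketch of that mapping; K-Foundry's real translation surely handles container runtimes, volumes, and accounting that this ignores.

```python
def job_to_sbatch(job: dict) -> str:
    """Map a Kubernetes Job manifest onto an sbatch script (illustrative only)."""
    container = job["spec"]["template"]["spec"]["containers"][0]
    gpus = container.get("resources", {}).get("limits", {}).get("nvidia.com/gpu", 0)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job['metadata']['name']}",
        f"#SBATCH --ntasks={job['spec'].get('parallelism', 1)}",
    ]
    if gpus:
        lines.append(f"#SBATCH --gpus-per-task={gpus}")
    # Real controllers launch the container via a Slurm container runtime;
    # here we just echo the intent.
    lines.append("srun echo would-run " + container["image"])
    return "\n".join(lines)

k8s_job = {
    "metadata": {"name": "train-llm"},
    "spec": {"parallelism": 4, "template": {"spec": {"containers": [{
        "image": "ghcr.io/example/trainer:latest",
        "resources": {"limits": {"nvidia.com/gpu": 8}},
    }]}}},
}
print(job_to_sbatch(k8s_job))
```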
Speakers
Eduardo Arango Gutierrez
Senior Systems Software Engineer, NVIDIA
Eduardo is a Senior Systems Software Engineer at NVIDIA, working on the Cloud Native Technologies team. Eduardo has focused on enabling users to build and deploy containers on distributed environments.
Angel Beltre
Senior Member of Technical Staff, Sandia National Laboratories
Angel Beltre serves as a senior member of the technical staff within the Scalable System Software department at Sandia National Laboratories. He is a contributor to the CSSE Computing-as-a-Service (CaaS) initiative, aimed at streamlining the deployment of modeling and simulation tools...
Salt Palace | Level 2 | 250
  AI + ML

4:55pm MST

Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh & Rupeng Liu, Google
Friday November 15, 2024 4:55pm - 5:30pm MST
Large language models have shown remarkable capabilities in various tasks, from text generation to code writing. However, the inference process for these models presents significant challenges. LLMs are computationally intensive, often requiring specialized hardware like TPUs or GPUs to achieve reasonable response times. In some cases their substantial size can strain the resources of a single machine. Specifically, models such as Gemini, Claude, and GPT-4 are too large to fit on any single GPU or TPU device, let alone on any single multi-accelerator machine, necessitating what we refer to as multi-node server deployment, where a single model server "backend" runs as a distributed process on multiple nodes to harness enough accelerator memory to fit and run the model. This talk presents LeaderWorkerSet, a new Kubernetes API that enables multi-node model inference. We demonstrate its capabilities by orchestrating state-of-the-art model servers such as vLLM and JetStream on both GPUs and TPUs.
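A LeaderWorkerSet treats a group of pods as one replica of the model server. Here is a minimal manifest as a Python dict, built from the public leaderworkerset.x-k8s.io/v1 API; the image and sizes are illustrative.

```python
# Two replicas of a 4-pod (1 leader + 3 workers) inference group; each group
# collectively serves one copy of the sharded model.
leader_worker_set = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "vllm-multinode"},
    "spec": {
        "replicas": 2,
        "leaderWorkerTemplate": {
            "size": 4,  # pods per group, leader included
            "workerTemplate": {"spec": {"containers": [{
                "name": "vllm-worker",
                "image": "vllm/vllm-openai:latest",  # illustrative image
                "resources": {"limits": {"nvidia.com/gpu": "8"}},
            }]}},
        },
    },
}
```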
Speakers
Abdullah Gharaibeh
Staff Software Engineer, Google
Abdullah is a staff software engineer at Google and a co-chair of SIG Scheduling and Working Group Batch. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.
Rupeng Liu
Software Engineer, Google
Rupeng Liu is a software engineer on Google's Kubernetes inference team.
Salt Palace | Level 1 | Hall DE
  AI + ML
 
