
The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
AI + ML
Friday, November 15
 

11:00am MST

Better Together! GPU, TPU and NIC Topological Alignment with DRA - John Belamaric, Google & Patrick Ohly, Intel
Friday November 15, 2024 11:00am - 11:35am MST
AI/ML workloads on Kubernetes demand ultra-high performance. If your training or multi-GPU inference job spans nodes, your GPUs will use the network, talking through a NIC over local PCIe. But not all NICs are equal! To get the best performance, you need a NIC that is as "close" to the GPU as possible. Unfortunately, the Kubernetes extended resources API does not have enough information and does not give you control over which specific devices are assigned. Dynamic Resource Allocation, the successor API, gives you this power. Come to this session to learn about DRA, how it is improving overall device support in K8s, and how to use it to allocate multiple GPUs, NICs, and TPUs to get the maximum performance out of your infrastructure.
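Concretely (an illustrative sketch, not the speakers' demo code): with DRA's structured parameters, a pod claims devices through a ResourceClaimTemplate and can ask that two requests be topologically aligned. The API group was resource.k8s.io/v1alpha3 as of Kubernetes 1.31 and is still evolving; the device class names and the alignment attribute below are hypothetical, driver-dependent values.

```yaml
# Sketch only: API version, device class names, and the attribute vary by
# release and vendor driver.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  name: gpu-with-nic
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # hypothetical vendor device class
      - name: nic
        deviceClassName: nic.example.com   # hypothetical vendor device class
      constraints:
      - requests: ["gpu", "nic"]
        matchAttribute: example.com/pcieRoot  # hypothetical topology attribute
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
  - name: accel
    resourceClaimTemplateName: gpu-with-nic
  containers:
  - name: train
    image: registry.example.com/trainer:latest  # hypothetical image
    resources:
      claims:
      - name: accel   # both aligned devices are handed to this container
```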
Speakers
Patrick Ohly
Principal Engineer, Intel
Patrick Ohly is a software engineer at Intel GmbH, Germany. In the past he has worked on performance analysis software for HPC clusters ("Intel Trace Analyzer and Collector") and cluster technology in general (PTP and hardware time stamping). Since January 2009 he has worked for Intel...
John Belamaric
Senior Staff Software Engineer, Google
John is a Sr Staff SWE, co-chair of K8s SIG Architecture and of K8s WG Device Management, helping lead efforts to improve how GPUs, TPUs, NICs and other devices are selected, shared, and configured in Kubernetes. He is also co-founder of Nephio, an LF project for K8s-based automation...
Salt Palace | Level 2 | 250
  AI + ML

11:55am MST

Building Massive-Scale Generative AI Services with Kubernetes and Open Source - John McBride, OpenSauced
Friday November 15, 2024 11:55am - 12:30pm MST
At OpenSauced, we power over 40,000 generative AI inferences every day, all through our in-house platform on top of Kubernetes. The cost of doing this kind of at-scale AI inference with a third-party provider API would be astronomical. Thankfully, using Kubernetes, the public cloud, and open-source technologies, we've been able to scale with relatively low costs and a lean stack. In this talk, John will walk through the journey of building a production-grade generative AI system using open source technologies, open large language models, and Kubernetes. We'll also explore why we chose to build on top of Kubernetes for our AI workloads rather than use a third-party provider, and how we're running and managing our AI/ML clusters today. Additionally, we'll dive into the techniques we used to tune our Retrieval-Augmented Generation pipelines for efficiency on top of Kubernetes, along with other practical tips for deploying your own AI services at scale.
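The abstract doesn't publish OpenSauced's manifests; as a generic sketch of the shape such a self-hosted stack takes, an open-model inference server is a GPU-backed Deployment behind a Service (all names and sizes below are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 3                    # scale out to absorb tens of thousands of daily inferences
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: server
        image: registry.example.com/open-llm-server:latest  # hypothetical image
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1    # one GPU per replica
---
apiVersion: v1
kind: Service
metadata:
  name: llm-inference
spec:
  selector:
    app: llm-inference
  ports:
  - port: 80
    targetPort: 8080
```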
Speakers
John McBride
Sr. Software Engineer, OpenSauced
John is a Sr. Software Engineer at OpenSauced where he also serves as Head of Infrastructure and AI engineer. He is the maintainer of spf13/cobra, the Go CLI bootstrapping library used throughout the CNCF landscape. In the past, he has worked on open source Kubernetes platforms...
Salt Palace | Level 2 | 250
  AI + ML
  • Content Experience Level Any

11:55am MST

Improving Service Availability: Scaling Ahead with Machine Learning for HPA Optimization - Avni Sharma & Estela Ramirez, Intuit
Friday November 15, 2024 11:55am - 12:30pm MST
In this talk, we will explore employing machine learning (ML) algorithms to enhance Kubernetes autoscaling capabilities beyond the traditional, reactive Horizontal Pod Autoscaler (HPA). Attendees will learn how to leverage recommendation algorithms to predict future load and usage patterns, allowing for smarter, proactive scaling decisions. This approach not only ensures high availability and responsiveness of applications but also offers a pathway to substantial cost optimization by preventing over-provisioning and minimizing resource wastage.
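A common way to wire such predictions into the stock HPA, shown here as a generic pattern rather than Intuit's actual implementation, is to surface the forecaster's output through an external-metrics adapter and scale on it:

```yaml
# Generic pattern: scale on a forecasted load metric served through an
# external-metrics adapter (metric name and target value are hypothetical).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-predictive
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: predicted_requests_per_second   # produced by the ML forecaster
      target:
        type: AverageValue
        averageValue: "100"                   # desired load per replica
```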
Speakers
Avni Sharma
Product Manager, Intuit
Avni is a Product Manager at Intuit, working on Intuit’s Modern SaaS Kubernetes platform. She also worked on Argo CD as a PM. Avni is passionate about developer tooling and strives to make developers' lives easier by delivering delightful experiences. She is also an Open Source...
Estela Ramirez
Software Engineer, Intuit Kubernetes Service, Intuit
Estela is a Software Engineer at Intuit focusing on the Intuit Kubernetes Developer Platform. She works on abstracting autoscaling for developers.
Salt Palace | Level 1 | Hall DE
  AI + ML

2:00pm MST

Bloomberg’s Journey to Improve Resource Utilization in a Multi-Cluster Platform - Yao Weng & Leon Zhou, Bloomberg
Friday November 15, 2024 2:00pm - 2:35pm MST
Bloomberg provides an on-premises Data Science Platform (DSP) built on cloud-native software to support internal AI model training. It runs on Kubernetes clusters spanning multiple data centers and featuring a diverse range of GPU types. However, managing such a large-scale, heterogeneous GPU environment poses many challenges, such as improving resource utilization, reducing operational costs, and scheduling workloads across different GPU types. In collaboration with the Karmada community, Bloomberg's DSP team has worked to tackle these challenges by addressing multi-cluster batch job management problems. This talk will delve into the approaches the team has adopted, including:
- Intelligently scheduling GPU workloads across multiple clusters
- Using Karmada's resource interpreter to support Kubernetes Custom Resource Definitions (CRDs) on top of a multi-cluster architecture
- Building a highly available Karmada control plane
- Establishing a consistent training job submission interface
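As a flavor of the Karmada pieces involved (an illustrative sketch, not Bloomberg's production policies), a PropagationPolicy spreads a training-job CRD across member clusters; Karmada's resource interpreter is what teaches the control plane to understand such CRDs:

```yaml
# Illustrative only: cluster names are hypothetical; the CRD shown is
# Kubeflow's PyTorchJob, which Karmada handles via a resource interpreter.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: training-jobs
spec:
  resourceSelectors:
  - apiVersion: kubeflow.org/v1
    kind: PyTorchJob
  placement:
    clusterAffinity:
      clusterNames:
      - dc1-a100          # hypothetical member cluster with one GPU type
      - dc2-v100          # hypothetical member cluster with another
```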
Speakers
Leon Zhou
Software Engineer, Bloomberg
Leon Zhou is a software engineer on the Data Science Platform engineering team at Bloomberg. With prior NLP experience, he is now building ML platforms to facilitate machine learning development. He is interested in ML infrastructure to enable large-scale training and complex pipelines...
Yao Weng
Senior Software Engineer, Bloomberg
Yao Weng is a Senior Software Engineer on Bloomberg’s Data Science Platform engineering team. She has contributed extensively to optimizing the company’s Kubernetes environment for high-performance computing, model inference, and workflow orchestration. Yao Weng obtained her Ph.D...
Salt Palace | Level 2 | 250
  AI + ML

2:00pm MST

From Vectors to Pods: Integrating AI with Cloud Native - Rajas Kakodkar, Broadcom; Kevin Klues, NVIDIA; Joseph Sandoval, Adobe; Ricardo Rocha, CERN; Cathy Zhang, Intel
Friday November 15, 2024 2:00pm - 2:35pm MST
The rise of AI is challenging long-standing assumptions about running cloud native workloads. AI demands hardware accelerators, vast data, efficient scheduling and exceptional scalability. Although Kubernetes remains the de facto choice, feedback from end users and collaboration with researchers and academia are essential to drive innovation, address gaps and integrate AI in cloud native. This panel features end users, AI infrastructure researchers and leads of the CNCF AI and Kubernetes device management working groups, focused on:
- Expanding beyond LLMs to explore AI for cloud native workload management, memory usage and debugging
- Challenges with scheduling and scaling of AI workloads from the end user perspective
- OSS projects and innovation in AI and cloud native in the CNCF landscape
- Improving resource utilization and performance of AI workloads
The next decade of Kubernetes will be shaped by AI. We don't yet know what this will look like; join us to discover it together.
Speakers
Ricardo Rocha
Lead Platforms Infrastructure, CERN
Ricardo leads the Platform Infrastructure team at CERN with a strong focus on cloud native deployments and machine learning. For several years he has led the internal effort to transition services and workloads to cloud native technologies, as well as dissemination and training...
Kevin Klues
Distinguished Engineer, NVIDIA
Kevin Klues is a distinguished engineer on the NVIDIA Cloud Native team. Kevin has been involved in the design and implementation of a number of Kubernetes technologies, including the Topology Manager, the Kubernetes stack for Multi-Instance GPUs, and Dynamic Resource Allocation (DRA...
Joseph Sandoval
Principal Product Manager, Adobe Inc.
Joseph Sandoval is a seasoned tech expert with 25 years in various roles running distributed systems and infrastructure platforms who thrives on empowering developers to scale their applications. An advocate for open source software, he harnesses its transformative power to champion change...
Cathy Zhang
Senior Principal Engineer, Intel
As a member of the CNCF TOC, Cathy has been sponsoring and guiding projects' applications for graduation/incubation, and reviewing/approving new sandbox projects. She has been a committee member for several KubeCons. Cathy is currently a Senior Principal Engineer at Intel, leading...
Rajas Kakodkar
Senior Member of Technical Staff | Tech Lead, CNCF TAG Runtime, Broadcom
Rajas is a senior member of technical staff at Broadcom and a tech lead of the CNCF Technical Advisory Group, Runtime. He is actively involved in the AI working group in the CNCF. He is a Kubernetes contributor and has been a maintainer of the Kube Proxy Next Gen project. He has also...
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

2:55pm MST

Cloud-Native AI: Wasm in Portable, Secure AI/ML Workloads - Miley Fu, Second State
Friday November 15, 2024 2:55pm - 3:30pm MST
In this talk, we present Wasm as a pioneering solution for running AI/ML workloads in cloud-native environments. Our focus is on demonstrating how Wasm (on the server) facilitates the execution of AI models, such as Llama 3, Grok by X, and Mixtral, across diverse cloud and edge platforms without sacrificing performance. We will discuss the advantages of using Rust and WebAssembly in AI/ML workloads, highlighting aspects like portability, speed, and security. Real-world examples will illustrate the deployment of AI inference models using a Wasm runtime in Kubernetes environments, showcasing seamless orchestration and execution across varied devices. This session is aimed at cloud-native practitioners and AI/ML enthusiasts eager to explore innovative approaches in AI deployment.
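Running Wasm workloads alongside regular containers typically hinges on a RuntimeClass. A minimal sketch, assuming nodes whose containerd is configured with the runwasi WasmEdge shim (the image name is hypothetical):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmedge
handler: wasmedge   # assumes the runwasi WasmEdge shim is registered in containerd
---
apiVersion: v1
kind: Pod
metadata:
  name: llm-wasm
spec:
  runtimeClassName: wasmedge   # route this pod to the Wasm runtime
  containers:
  - name: inference
    image: registry.example.com/llama3-wasm:latest  # hypothetical Wasm OCI image
```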
Speakers
Miley Fu
DevRel, WasmEdge
Miley is a Developer Advocate with a passion for empowering developers to build and contribute to open source. With over 5 years of experience working on the WasmEdge runtime, a CNCF sandbox project, as a founding member, she has spoken at KubeCon, KCD Shenzhen, CloudDay Italy, DevRelCon, Open Source...
Salt Palace | Level 2 | 250
  AI + ML

2:55pm MST

Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes - Arpit Singh & Abhijit Paithankar, NVIDIA
Friday November 15, 2024 2:55pm - 3:30pm MST
In K8s-based ML platforms, job failures from hardware errors such as GPU malfunctions, network disruptions, ECC errors, and OOM events pose significant challenges. These failures cause resource underutilization, wasted engineering time, and high operational costs, often requiring users to resubmit jobs. Current AI/ML frameworks lack adequate fault tolerance strategies, typically requiring manual intervention and causing delays before jobs can resume. This talk explores fault tolerance strategies including naive job restarts on failure, job restarts with hot spares, and job restarts that replace faulty nodes. We discuss how to achieve fault propagation by leveraging node and pod conditions and address gaps in fault discovery and error propagation in the existing Kubernetes ecosystem. Our talk will also cover ways to enhance components like the node-problem-detector and introduce new elements to close the gaps in fault detection, propagation, reaction, and remediation.
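Kubernetes already ships one of the building blocks this talk touches on: Job restart semantics that distinguish infrastructure faults from application errors. A minimal sketch using the batch/v1 podFailurePolicy field (stable in recent releases; the image and exit-code mapping are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train
spec:
  backoffLimit: 4
  podFailurePolicy:
    rules:
    - action: Ignore             # node drains/evictions don't consume retries
      onPodConditions:
      - type: DisruptionTarget
    - action: FailJob            # a genuine application error fails fast
      onExitCodes:
        operator: In
        values: [1]              # hypothetical "unrecoverable" exit code
  template:
    spec:
      restartPolicy: Never       # required when podFailurePolicy is set
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest  # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 8
```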
Speakers
Abhijit Paithankar
Tech Lead and Engineering Manager, NVIDIA
Abhijit Paithankar is the AI and HPC Systems Tech Lead and Engineering Manager at NVIDIA, focusing on advanced computing technologies. Previously, he co-founded Crave.IO and served as CTO, and held key roles at Nutanix and VMware, developing critical hypervisor and storage solutions...
Arpit Singh
Senior Software Engineer, NVIDIA
Arpit Singh specializes in AI infrastructure at NVIDIA, enhancing deep learning applications. Besides being a Kubernetes contributor, Arpit has 10+ years of experience spanning NVIDIA, Nutanix and Cisco. He holds multiple patents (2 granted, 4+ pending) and has dual master's degr...
Salt Palace | Level 1 | Hall DE
  AI + ML

4:00pm MST

Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines on K8s - Meenakshi Kaushik & Shiva Krishna Merla, NVIDIA
Friday November 15, 2024 4:00pm - 4:35pm MST
In this session, we'll cover best practices for deploying, scaling, and managing LLM inference pipelines on Kubernetes (K8s). We'll explore common patterns like inference, retrieval-augmented generation (RAG), and fine-tuning. Key challenges addressed include:
1. Minimizing initial inference latency with model caching
2. Optimizing GPU usage with efficient scheduling, multi-GPU/node handling, and auto-quantization
3. Enhancing security and management with RBAC, monitoring, auto-scaling, and support for air-gapped clusters
We'll also demonstrate building customizable pipelines for inference, RAG, and fine-tuning, and managing them post-deployment. Solutions include (1) a lightweight standalone tool built using the operator pattern and (2) KServe, a robust open-source AI inference platform. This session will equip you to effectively manage LLM inference pipelines on K8s, improving performance, efficiency, and security.
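KServe, one of the two solutions named, drives a model server from a single custom resource. A hedged sketch using a custom predictor container (image and sizing are hypothetical):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm
spec:
  predictor:
    minReplicas: 1               # keep one warm replica to avoid cold starts
    containers:
    - name: kserve-container
      image: registry.example.com/llm-server:latest  # hypothetical model server
      resources:
        limits:
          nvidia.com/gpu: 1
```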
Speakers
Meenakshi Kaushik
Product Management, NVIDIA
Meenakshi Kaushik leads product management for the NIM Operator and KServe. Meenakshi is interested in the AI and ML space and is excited to see how the technology can enhance human well-being and productivity.
Shiva Krishna Merla
Senior Software Engineer, NVIDIA
Shiva Krishna Merla is a senior software engineer on the NVIDIA Cloud Native team where he works on GPU cloud infrastructure, orchestration and monitoring. He is focused on enabling GPU-accelerated DL and AI workloads in container orchestration systems such as Kubernetes and OpenShift...
Salt Palace | Level 2 | 250
  AI + ML

4:00pm MST

Divide and Conquer: Master GPU Partitioning and Visualize Savings with OpenCost - Kaysie Yu & Ally Ford, Microsoft
Friday November 15, 2024 4:00pm - 4:35pm MST
Kubernetes is the ideal platform for running AI and ML workloads, such as LLMs. GPU nodes are often used for their parallel processing capabilities and performance benefits; however, they are known to be costly. Many factors impact the cost of running AI/ML workloads, such as GPU utilization, GPU VM size, and idle time. These costs are often ignored and considered inherent in running GPU workloads, but when workloads run at scale and are left unoptimized, costs will quickly spin out of control. In this talk, we leverage the NVIDIA DCGM exporter with Prometheus for GPU metrics monitoring, alongside OpenCost to measure the Kubernetes spend of our GPU workloads. We will provide an overview of OpenCost, highlighting its role in bridging the gap between developer and platform teams through visibility and accountability of spend. We will demonstrate how to use the NVIDIA GPU Operator and how techniques such as partitioning can lead to significant cost savings.
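With the NVIDIA GPU Operator, MIG partitioning is driven by a node label, and each slice surfaces as its own extended resource that DCGM metrics and OpenCost can attribute per workload. A sketch (profile names depend on the GPU model; the image is hypothetical):

```yaml
# After labeling a node for the GPU Operator's MIG manager, e.g.:
#   kubectl label node <node> nvidia.com/mig.config=all-1g.5gb --overwrite
# pods can request an individual MIG slice (mixed strategy shown):
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
  - name: worker
    image: registry.example.com/inference:latest  # hypothetical image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice instead of a whole GPU
```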
Speakers
Ally Ford
Product Manager, Microsoft
Ally is a Product Manager on the Azure Kubernetes Service (AKS) team at Microsoft Azure. She spends her days collaborating with customers to design features that improve the end-to-end operator experience for both Linux and Windows users. Formerly she was a UX designer and project...
Kaysie Yu
Product Manager, Microsoft
Kaysie Yu is a Product Manager on the Azure Kubernetes Service team at Microsoft. She works on cost management and optimization and is passionate about the convergence of FinOps and GreenOps, advocating for best practices that help organizations achieve cost efficiency while contributing...
Salt Palace | Level 1 | Hall DE
  AI + ML

4:55pm MST

Best of Both Worlds: Integrating Slurm with Kubernetes in a Kubernetes Native Way - Eduardo Arango Gutierrez, NVIDIA & Angel Beltre, Sandia National Laboratories
Friday November 15, 2024 4:55pm - 5:30pm MST
It's not always clear which container orchestration system is best suited for a given use case. Slurm, for example, is often preferred over Kubernetes when running large-scale distributed workloads. As a result, organizations are often faced with a hard choice: do they deploy Slurm or Kubernetes to serve the rising demands of their AI/ML workloads? In this talk, we introduce K-Foundry, an open-source custom controller for KCP that translates Kubernetes jobs to Slurm jobs and exposes Slurm nodes and cluster info as Kubernetes Custom Resource Definitions (CRDs). This integration combines Slurm’s robust job scheduling with Kubernetes' dynamic orchestration and API-driven ecosystem, easing the administration of both clusters through a common API. The session will end with a live demo, where attendees will see how this integration bridges the gap between cloud and HPC, facilitating resource management and optimizing performance for large-scale AI and LLM tasks.
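K-Foundry's actual CRDs are introduced in the talk itself; purely to illustrate the translation idea, a custom resource describing a Slurm job might look roughly like this (every field below is a hypothetical stand-in, not K-Foundry's real schema):

```yaml
# Hypothetical shape only; not K-Foundry's real API.
apiVersion: example.dev/v1alpha1
kind: SlurmJob
metadata:
  name: llm-pretrain
spec:
  partition: gpu            # Slurm partition to submit into
  nodes: 16                 # would translate to sbatch --nodes
  gpusPerNode: 8            # would translate to --gpus-per-node
  command:
  - srun
  - python
  - train.py
```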
Speakers
Eduardo Arango Gutierrez
Senior Systems Software Engineer, NVIDIA
Eduardo is a Senior Systems Software Engineer at NVIDIA, working on the Cloud Native Technologies team. Eduardo has focused on enabling users to build and deploy containers on distributed environments.
Angel Beltre
Senior Member of Technical Staff, Sandia National Laboratories
Angel Beltre serves as a senior member of the technical staff within the Scalable System Software department at Sandia National Laboratories. He is a contributor to the CSSE Computing-as-a-Service (CaaS) initiative, aimed at streamlining the deployment of modeling and simulation tools...
Salt Palace | Level 2 | 250
  AI + ML

4:55pm MST

Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh & Rupeng Liu, Google
Friday November 15, 2024 4:55pm - 5:30pm MST
Large Language Models have shown remarkable capabilities in various tasks, from text generation to code writing. However, the inference process for these models presents significant challenges. LLMs are computationally intensive, often requiring specialized hardware like TPUs or GPUs to achieve reasonable response times. In some cases their substantial size can strain the resources of a single machine. Specifically, models such as Gemini, Claude, and GPT-4 are too large to fit on any single GPU or TPU device, let alone any single multi-accelerator machine, necessitating what we refer to as multi-node server deployment, where a single model server “backend” runs as a distributed process on multiple nodes to harness enough accelerator memory to fit and run the model. This talk presents LeaderWorkerSet, a new Kubernetes API that enables multi-node model inference. We demonstrate its capabilities by orchestrating state-of-the-art model servers such as vLLM and JetStream on both GPUs and TPUs.
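LeaderWorkerSet lives in the leaderworkerset.x-k8s.io API group; a minimal sketch of sharding a model server across several pods per replica (image and sizes are hypothetical):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                    # two independent model-server replicas
  leaderWorkerTemplate:
    size: 4                      # each replica spans 4 pods (1 leader + 3 workers)
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: registry.example.com/vllm:latest  # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 8  # 8 GPUs per pod, 32 per replica
```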
Speakers
Abdullah Gharaibeh
Staff Software Engineer, Google
Abdullah is a staff software engineer at Google and co-chair of SIG Scheduling and Working Group Batch. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.
Rupeng Liu
Software Engineer, Google
Rupeng Liu is a software engineer on Google's Kubernetes inference team.
Salt Palace | Level 1 | Hall DE
  AI + ML
 
