KubeCon + CloudNativeCon North America 2024
In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Salt Palace | Level 2 | 255 E
Friday, November 15
 

11:00am MST

Can You Put a Price Tag on Open Source? - Mario Fahlandt, Kubermatic & Bob Killen, CNCF
Friday November 15, 2024 11:00am - 11:35am MST
Earlier this year, Harvard Business School released a paper titled “The Value of Open Source Software,” estimating the worldwide value of OSS at $8.8 trillion and finding that, on average, it would cost companies at least 3.5x more to develop similar projects internally. Yet many organizations and engineers struggle to understand or realize this kind of value from contributing to these projects. In this talk, Bob and Mario will discuss the many benefits individuals and companies can achieve by contributing to open source and guide you through the first steps to becoming a contributor. They will also cover how to develop a lightweight open source strategy and convince your organization that an open-source-first approach can yield great returns.
Speakers
Mario Fahlandt

Service Delivery Architect, Kubermatic
Mario works as a Customer Delivery Architect at Kubermatic, with a focus on planning and building concepts and architectures for infrastructure in the cloud native world. He started the GDG Munich for Cloud and became a GDE in 2019. In the Kubernetes project he is involved in SIG ContribEx...
Bob Killen

Senior Technical Program Manager, CNCF
Bob is a Program Manager at the Google Open Source Programs Office with a focus on Cloud Native computing. He serves the Kubernetes project as a Steering Committee member and chair of the Contributor Experience SIG. Bob comes from an academic background, spending 15 years at the University...
Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 255 E
  Cloud Native Experience
  • Content Experience Level Any

11:55am MST

Improving Service Availability: Scaling Ahead with Machine Learning for HPA Optimization - Avni Sharma & Estela Ramirez, Intuit
Friday November 15, 2024 11:55am - 12:30pm MST
In this talk, we will explore how machine learning (ML) algorithms can enhance Kubernetes autoscaling beyond the traditional, reactive Horizontal Pod Autoscaler (HPA). Attendees will learn how to leverage recommendation algorithms to predict future load and usage patterns, allowing for smarter, proactive scaling decisions. This approach not only ensures high availability and responsiveness of applications but also offers a pathway to substantial cost optimizations by preventing over-provisioning and minimizing resource wastage.
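As a rough, hedged sketch of the proactive-scaling idea in this abstract, the snippet below forecasts the next interval's request rate from a short history window and feeds the prediction into the standard HPA replica formula, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). The window size, the target of 100 requests per second per pod, and the naive linear extrapolation are illustrative assumptions, not the recommendation algorithm used at Intuit.

```python
import math
from collections import deque

# Illustrative parameters (not Intuit's actual configuration).
TARGET_RPS_PER_POD = 100.0      # desired load per replica
WINDOW = 6                      # number of recent samples to extrapolate from
MIN_REPLICAS, MAX_REPLICAS = 2, 50

history = deque(maxlen=WINDOW)  # recent requests-per-second samples

def predict_next_rps() -> float:
    """Naive linear extrapolation of the recent trend; a stand-in for a real
    forecasting / recommendation model."""
    if len(history) < 2:
        return history[-1] if history else 0.0
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return max(0.0, history[-1] + slope)

def desired_replicas(current_replicas: int, predicted_rps: float) -> int:
    """Standard HPA formula, driven by the *predicted* metric instead of the
    currently observed one."""
    per_pod = predicted_rps / current_replicas
    desired = math.ceil(current_replicas * per_pod / TARGET_RPS_PER_POD)
    return min(MAX_REPLICAS, max(MIN_REPLICAS, desired))

# Example: load has been climbing, so we scale ahead of the peak.
for rps in (400, 480, 560, 650, 730, 820):
    history.append(rps)
print(desired_replicas(current_replicas=8, predicted_rps=predict_next_rps()))
```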
Speakers
Avni Sharma

Product Manager, Intuit
Avni is a Product Manager at Intuit, working on Intuit’s Modern SaaS Kubernetes platform. She also worked on ArgoCD as a PM. Avni is passionate about developer tooling and strives to make developers’ lives easier by delivering delightful experiences. She is also an Open Source...
Estela Ramirez

Software Engineer, Intuit Kubernetes Service, Intuit
Estela is a Software Engineer at Intuit focusing on the Intuit Kubernetes Developer Platform. She works on abstracting autoscaling for developers.
Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

2:55pm MST

Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes - Arpit Singh & Abhijit Paithankar, NVIDIA
Friday November 15, 2024 2:55pm - 3:30pm MST
In Kubernetes-based ML platforms, job failures from hardware errors such as GPU malfunctions, network disruptions, ECC errors, and OOM events pose significant challenges. These failures cause resource underutilization, wasted engineering time, and high operational costs, often requiring users to resubmit jobs. Current AI/ML frameworks lack adequate fault tolerance strategies, typically requiring manual intervention and causing delays before jobs can resume. This talk explores fault tolerance strategies including naive job restarts on failure, job restarts with hot spares, and job restarts by replacing faulty nodes. We discuss how to achieve fault propagation by leveraging node and pod conditions and address gaps in fault discovery and error propagation in the existing Kubernetes ecosystem. Our talk will also include ways to enhance components like the node-problem-detector and introduce new elements to close the gaps in fault detection, propagation, reaction, and remediation.
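As a hedged illustration of the fault-propagation pattern the abstract mentions, the sketch below uses the official Kubernetes Python client to scan node conditions (the kind of signal node-problem-detector can publish) and flag nodes reporting GPU-related problems. The condition types GpuXidError and GpuEccError are hypothetical examples; real condition names depend on how your problem detector is configured.

```python
from kubernetes import client, config

# Hypothetical custom condition types that a node-problem-detector plugin
# might publish for GPU faults; adjust to match your detector's configuration.
GPU_FAULT_CONDITIONS = {"GpuXidError", "GpuEccError"}

def find_faulty_gpu_nodes():
    """Return names of nodes whose conditions indicate a GPU fault."""
    config.load_kube_config()          # or config.load_incluster_config()
    v1 = client.CoreV1Api()
    faulty = []
    for node in v1.list_node().items:
        for cond in node.status.conditions or []:
            if cond.type in GPU_FAULT_CONDITIONS and cond.status == "True":
                faulty.append(node.metadata.name)
                break
    return faulty

if __name__ == "__main__":
    for name in find_faulty_gpu_nodes():
        # A remediation controller could cordon/drain the node here and
        # restart the affected job on a hot spare.
        print(f"GPU fault detected on node: {name}")
```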
Speakers
Abhijit Paithankar

Tech Lead and Engineering Manager, NVIDIA
Abhijit Paithankar is the AI and HPC Systems Tech Lead and Engineering Manager at NVIDIA, focusing on advanced computing technologies. Previously, he co-founded Crave.IO and served as CTO, and held key roles at Nutanix and VMware, developing critical hypervisor and storage solutions...
Arpit Singh

Senior Software Engineer, NVIDIA
Arpit Singh specializes in AI infrastructure at NVIDIA, enhancing deep learning applications. Besides being a Kubernetes contributor, Arpit has 10+ years of experience spanning NVIDIA, Nutanix, and Cisco. He holds multiple patents (2 granted, 4+ pending) and has dual master's degr...
Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

4:00pm MST

Divide and Conquer: Master GPU Partitioning and Visualize Savings with OpenCost - Kaysie Yu & Ally Ford, Microsoft
Friday November 15, 2024 4:00pm - 4:35pm MST
Kubernetes is the ideal platform for running AI and ML workloads, such as LLMs. GPU nodes are often used for their parallel processing capabilities and performance benefits; however, they are known to be costly. Many factors impact the cost of running AI/ML workloads, such as GPU utilization, GPU VM size, and idle time. These costs are often ignored and considered inherent in running GPU workloads, but when workloads run at scale and are left unoptimized, costs quickly spiral out of control. In this talk, we leverage the NVIDIA DCGM exporter with Prometheus for GPU metrics monitoring alongside OpenCost to measure the Kubernetes spend of our GPU workloads. We will provide an overview of OpenCost, highlighting its role in bridging the gap between developer and platform teams through visibility and accountability of spend. We will demonstrate how to use the NVIDIA GPU Operator and how techniques such as partitioning can lead to significant cost savings.
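As a rough sketch of the monitoring piece described here, the snippet below queries a Prometheus server that scrapes the NVIDIA DCGM exporter for per-pod GPU utilization over the Prometheus HTTP API; this is the kind of signal that OpenCost reports and partitioning decisions can build on. The Prometheus URL is an assumption, and while DCGM_FI_DEV_GPU_UTIL is a standard DCGM exporter metric, the label names in your deployment may differ.

```python
import requests

# Assumed in-cluster Prometheus endpoint; replace with your own.
PROM_URL = "http://prometheus.monitoring.svc:9090"

def gpu_utilization_by_pod():
    """Average GPU utilization per pod over the last hour, from the DCGM
    exporter metric scraped by Prometheus."""
    query = "avg by (exported_pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]))"
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"].get("exported_pod", "<unknown>"): float(r["value"][1])
            for r in results}

if __name__ == "__main__":
    for pod, util in sorted(gpu_utilization_by_pod().items()):
        # Sustained low utilization is a candidate for MIG/time-slicing
        # partitioning or a smaller GPU SKU.
        print(f"{pod}: {util:.1f}% average GPU utilization")
```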
Speakers
Ally Ford

Product Manager, Microsoft
Ally is a Product Manager on the Azure Kubernetes Service (AKS) team at Microsoft Azure. She spends her days collaborating with customers to design features that improve the end-to-end operator experience for both Linux and Windows users. Formerly she was a UX designer and project...
Kaysie Yu

Product Manager, Microsoft
Kaysie Yu is a Product Manager on the Azure Kubernetes Service team at Microsoft. She works on cost management and optimization and is passionate about the convergence of FinOps and GreenOps, advocating for best practices that help organizations achieve cost efficiency while contributing...
Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

4:55pm MST

Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh & Rupeng Liu, Google
Friday November 15, 2024 4:55pm - 5:30pm MST
Large Language Models have shown remarkable capabilities in various tasks, from text generation to code writing. However, the inference process for these models presents significant challenges. LLMs are computationally intensive, often requiring specialized hardware like TPUs or GPUs to achieve reasonable response times. In some cases their substantial size can strain the resources of a single machine. Specifically, models such as Gemini, Claude, and GPT-4 are too large to fit on any single GPU or TPU device, let alone on any single multi-accelerator machine, necessitating what we refer to as multi-node server deployment, where a single model server “backend” runs as a distributed process on multiple nodes to harness enough accelerator memory to fit and run the model. This talk presents LeaderWorkerSet, a new Kubernetes API that enables multi-node model inference. We demonstrate its capabilities by orchestrating state-of-the-art model servers such as vLLM and JetStream on both GPUs and TPUs.
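For readers unfamiliar with the API, here is a minimal, hedged sketch that creates a LeaderWorkerSet through the Kubernetes Python client's custom-objects interface. The field layout follows the upstream leaderworkerset.x-k8s.io/v1 API (replicas plus a leader/worker template group of a given size), but treat the exact spec fields, GPU counts, and the vLLM image as illustrative assumptions rather than a verified manifest.

```python
from kubernetes import client, config

# Illustrative manifest: 2 replica groups, each with 1 leader + 3 workers
# (size=4 nodes per model-server instance). Field names follow the
# leaderworkerset.x-k8s.io/v1 API but should be checked against your version.
lws = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "vllm-multinode", "namespace": "default"},
    "spec": {
        "replicas": 2,
        "leaderWorkerTemplate": {
            "size": 4,
            "leaderTemplate": {"spec": {"containers": [{
                "name": "vllm-leader",
                "image": "vllm/vllm-openai:latest",   # assumed image
                "resources": {"limits": {"nvidia.com/gpu": "8"}},
            }]}},
            "workerTemplate": {"spec": {"containers": [{
                "name": "vllm-worker",
                "image": "vllm/vllm-openai:latest",
                "resources": {"limits": {"nvidia.com/gpu": "8"}},
            }]}},
        },
    },
}

config.load_kube_config()  # assumes a local kubeconfig with cluster access
client.CustomObjectsApi().create_namespaced_custom_object(
    group="leaderworkerset.x-k8s.io",
    version="v1",
    namespace="default",
    plural="leaderworkersets",
    body=lws,
)
```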
Speakers
Abdullah Gharaibeh

Staff Software Engineer, Google
Abdullah is a staff software engineer at Google and a co-chair of SIG Scheduling and the Batch Working Group. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.
Rupeng Liu

Software engineer, Google
Rupeng Liu is a software engineer on Google's Kubernetes inference team.
Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 2 | 255 E
  AI + ML
 
