KubeCon + CloudNativeCon North America 2024
In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Salt Palace | Level 2 | 255 E
Friday, November 15
 

11:00am MST

Can You Put a Price Tag on Open Source? - Mario Fahlandt, Kubermatic & Bob Killen, CNCF
Friday November 15, 2024 11:00am - 11:35am MST
Earlier this year, Harvard Business School released a paper titled “The Value of Open Source Software,” estimating the worldwide value of OSS at $8.8 trillion and finding that, on average, it would cost companies at least 3.5x more to develop similar projects internally. Yet many organizations and engineers struggle to understand or realize this kind of value from contributing to these projects. In this talk, Bob and Mario will discuss the many benefits individuals and companies can achieve by contributing to open source and guide you through the first steps to becoming a contributor. They will also cover how to develop a lightweight open source strategy and convince your organization that an open-source-first approach can yield great returns.
Speakers
Mario Fahlandt

Service Delivery Architect, Kubermatic
Mario works as a Customer Delivery Architect at Kubermatic, with a focus on planning and building concepts and architectures for infrastructure in the cloud native world. He started the GDG Munich for Cloud and became a GDE in 2019. In the Kubernetes project he is involved in SIG ContribEx...
Bob Killen

Senior Technical Program Manager, CNCF
Bob is a Program Manager at the Google Open Source Programs Office with a focus on Cloud Native computing. He serves the Kubernetes project as a Steering Committee member and chair of the Contributor Experience SIG. Bob comes from an academic background, spending 15 years at the University...
Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 255 E
  Cloud Native Experience
  • Content Experience Level Any

11:55am MST

Improving Service Availability: Scaling Ahead with Machine Learning for HPA Optimization - Avni Sharma & Estela Ramirez, Intuit
Friday November 15, 2024 11:55am - 12:30pm MST
In this talk, we will explore how machine learning (ML) algorithms can enhance Kubernetes autoscaling beyond the traditional, reactive Horizontal Pod Autoscaler (HPA). Attendees will learn how to leverage recommendation algorithms to predict future load and usage patterns, allowing for smarter, proactive scaling decisions. This approach not only ensures high availability and responsiveness of applications but also offers a pathway to substantial cost optimizations by preventing over-provisioning and minimizing resource wastage.
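As a rough, hedged sketch of the proactive-scaling idea in this abstract, the snippet below forecasts the next interval's request rate from a short history window and feeds the prediction into the standard HPA replica formula, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). The window size, the target of 100 requests per second per pod, and the naive linear extrapolation are illustrative assumptions, not the recommendation algorithm used at Intuit.

```python
import math
from collections import deque

# Illustrative parameters (not Intuit's actual configuration).
TARGET_RPS_PER_POD = 100.0      # desired load per replica
WINDOW = 6                      # number of recent samples to extrapolate from
MIN_REPLICAS, MAX_REPLICAS = 2, 50

history = deque(maxlen=WINDOW)  # recent requests-per-second samples

def predict_next_rps() -> float:
    """Naive linear extrapolation of the recent trend; a stand-in for a real
    forecasting / recommendation model."""
    if len(history) < 2:
        return history[-1] if history else 0.0
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return max(0.0, history[-1] + slope)

def desired_replicas(current_replicas: int, predicted_rps: float) -> int:
    """Standard HPA formula, driven by the *predicted* metric instead of the
    currently observed one."""
    per_pod = predicted_rps / current_replicas
    desired = math.ceil(current_replicas * per_pod / TARGET_RPS_PER_POD)
    return min(MAX_REPLICAS, max(MIN_REPLICAS, desired))

# Example: load has been climbing, so we scale ahead of the peak.
for rps in (400, 480, 560, 650, 730, 820):
    history.append(rps)
print(desired_replicas(current_replicas=8, predicted_rps=predict_next_rps()))
```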
Speakers
Avni Sharma

Product Manager, Intuit
Avni is a Product Manager at Intuit, working on Intuit’s Modern SaaS Kubernetes platform. She also worked on ArgoCD as a PM. Avni is passionate about developer tooling and strives to make developers’ lives easier by delivering delightful experiences. She is also an Open Source...
Estela Ramirez

Software Engineer, Intuit Kubernetes Service, Intuit
Estela is a Software Engineer at Intuit focusing on the Intuit Kubernetes Developer Platform. She works on abstracting autoscaling for developers.
Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

2:55pm MST

Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes - Arpit Singh & Abhijit Paithankar, NVIDIA
Friday November 15, 2024 2:55pm - 3:30pm MST
In Kubernetes-based ML platforms, job failures from hardware errors such as GPU malfunctions, network disruptions, ECC errors, and OOM events pose significant challenges. These failures cause resource underutilization, wasted engineering time, and high operational costs, often requiring users to resubmit jobs. Current AI/ML frameworks lack adequate fault tolerance strategies, typically requiring manual intervention and causing delays before jobs can resume. This talk explores fault tolerance strategies including naive job restarts on failure, job restarts with hot spares, and job restarts by replacing faulty nodes. We discuss how to achieve fault propagation by leveraging node and pod conditions and address gaps in fault discovery and error propagation in the existing Kubernetes ecosystem. Our talk will also include ways to enhance components like the node-problem-detector and introduce new elements to close the gaps in fault detection, propagation, reaction, and remediation.
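As a hedged illustration of the fault-propagation pattern the abstract mentions, the sketch below uses the official Kubernetes Python client to scan node conditions (the kind of signal node-problem-detector can publish) and flag nodes reporting GPU-related problems. The condition types GpuXidError and GpuEccError are hypothetical examples; real condition names depend on how your problem detector is configured.

```python
from kubernetes import client, config

# Hypothetical custom condition types that a node-problem-detector plugin
# might publish for GPU faults; adjust to match your detector's configuration.
GPU_FAULT_CONDITIONS = {"GpuXidError", "GpuEccError"}

def find_faulty_gpu_nodes():
    """Return names of nodes whose conditions indicate a GPU fault."""
    config.load_kube_config()          # or config.load_incluster_config()
    v1 = client.CoreV1Api()
    faulty = []
    for node in v1.list_node().items:
        for cond in node.status.conditions or []:
            if cond.type in GPU_FAULT_CONDITIONS and cond.status == "True":
                faulty.append(node.metadata.name)
                break
    return faulty

if __name__ == "__main__":
    for name in find_faulty_gpu_nodes():
        # A remediation controller could cordon/drain the node here and
        # restart the affected job on a hot spare.
        print(f"GPU fault detected on node: {name}")
```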
Speakers
Abhijit Paithankar

Tech Lead and Engineering Manager, NVIDIA
Abhijit Paithankar is the AI and HPC Systems Tech Lead and Engineering Manager at NVIDIA, focusing on advanced computing technologies. Previously, he co-founded Crave.IO and served as CTO, and held key roles at Nutanix and VMware, developing critical hypervisor and storage solutions...
Arpit Singh

Senior Software Engineer, NVIDIA
Arpit Singh specializes in AI infrastructure at NVIDIA, enhancing deep learning applications. Besides being a Kubernetes contributor, Arpit has 10+ years of experience spanning NVIDIA, Nutanix, and Cisco. He holds multiple patents (2 granted, 4+ pending) and has dual master's degr...
Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

4:00pm MST

Divide and Conquer: Master GPU Partitioning and Visualize Savings with OpenCost - Kaysie Yu & Ally Ford, Microsoft
Friday November 15, 2024 4:00pm - 4:35pm MST
Kubernetes is the ideal platform for running AI and ML workloads, such as LLMs. GPU nodes are often used for their parallel processing capabilities and performance benefits; however, they are known to be costly. Many factors impact the cost of running AI/ML workloads, such as GPU utilization, GPU VM size, and idle time. These costs are often ignored and considered inherent in running GPU workloads, but when workloads run at scale and are left unoptimized, costs quickly spiral out of control. In this talk, we leverage the NVIDIA DCGM exporter with Prometheus for GPU metrics monitoring alongside OpenCost to measure the Kubernetes spend of our GPU workloads. We will provide an overview of OpenCost, highlighting its role in bridging the gap between developer and platform teams through visibility and accountability of spend. We will demonstrate how to use the NVIDIA GPU Operator and how techniques such as partitioning can lead to significant cost savings.
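As a rough sketch of the monitoring piece described here, the snippet below queries a Prometheus server that scrapes the NVIDIA DCGM exporter for per-pod GPU utilization over the Prometheus HTTP API; this is the kind of signal that OpenCost reports and partitioning decisions can build on. The Prometheus URL is an assumption, and while DCGM_FI_DEV_GPU_UTIL is a standard DCGM exporter metric, the label names in your deployment may differ.

```python
import requests

# Assumed in-cluster Prometheus endpoint; replace with your own.
PROM_URL = "http://prometheus.monitoring.svc:9090"

def gpu_utilization_by_pod():
    """Average GPU utilization per pod over the last hour, from the DCGM
    exporter metric scraped by Prometheus."""
    query = "avg by (exported_pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]))"
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"].get("exported_pod", "<unknown>"): float(r["value"][1])
            for r in results}

if __name__ == "__main__":
    for pod, util in sorted(gpu_utilization_by_pod().items()):
        # Sustained low utilization is a candidate for MIG/time-slicing
        # partitioning or a smaller GPU SKU.
        print(f"{pod}: {util:.1f}% average GPU utilization")
```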
Speakers
Ally Ford

Product Manager, Microsoft
Ally is a Product Manager on the Azure Kubernetes Service (AKS) team at Microsoft Azure. She spends her days collaborating with customers to design features that improve the end-to-end operator experience for both Linux and Windows users. Formerly she was a UX designer and project...
Kaysie Yu

Product Manager, Microsoft
Kaysie Yu is a Product Manager on the Azure Kubernetes Service team at Microsoft. She works on cost management and optimization and is passionate about the convergence of FinOps and GreenOps, advocating for best practices that help organizations achieve cost efficiency while contributing...
Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

4:55pm MST

Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh & Rupeng Liu, Google
Friday November 15, 2024 4:55pm - 5:30pm MST
Large Language Models have shown remarkable capabilities in various tasks, from text generation to code writing. However, the inference process for these models presents significant challenges. LLMs are computationally intensive, often requiring specialized hardware like TPUs or GPUs to achieve reasonable response times. In some cases their substantial size can strain the resources of a single machine. Specifically, models such as Gemini, Claude, and GPT-4 are too large to fit on any single GPU or TPU device, let alone on any single multi-accelerator machine, necessitating what we refer to as multi-node server deployment, where a single model server “backend” runs as a distributed process on multiple nodes to harness enough accelerator memory to fit and run the model. This talk presents LeaderWorkerSet, a new Kubernetes API that enables multi-node model inference. We demonstrate its capabilities by orchestrating state-of-the-art model servers such as vLLM and JetStream on both GPUs and TPUs.
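For readers unfamiliar with the API, here is a minimal, hedged sketch that creates a LeaderWorkerSet through the Kubernetes Python client's custom-objects interface. The field layout follows the upstream leaderworkerset.x-k8s.io/v1 API (replicas plus a leader/worker template group of a given size), but treat the exact spec fields, GPU counts, and the vLLM image as illustrative assumptions rather than a verified manifest.

```python
from kubernetes import client, config

# Illustrative manifest: 2 replica groups, each with 1 leader + 3 workers
# (size=4 nodes per model-server instance). Field names follow the
# leaderworkerset.x-k8s.io/v1 API but should be checked against your version.
lws = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "vllm-multinode", "namespace": "default"},
    "spec": {
        "replicas": 2,
        "leaderWorkerTemplate": {
            "size": 4,
            "leaderTemplate": {"spec": {"containers": [{
                "name": "vllm-leader",
                "image": "vllm/vllm-openai:latest",   # assumed image
                "resources": {"limits": {"nvidia.com/gpu": "8"}},
            }]}},
            "workerTemplate": {"spec": {"containers": [{
                "name": "vllm-worker",
                "image": "vllm/vllm-openai:latest",
                "resources": {"limits": {"nvidia.com/gpu": "8"}},
            }]}},
        },
    },
}

config.load_kube_config()  # assumes a local kubeconfig with cluster access
client.CustomObjectsApi().create_namespaced_custom_object(
    group="leaderworkerset.x-k8s.io",
    version="v1",
    namespace="default",
    plural="leaderworkersets",
    body=lws,
)
```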
Speakers
Abdullah Gharaibeh

Staff Software Engineer, Google
Abdullah is a staff software engineer at Google and a co-chair of SIG Scheduling and the Batch Working Group. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.
Rupeng Liu

Software engineer, Google
Rupeng Liu is a software engineer on Google's Kubernetes inference team.
Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 2 | 255 E
  AI + ML
 
