KubeCon + CloudNativeCon North America 2024: Full Schedule

In-person
November 12-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

5:55pm MST

⚡ Lightning Talk: Is Everyone O-KEDA? “Exciting” Lessons Learned in Our Journey to Use KEDA Pod Autoscaling - Brian Davis, Red Canary

Tuesday November 12, 2024 5:55pm - 6:00pm MST

Hyatt Regency | Level 4 | Regency Ballroom B

We thought that changing our Kubernetes pod autoscaler seemed like a really straightforward thing to do. With relative ease, we yanked out our old custom pod autoscaler and replaced it with KEDA. We were impressed with the flexibility and control we now had in our cluster, but then discovered a set of really hard lessons that no one had anticipated. In this lightning talk, I’ll hit the highlights of secondary issues we encountered due to such a seemingly simple change, such as Docker Hub rate limits, Kubernetes metrics server failures and their exciting impact on our cluster, AWS rate limits, and late night fights with Argo CD for control of pod maximums. Lastly, I’ll share my personal favorite topic: the “Night Club Theory” of autoscaling tuning. If you or someone you love is thinking of changing your autoscaler, I recommend spending 5 minutes with me to learn the things you should be aware of before you make the switch!

Speakers

Brian Davis

Principal Software Engineer, Red Canary

Brian Davis is a Principal Engineer at Red Canary and has built complex systems for the past two decades. His career started in signal processing algorithm research but has morphed through the years into software engineering, QA, system integration, system design, and architectur...

Is Everyone O KEDA Brian Davis pdf

⚡ Lightning Talks, Operations + Performance

Content Experience Level Intermediate

6:00pm MST

⚡ Lightning Talk: Minimizing Data Loss Within the OpenTelemetry (OTel) Collector - Alex Kats, Capital One

Tuesday November 12, 2024 6:00pm - 6:05pm MST

Hyatt Regency | Level 4 | Regency Ballroom B

The OTel collector is meant to serve as a reliable and highly performant data pipeline. However, as a single component in a wider observability architecture, it is only as reliable as the downstream platforms and services it exports data to. The OTel collector has several built-in mechanisms that aim to minimize the impact of unhealthy downstream exporters, including an out-of-the-box sending queue with an additional configuration parameter for persistent queueing. There is also a new component in the OTel contrib distribution, the Failover Connector. The Failover Connector allows for dynamic routing or “failover” of telemetry data based on downstream exporter health. This significantly improves the data resiliency of the collector, as telemetry data can be continuously exported to a set of stable secondary locations while the issues with the primary are resolved.

Speakers

Alex Kats

Software Engineer, Capital One

Alex is a software engineer at Capital One. Alex has significant experience within the Observability space, with an emphasis on OpenTelemetry (OTel). Alex is a member of the OpenTelemetry community and has been contributing to various components within the OTel toolset.

Alex Kats Minimizing Data Loss pdf

⚡ Lightning Talks, Observability

Content Experience Level Intermediate

6:10pm MST

⚡ Lightning Talk: Safer Cluster Upgrades with Mixed Version Proxy - Richa Banker, Google

Tuesday November 12, 2024 6:10pm - 6:15pm MST

Hyatt Regency | Level 4 | Regency Ballroom B

Upgrading Kubernetes clusters often presents numerous challenges, including potential downtime, compatibility issues, and the complexity of managing multiple versions. The Mixed Version Proxy feature introduced in Kubernetes 1.28 aims to mitigate these challenges. This talk will delve into the technical intricacies of the Mixed Version Proxy, exploring its design and implementation. We will then highlight the substantial benefits it offers for cluster upgrades, such as minimizing downtime and enhancing overall reliability. Attendees will gain practical knowledge, possibly through a demonstration, of enabling and using the Mixed Version Proxy. Finally, we will provide insights into the future roadmap for this feature, including upcoming beta releases and enhancements.

Speakers

Richa Banker

Software Engineer, Google

Currently a software engineer at Google. Exploring and contributing to OSS Kubernetes on the side.

⚡ Lightning Talks, Operations + Performance

Content Experience Level Intermediate

11:15am MST

Architecting Tomorrow: The Heterogeneous Compute Resources for New Types of Workloads - Alexander Kanevskiy, Intel Finland

Wednesday November 13, 2024 11:15am - 11:50am MST

Salt Palace | Level 2 | 254 B

Imagine managing a set of diverse workloads on a Kubernetes node, operating across dozens of CPU cores and several memory zones. But do you truly comprehend the difference between one CPU core and another? Are you aware of the impact that different memory zones might have on your workload's efficiency? Will optimisations for one type of workload be helpful for another? Do you think that your ML workload will behave the same way as, e.g., Redis? This presentation delves deep into CPU internals, memory types (DRAM, HBM, CXL), and diverse cache/core types and layouts. Explore recent hardware advancements and their impact on workloads. We'll examine native compute resource allocation strategies from a hardware point of view, crucial for enhancing workload performance and optimising energy usage and cost efficiency. Join and learn details of the modern hardware architecture that gives you a framework to make more informed choices on hardware resource optimisation for your infrastructure.

Speakers

Alexander Kanevskiy

Principal Engineer, Cloud Orchestration Software, Intel Finland

Alexander is currently employed by Intel as Principal Engineer, Cloud Software, focusing on various aspects in Kubernetes: Resource Management, Device plugins for hardware accelerators, Cluster Lifecycle and Cluster APIs. Alexander has over 25 years of experience in areas of Linux...

NA2024 Architecting Tomorrow The Heterogeneous Compute Resources for New Types of Workloads pdf

Emerging + Advanced

Content Experience Level Intermediate

12:10pm MST

Operationalizing High-Performance GPU Clusters in Kubernetes: Lessons Learned from Training Databricks DBRX - Will Gleich & Wai Wu, Databricks

Wednesday November 13, 2024 12:10pm - 12:45pm MST

Salt Palace | Level 2 | 255 E

Training large language models (LLMs) on GPUs within Kubernetes environments involves significant configuration and complexity, often leading to unique failure scenarios. This presentation will cover the lessons learned from training DBRX, a state-of-the-art LLM, that we developed on a 400-node cluster with a primary workload utilizing 3072 GPUs and the tooling needed to measure and maintain a healthy fleet of nodes and underlying interconnect fabric.
This will include:

How we implemented GPU health detection leveraging Prometheus and DCGM Exporter
How we monitor GPU Direct Remote Direct Memory Access (GDRDMA) and the challenges of monitoring components that bypass CPU
Discussion of failure scenarios during training, and how they were addressed

Databricks Mosaic AI Training leverages GPU clusters across many cloud providers to maximize availability; we will also discuss the variations we see and how we had to engineer around them.

Speakers

Wai Wu

Software Engineer, Databricks

Hey everyone, I was born to deploy Kubernetes ꈍ ω ꈍ

Will Gleich

Sr. DevOps Engineer, Databricks

Will Gleich is a Sr. DevOps engineer at Databricks specializing in MLOps and Site Reliability Engineering.

Operationalizing High Performance GPU Clusters in Kubernetes Lessons Learned from Training Databricks DBRX pdf

AI + ML

Content Experience Level Intermediate

12:10pm MST

Can Your Kubernetes Network Handle the Heat? Building Resilience with AI Chaos - Lior Lieberman, Google & Surya Seetharaman, Red Hat

Wednesday November 13, 2024 12:10pm - 12:45pm MST

Salt Palace | Level 1 | 155 E

Kubernetes networking is complex with many APIs, numerous configurations and potential failure points. In the rapidly evolving world of cloud-native applications, ensuring your Kubernetes network can withstand unexpected failures is not just an advantage—it is a necessity. In this talk Surya and Lior, holding distinct leadership roles in Gateway API and NetworkPolicy API, will demonstrate how you can leverage AI-powered Chaos Engineering to stress test Gateways, NetworkPolicies, and Services on a live cluster! They will share their experiences and lessons learned from using Litmus and enhancing K8sGPT to design and execute AI Chaos experiments, as well as focusing on how you can proactively find gaps and bottlenecks in the network infrastructure. This is a great opportunity to learn from real-world disruption scenarios and participate in a collaborative discussion on how we can leverage AI to build robust Kubernetes Networks.

Speakers

Surya Seetharaman

Principal Software Engineer, Red Hat Inc.

Surya is an Open Source advocate and contributor, active in the Kubernetes SIG-Network working group. She is working as a Principal Software Engineer at Red Hat in the OpenShift Networking team. Her areas of interest include Cloud Infrastructure and Networked Services and Systems...

Lior Lieberman

Site Reliability Engineer, Google

Lior is a site reliability engineer at Google working on Google Compute Engine. He is a leading maintainer of ingress2gateway, and an active contributor to Kubernetes SIG Network focused on Gateway API.

Can Your Kubernetes Network Handle the Heat Building Resilience with AI Chaos pdf

Connectivity

Content Experience Level Intermediate

12:10pm MST

Automated Multi-Cloud, Multi-Flavor Kubernetes Cluster Upgrades Using Operators - Ziyuan Chen, Databricks

Wednesday November 13, 2024 12:10pm - 12:45pm MST

Salt Palace | Level 1 | 155 B

Databricks manages over a thousand k8s clusters across three major cloud providers which run critical workloads in cloud regions around the world. This talk describes the system we built to upgrade nodes’ operating system, k8s version, and other configs monthly, supporting EKS, AKS, GKE, and self-managed k8s. Our system is built on k8s operators and performs zero-downtime blue-green rolling updates, respects contracts with services with features like PDBs, maintenance windows, deferred node draining, and custom workload handling plugins. It enables easy rollbacks, has good observability, and incurs minimal human operational cost. This has allowed us to patch vulnerabilities and release infrastructure changes quickly and reliably across the fleet. We will also share our lessons learned on building several operators that work together using the controller-runtime framework, designing the declarative interfaces between them, and achieving consistent behavior across three clouds.

Speakers

Ziyuan Chen

Staff Software Engineer, Databricks

Ziyuan Chen is a software engineer at Databricks. He has worked on Databricks' compute and OS infrastructure areas.

Automated Multi Cloud, Multi Flavor Kubernetes Cluster Upgrades Using Operators pptx

Operations + Performance

Content Experience Level Intermediate
Presentation Slides Attached Yes

12:10pm MST

Automated Multi-Cloud Large Scale K8s Cluster Lifecycle Management - Sourav Khandelwal, Databricks

Wednesday November 13, 2024 12:10pm - 12:45pm MST

Salt Palace | Level 1 | Grand Ballroom H

I will present the system developed for cluster rotations across Databricks’ fleet of over a thousand cloud-managed k8s clusters on AWS, Azure, and GCP. Blue-green cluster rotations, or cluster swaps (upgrading by creating a new k8s cluster with a new version/configuration & shifting workloads from the old cluster), allow us to implement major infrastructure changes and upgrade k8s versions with low risk through staged rollouts, seamless rollbacks, zero downtime, and minimal operator intervention. Our system includes a k8s-style continuous reconciliation mechanism to manage cluster swap lifecycles, a fast and reliable cluster state change discovery system, and a k8s workload migration system. We will share methodologies and experiences in constructing this loosely coupled system that orchestrates product workloads and cloud provider APIs for automated cluster swaps. This session will explore the challenges faced, and the benefits of automating large-scale, multi-cloud k8s upgrades.

Speakers

Sourav Khandelwal

Sr. Software Engineer, Databricks

I am a seasoned software engineer with over 10 years of experience in designing and managing large-scale platforms in cloud-native environments. At Databricks, my significant contributions have been pivotal in launching our next-generation cloud infrastructure that helped to transition...

Automated Multi Cloud Large Scale K8s Cluster Lifecycle Management pdf

Platform Engineering

Content Experience Level Intermediate

12:10pm MST

The Hard Truth About GitOps and Database Rollbacks - Rotem Tamir, Ariga

Wednesday November 13, 2024 12:10pm - 12:45pm MST

Salt Palace | Level 2 | 250 AD

For two decades now, the common practice for handling rollbacks of database schema migrations has been pre-planned "down migration scripts". A closer examination of this widely accepted truth reveals critical gaps that result in teams relying on risky, manual operations to roll back schema migrations in times of crisis. In this talk, we show why our existing tools and practices cannot deliver on the GitOps promise of "declarative" and "continuously reconciled" workflows and how we can use the Operator Pattern to build a new solution for robust and safe schema rollbacks.

Speakers

Rotem Tamir

CTO, Ariga

Rotem Tamir (39), father of two. Co-founder and CTO of Ariga, co-maintainer of Atlas and Ent. Ex-data platform architect at Nexar, infrastructure team lead at ironSource.

SDLC

Content Experience Level Intermediate

2:30pm MST

Optimizing LLM Performance in Kubernetes with OpenTelemetry - Ashok Chandrasekar, Google & Liudmila Molkova, Microsoft

Wednesday November 13, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 2 | 255 E

Large Language Models are increasing in popularity and their deployments on Kubernetes have steadily increased. LLM applications bring new usage patterns in which the industry lacks expertise. At the same time, there is a lack of observability in these deployments, which makes it difficult to debug performance issues. We will present an end-to-end walkthrough of how you can leverage client and server LLM observability using OpenTelemetry, based on the recent efforts in the Kubernetes and OpenTelemetry communities to standardize these across LLM clients and model servers. We will also demonstrate how to troubleshoot a real-world performance issue in your LLM deployment and how to optimize your LLM server setup for better performance on Kubernetes. We'll show how to use Kubernetes autoscaling based on custom model server metrics and demonstrate how they offer a superior alternative to using GPU utilization metrics for such deployments.

Speakers

Liudmila Molkova

Principal Software Engineer, Microsoft

Liudmila Molkova is a Principal Software Engineer at Microsoft working on observability and Azure client libraries. She is a co-author of distributed tracing implementations across the .NET ecosystem including HTTP client instrumentation and Azure Functions. Liudmila is an active...

Ashok Chandrasekar

Senior Software Engineer, Google

Ashok Chandrasekar is a Senior Software Engineer at Google working on AI/ML experience for Google Kubernetes Engine. Previously he was a Staff Engineer at VMware where he led the cluster lifecycle management area for Tanzu Mission Control. He has 7 years of Kubernetes experience working...

optimizing llm performance pdf

AI + ML

Content Experience Level Intermediate

2:30pm MST

AIStore as a Fast Tier Storage Solution: Enhancing Petascale Deep Learning Across Cloud Backends - Abhishek Gaikwad & Aaron Wilson, NVIDIA

Wednesday November 13, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | Grand Ballroom A

As deep learning continues to evolve, the demand for handling petascale datasets efficiently becomes paramount. Current cloud storage solutions often struggle with the speed (throughput) and cost-effectiveness required for these massive datasets, particularly due to the random access needs of machine learning workloads. This talk introduces AIStore (AIS) as a fast-tier storage solution designed to overcome these challenges by offering a fast, scalable, cost-effective tier for deep learning data. AIS features linear scalability with each added storage node - in fact, with each added drive. In this presentation, we will explore the architecture and benefits of AIStore, focusing on its linear scalability and high performance. This session will feature detailed benchmarks and use cases comparing the performance of accessing cloud datasets with and without AIStore, highlighting AIS's ability to deliver high per-GPU throughput and stable latencies.

Speakers

Aaron Wilson

Senior Software Engineer, NVIDIA

I'm a developer on the AIStore project, focused mainly on our Kubernetes deployments and Python SDK for training workflows. Most recently, I've been working on the AIS K8s Operator and Helm charts, improving AIStore deployment and lifecycle management. My past experience has been...

Abhishek Gaikwad

Software Engineer, NVIDIA

Abhishek Gaikwad is a Software Engineer at NVIDIA with a Master of Science degree in Computer Science from San Jose State University. As a key developer and maintainer of AIStore, Abhishek has played a crucial role in its design, development, and management. His contributions include...

KCNA24 AIStore.pptx pdf

Data Processing + Storage

Content Experience Level Intermediate

2:30pm MST

Unifying Observability: Correlating Metrics, Traces, and Logs with Exemplars and OpenTelemetry - Kruthika Prasanna Simha & Charlie Le, Apple

Wednesday November 13, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | Grand Ballroom B

In modern distributed systems, observability is key to understanding application performance and behavior. While metrics, traces, and logs each provide valuable insights, their true power is realized when they are correlated. This talk will dive into the practical benefits and implementation of correlating these signals with exemplars using the OpenTelemetry SDK and Collector, and showcase the results in Grafana. Attendees will learn how to leverage OpenTelemetry to create exemplars which will allow them to navigate from either logs or metrics to their traces.

Speakers

Kruthika Prasanna Simha

Senior Software Engineer, Apple

Kruthika is a software engineer at Apple specializing in building ML enabled observability solutions. She holds a Masters in Computer Engineering and has specialized in Machine Learning. In her free time, she likes to dabble with Jupyter Notebooks for running experiments with data...

Charlie Le

Senior Software Engineer, Apple

Charlie is a software engineer at Apple, specializing in building and scaling cloud native observability solutions and infrastructure. Deeply inspired by the collaborative spirit of open source, he actively contributes to projects like Cortex and OpenTelemetry, shaping the future...

Observability

Content Experience Level Intermediate

2:30pm MST

Does My K8s Application Need CPR? Performance Evaluation of a Multi-Cluster Workload Management App - Braulio Dumba & Ezra Silvera, IBM

Wednesday November 13, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | 155 B

KubeStellar (KS) is an open-source Kubernetes multi-cluster workload configuration management system that can be used to manage AI workloads in multi-cluster environments. Hence, understanding KS performance is crucial, especially when managing resource-intensive AI workloads. In this talk, we will present our experience in analyzing the performance metrics of KS across several dimensions of scalability (e.g., number of bindingPolicies, workload description spaces and number of managed remote clusters) and the challenges that arise when conducting performance experiments in a multi-cluster environment. Our insights will demonstrate the utility of benchmarking the performance of a multi-cluster Kubernetes workload management application. Additionally, we will demonstrate the usefulness of several open-source tools such as clusterloader2, kube-burner & kwok for evaluating the performance of multi-cluster Kubernetes management applications.

Speakers

Ezra Silvera

Senior Technical Staff Member, IBM

Ezra Silvera is a Senior Technical Staff Member at IBM Research. His interests include distributed systems, cloud management, and cloud infrastructure. Ezra is passionate about open-source technologies and has been involved in several notable open source projects such as Docker, KubeVirt...

Braulio Dumba

Staff Research Scientist, IBM

Dr. Braulio Dumba is a Staff Research Scientist at IBM Research. In 2018, he joined IBM under the Hybrid Cloud organization. His current research is focused on edge computing and hybrid cloud computing. Dr. Dumba earned a Ph.D. in Computer Science from University of Minnesota, Twin...

KCNA24 KS Braulio Ezra pdf

Operations + Performance

Content Experience Level Intermediate

2:30pm MST

Better Pod Availability: A Survey of the Many Ways to Manage Workload Disruptions - Zach Loafman, Google

Wednesday November 13, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | Grand Ballroom H

Kubernetes Pods are ephemeral, but some are more ephemeral than others. Kubernetes provides a dizzying array of options to manage and handle Pod disruption. From PodDisruptionBudgets, to "safe-to-evict" annotations, GracefulTermination timeouts and more, it can be incredibly hard to determine the optimal solution for handling Pod disruption and how to manage gracefully terminating your application. Thankfully, due to the extensible nature of Kubernetes we can build CRDs and controllers that can simplify these complex topics for end users. In this talk, we'll present an in-depth analysis of the built-in options and how they work (or don't). While this problem is not unique to game-serving, we'll deep-dive and explain how Agones (an open-source session orchestration system layered on Kubernetes) solves this problem with a simple abstraction to hide the complexity!

Speakers

Zach Loafman

Staff Software Engineer, Google

Zach leads Google’s GKE Games team. He was previously lead of the Kubernetes Control Plane team for GKE, lead of the GKE Cluster Lifecycle team, worked on Kubernetes prior to GA, and was one of the founding members of the Google Kubernetes Engine team.

KCNA24 Better Pod Availability pdf

Platform Engineering

Content Experience Level Intermediate

2:30pm MST

Tutorial: Confidential Containers 101: A Hands-on Workshop - Archana Choudhary & Suraj Deshmukh, Microsoft

Wednesday November 13, 2024 2:30pm - 4:00pm MST

Salt Palace | Level 1 | Grand Ballroom G

Here is how you can be prepared for the workshop, to follow along: https://github.com/surajssd/kubecon-na24-workshop/blob/main/preparation.md

As traditional enterprises with stringent data protection requirements become cloud-native and migrate to Kubernetes on public clouds, they are wondering: “Is my data secure on this shared hardware? Can someone with host access snoop on my data?” With the upcoming Digital Operational Resilience Act (DORA) in Europe mandating data protection in use, it’s crucial for users to familiarize themselves with solutions like Confidential Containers (CoCo), a CNCF sandbox project. In this first-of-its-kind hands-on workshop, we’ll dive deep into using CoCo with k8s. We’ll explore real-world challenges, such as ensuring data confidentiality from platform owners (cloud providers), and show you how to overcome them. Through practical exercises, you’ll learn to set up CoCo and secure your containerized workloads, turning theory into practice. Attendees will discover streamlined practices, find robust protection mechanisms, and gain strategic insights into adopting CoCo.

Speakers

Suraj Deshmukh

Senior Software Engineer, Microsoft

Suraj has worked with Kubernetes since version 1.3. He organized the Kubernetes Bangalore meetup and helped bring Kubernetes to the masses. To make Kubernetes easier, he earlier worked on projects like Kompose, which converted docker-compose to Kubernetes artifacts. He has spoken...

Archana Choudhary

Ms, Microsoft

A software engineer who has been exploring cloud-native technologies, particularly focusing on confidential containers over the past several months.

Kubecon NA 2024 Confidential Containers 101 pptx

Tutorials, Security

Content Experience Level Intermediate

3:25pm MST

A Tale of 2 Drivers: GPU Configuration on the Fly Using DRA - Alay Patel & Varun Ramachandra Sekar US, Nvidia

Wednesday November 13, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 2 | 255 B

NVIDIA’s GeForce NOW is a cloud gaming service that allows users to stream video games from NVIDIA's servers to a wide range of devices, including PCs, Macs, Android devices, iOS devices, and smart TVs. Under the hood, it is powered by Kubernetes running KubeVirt VMs. For a seamless user experience, GeForce NOW dynamically switches GPU drivers to accommodate either passing through an entire GPU or slicing it into multiple virtual GPUs, all while keeping utilization close to 100% across the datacenter. This poses significant challenges when using the traditional device plugin API provided by Kubernetes. In this talk, we explore GeForce NOW’s journey to transition away from the traditional device plugin API in favor of Dynamic Resource Allocation (DRA). We'll share valuable insights for anyone looking to perform a similar migration of their own. Join us to learn about the challenges, solutions, and best practices to help optimize your GPU-accelerated workloads in the cloud.

Speakers

Alay Patel

Senior Software Engineer, Nvidia

Alay is a Senior Software Engineer at Nvidia where he works on scaling up the workloads running on GPU Compute infrastructure. He is passionate about open source with a focus on Kubernetes and platform engineering.

Varun Ramachandra Sekar US

Senior Software Engineer, Nvidia

Developer by day, Dog whisperer by night.

KCNA24 A Tale of 2 Drivers GPU Configuration on the Fly Using DRA (2) pdf

AI + ML

Content Experience Level Intermediate

3:25pm MST

Using OpenTelemetry for Deep Observability Within Messaging Queues - Shivanshu Raj Shrivastava & Ekansh Gupta, SigNoz

Wednesday November 13, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 1 | Grand Ballroom B

Recent changes in OpenTelemetry have introduced new semantic conventions and agent changes to better monitor messaging queues such as Kafka, RabbitMQ, and Amazon SQS. In this session, we'll discuss how those semantic conventions are standardizing the telemetry collected from producers, consumers, and the messaging queues, and how in-depth observability can be achieved by correlating producer-to-consumer spans with the metrics collected from Kafka. Additionally, we will demonstrate how, with Kafka Java client-side instrumentation enabled and JMX metrics collected from Kafka, OpenTelemetry instrumentation can help with metrics-to-trace and trace-to-metrics correlation and spot reasons for anomalies such as increased consumer lag, partition failures, and time taken by messaging queues. This also surfaces the corresponding traces in time, helping end users better delve into their infrastructure and optimize their asynchronous applications.

Speakers

Ekansh Gupta

SDE, SigNoz

Ekansh is a Software Development Engineer with SigNoz, with active involvement in various open-source and cloud native communities for upwards of two years now. He was previously an SDE Intern at SteamLabs. He has also given a couple of talks at PyCon, KubeCon and MozFests. Ekansh...

Shivanshu Raj Shrivastava

Founding Engineer, SigNoz

Shivanshu is a Founding Engineer at SigNoz, working on building an OTel native observability product. He has a keen interest in deep tech and OSS. He is a CNCF ambassador and a member of CNCF projects like OTel, k8s, and Istio. He has had the opportunity to mentor contributors in...

Using OTel for Deep observability within messaging queue pdf

Observability

Content Experience Level Intermediate

3:25pm MST

Setting New Standards for Reliability in Cloud Native Multi-Region Applications - Trey Caliva, Global Payments

Wednesday November 13, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 1 | 155 B

As a multinational FinTech provider, processing over 32 billion card transactions for 816 million accounts, Global Payments requires globally available architectures with quick disaster recovery while maintaining subsecond latencies. In addition, these workloads require strict adherence to compliance standards. This session will explore the high-level architectural decisions implemented in a cloud-native redesign and cloud migration of a mission critical legacy .NET application. Key cloud native tools leveraged include Kubernetes on GCP, and the use of CockroachDB as a cloud native database solution. We will explore how leveraging these cloud native technologies achieved extreme fault tolerance in a multi-region deployment, setting new standards for performance and reliability.

Speakers

Jim Hatcher

Solutions Engineer, Cockroach Labs

Trey Caliva

Principal Cloud Architect, Global Payments

Trey Caliva is an Architect and engineer with 10+ years of hands-on experience planning, developing, managing, and securing deployments in Google Cloud and AWS. He is currently Principal Cloud Architect at Global Payments, a Fortune 500 company and a member of the S&P 500 focused... Read More →

KCNA24 GP Setting New Standards for Reliability in Cloud Native Applications pdf

Operations + Performance

Content Experience Level Intermediate

3:25pm MST

Scale Job Triggering with a Distributed Scheduler - Cassie Coyle & Artur Souza, Diagrid

Wednesday November 13, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 2 | 250 AD

Imagine scheduling thousands or millions of jobs that are persisted, triggered on time, and resilient to downtime. Some jobs might be triggered every second while others need to be reliably triggered on the first day of the month. Achieving high throughput and reliability is critical for the performance and operational efficiency of modern distributed systems. How can traditional cron job scheduling be extended? How can distributed systems handle job scheduling with minimal downtime? What challenges arise when scaling job scheduling to thousands or millions of jobs? In this session, Artur and Cassie will delve into the design of Dapr’s distributed Scheduler and how users can start using it today. You will gain a comprehensive understanding of how Dapr’s Scheduler unblocks scalability of actors and workflows while also enabling new capabilities, like delayed pub/sub and a job scheduling API.

Speakers

Artur Souza

Head of Engineering, Diagrid

I have been a maintainer of Dapr since 2019, helping the project reach its 1.0 stable version and keep up frequent releases since then. Currently Head of Engineering at Diagrid, leading the engineering teams building Conductor and the next generation of managed cloud native APIs via Dapr... Read More →

Cassie Coyle

Software Engineer, Diagrid

Cassie, a devoted software engineer at Diagrid, actively contributes to Dapr, focusing on Go backend development to simplify the creation of resilient, event-driven, and microservices-based apps. She is a member of the Dapr Day and AppDeveloperCon 2024 program committees. Her work... Read More →

ScaleJobTriggeringWithADistributedScheduler pdf

SDLC

Content Experience Level Intermediate

3:25pm MST

CEL-Ebrating Simplicity: Mastering Kubernetes Policy Enforcement - Kevin Conner, Getup Cloud & Anish Ramasekar, Microsoft

Wednesday November 13, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 1 | 151 G

As Kubernetes deployments grow increasingly complex, robust policy enforcement is crucial. The Common Expression Language (CEL) provides a powerful solution, enabling the creation of sophisticated, human-readable expressions for Kubernetes policies. This session explores CEL's integration with Kubernetes, simplifying policy definition and enforcement.

Key takeaways:
- Fundamentals of CEL and its Kubernetes integration.
- Practical use cases for CEL in admission control, resource management, and security.
- Enhancing policy expressiveness and flexibility with CEL.
- Introduction to CEL Playground for testing and validating CEL expressions.

Through live demos, learn to leverage CEL and CEL Playground for streamlined policy management in Kubernetes. Ideal for administrators, developers, and DevOps professionals, this session equips you to enhance your Kubernetes policies using CEL. Join us to discover how CEL and CEL Playground can transform your Kubernetes policy management.

Speakers

Anish Ramasekar

Principal Software Engineer, Microsoft

Anish Ramasekar is a software engineer at Microsoft. He is on the Azure Container Upstream team building features for Kubernetes upstream and various CNCF projects that are part of the Azure Kubernetes Service. Anish is a maintainer of the Secrets Store CSI Driver project.

Kevin Conner

Chief Engineer, Getup Cloud

Kevin Conner is the Chief Engineer at GetUp Cloud, a startup focused on Kubernetes and DevSecOps. He has worked at startups like Integrated Micro Products, Arjuna Technologies, JBoss, and Aviatrix, as well as Sun Microsystems and Red Hat where he led teams for Cloud Enablement, Service... Read More →

CEL Ebrating Simplicity Mastering Kubernetes Policy Enforcement pdf

Security

Content Experience Level Intermediate

4:30pm MST

Making Kubernetes Simpler for Accelerated Workloads - Susan Wu, Google; Lucy Sweet, Uber; Mitch McKenzie, Weave; Aditya Shanker, Crusoe; Rebecca Weekly, Geico

Wednesday November 13, 2024 4:30pm - 5:05pm MST

Salt Palace | Level 2 | 255 B

Kubernetes and the open-source ecosystem for AI frameworks have been great for LLM innovation, empowering developers to build applications that use natural language as the interface to data. Yet, many developers and cluster operators struggle to put these frameworks into production use. In this session, hear from several platform engineers responsible for designing core infrastructure supporting accelerated workloads, services, and large language model training and inference pipelines. You can expect to come away with guidance, hear of pitfalls to watch out for, and learn how they successfully abstracted away infrastructure complexity to improve their research users' experience and velocity. Panelists include: Lucy Sweet, Senior Software Engineer (Infrastructure), Uber; Mitch McKenzie, Site Reliability Engineer - Machine Learning Operations, Weave; and Susan Wu, Outbound Product Manager, Google.

Speakers

Rebecca Weekly

Geico

Susan Wu

Outbound Product Manager, Google

Susan is an Outbound Product Manager for Google Cloud, focusing on GKE Networking and Network Security. She previously led product and technical marketing roles at VMware, Sun/Oracle, Canonical, Docker, Citrix and Midokura (part of Sony Group). She is a frequent speaker at conferences... Read More →

Lucy Sweet

Senior Software Engineer, Uber

Lucy is a Senior Software Engineer at Uber Denmark who works on software infrastructure.

Mitch McKenzie

Weave

Aditya Shanker

Senior PM, Crusoe

Aditya is a Senior PM at Crusoe Cloud, responsible for building out managed orchestration and platform services.

AI + ML

Content Experience Level Intermediate

4:30pm MST

Platform Performance Optimization for AI - a Resource Management Perspective - Antti Kervinen, Intel & Dixita Narang, Google

Wednesday November 13, 2024 4:30pm - 5:05pm MST

Salt Palace | Level 2 | 255 E

How much can node resource management affect AI workload performance? What options are there? What is the trade-off between total throughput and low latencies? In this talk we take a systematic approach to Platform Performance Optimization. We walk through the whole path from goal setting and gathering data to analysis, visualizations, and conclusions. At each stop along the path we share our practical experiences from a case of LLM inference optimization. You will find many considerations, findings, and practical tricks to take away. For instance, how to instrument PyTorch without touching the source or a container image, how to change what we are measuring without expensive new benchmark reruns, and how much more we can learn from visualizations compared to numeric averages and percentiles. Finally, we share real results from our case: how resource management increased total token throughput per worker node by more than 3.5x from the baseline.

Speakers

Antti Kervinen

Cloud Orchestration Software Engineer, Intel

Antti Kervinen is a Cloud Orchestration Software Engineer working at Intel, whose interest in Linux and distributed systems has led him from academic research of concurrency to the world of Kubernetes. When unplugged, Antti spends his time outdoors discovering wonders of nature.

Dixita Narang

Software Engineer, Google

Dixita Narang is a Software Engineer at Google on the Kubernetes Node team. With a primary focus on resource management within Kubernetes, Dixita is deeply involved in the development and advancement of the Memory QoS feature, which is currently in the alpha stage. She is a new contributor... Read More →

kubecon 2024na platform performance pdf

AI + ML

Content Experience Level Intermediate

4:30pm MST

From Observability to Performance - Nadia Pinaeva, Red Hat & Antonio Ojea, Google

Wednesday November 13, 2024 4:30pm - 5:05pm MST

Salt Palace | Level 1 | 155 E

No matter how fast the Services on your Kubernetes cluster are, users would love them to be faster. But how do you get from a huge pile of metrics across a distributed system to real user experience improvements? There is a way, and with the right tools and the right approach, you can better understand and evaluate Service performance. In this talk, you'll learn how to identify the performance parameters that directly translate to user experience. We will explore how to collect performance metrics from running Kubernetes clusters without disrupting normal operations using tools like Prometheus, Grafana, kube-burner, and custom instrumentation. We will discuss how to translate the collected metrics and analysis into concrete actions and how to identify bottlenecks and implement optimizations to enhance Service performance. This talk is ideal for k8s networking developers, administrators, SREs, DevOps engineers, and anyone responsible for managing or optimizing Kubernetes networking.

Speakers

Antonio Ojea

Software Engineer, Google

Antonio Ojea is a Software Engineer at Google, where he works on Kubernetes. He is one of the top contributors to the Kubernetes project, with a strong presence in the areas of networking and reliability. He has vast experience in open source, networking, and distributed systems... Read More →

Nadia Pinaeva

Senior Software Engineer, Red Hat

Nadia Pinaeva is a Senior Software Engineer at Red Hat working on OpenShift Networking. She collaborates with the SIG-network-policy to improve network security for Kubernetes clusters, and works on the ovn-kubernetes network plugin.

From Observability to Performance pdf

Connectivity

Content Experience Level Intermediate

4:30pm MST

Experience in Designing & Implementing a Cloud Native Framework for Farm Data Analytics - Braulio Dumba, IBM & Gloire Rubambiza, Cornell University

Wednesday November 13, 2024 4:30pm - 5:05pm MST

Salt Palace | Level 2 | 254 B

This work is based on 17 months of experience managing a digital agriculture platform that has aggregated and processed tens of gigabytes of data on 1500 cows on a commercial dairy farm. Significant challenges surfaced tied to multi-cluster management, fault tolerance, and privacy as the number of applications and farm management models grew. To bridge this gap, we designed and implemented a cloud native networked system for multi-cluster configuration and management of farm data analytics that leverages KubeStellar and the Software-Defined Farm paradigm. Our experience from designing, implementing, and deploying this framework showcases how Kubernetes can enable farmers and agribusinesses to leverage the power of containerization and cloud native computing to optimize workflows and streamline agricultural operations. This work presents progress towards cloud native, scalable, and fault-tolerant data analytics in digital farming with potential environmental, financial, and societal impacts.

Speakers

Braulio Dumba

Staff Research Scientist, IBM

Gloire Rubambiza

Ph.D. Candidate, Cornell University

Dr. Gloire Rubambiza is a postdoctoral associate in CS at Cornell University, where he conducts research in hybrid cloud computing for digital agriculture with an emphasis on societal impact. At Cornell, he was a University Fellow, a fellow of NSF National Research Traineeship in... Read More →

KCNA24 KS SDF Braulio Gloire 11102024 pdf

Emerging + Advanced

Content Experience Level Intermediate

4:30pm MST

Perform Laser Focused Deployments by Deciding in Advance the Blast Radius - Kostis Kapelonis, Octopus Deploy

Wednesday November 13, 2024 4:30pm - 5:05pm MST

Salt Palace | Level 2 | 250 AD

Progressive Delivery is an advanced deployment method that allows for zero-downtime application releases. Argo Rollouts is a Kubernetes controller that allows you to adopt progressive delivery in the form of blue/green and canary deployments. We see a lot of teams that choose an arbitrary number of clients that access the new version of a canary. Yes, it is very easy to send only 10% of the traffic to the new version of a Kubernetes deployment. But sometimes you want to choose WHICH 10% sees the new version. In this talk we will see several approaches to pinning specific clients to the old or new version, and advanced scenarios for sending canary traffic only to a specific subset of users, such as internal employees or customers who have expressed their interest in seeing brand new releases as soon as possible.
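
The difference between sending a random 10% of traffic and choosing which 10% can be illustrated with a small routing sketch. This is only a minimal illustration under stated assumptions, not Argo Rollouts' API or the speaker's method: the `X-Canary` header name, the `canary_bucket` and `route` helpers, and the salt value are all hypothetical. It pins explicitly flagged clients first, then falls back to a stable hash of the user ID so the same user always lands on the same version.

```python
import hashlib


def canary_bucket(user_id: str, salt: str = "rollout-1") -> int:
    """Map a user ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, headers: dict, canary_percent: int = 10) -> str:
    """Return "canary" or "stable" for a single request."""
    # Explicit pinning via a hypothetical header, e.g. set for internal
    # employees or customers who opted in to early releases.
    pinned = headers.get("X-Canary")
    if pinned == "always":
        return "canary"
    if pinned == "never":
        return "stable"
    # Everyone else: a deterministic hash keeps each user on one version
    # for the whole rollout instead of flipping between requests.
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"
```

Hashing with a per-rollout salt is one design choice among several: changing the salt reshuffles which users fall into the canary group, while keeping it fixed gives each user a consistent experience for the duration of a release.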

Speakers

Kostis Kapelonis

Developer Advocate, Codefresh by Octopus Deploy

Kostis is a software engineer/technical-writer dual class character. He lives and breathes automation, good testing practices and stress-free deployments with GitOps.

Laser focused deployments pdf

SDLC

Content Experience Level Intermediate

4:30pm MST

Expanding the Capabilities of Kubernetes Access Control - Jimmy Zelinskie, authzed & Lucas Käldström, Upbound

Wednesday November 13, 2024 4:30pm - 5:05pm MST

Salt Palace | Level 1 | 151 G

Kubernetes RBAC is an effective way of managing ACLs in one cluster. However, there are many other effective paradigms out there, such as Attribute- & Relation-based Access Control. In this talk, we’ll demystify how these differ and when to use each paradigm, giving context and guidance. We’ll highlight how Kubernetes access control has recently evolved towards supporting lots of different use cases. We take this opportunity to cover multiple perspectives: security within a single cluster (zooming in) and security within real-life production environments with external services and multiple clusters (zooming out). Just as containers first became ubiquitous through excellent tools like Docker, we believe the same can and will happen for access control, yielding uniform, interoperable, and understandable authorization. Finally, we'll propose future work that could be done to supercharge Kubernetes and ensure it keeps up with the ever-increasing security requirements in our industry.

Speakers

Lucas Käldström

Senior Software Engineer, Upbound

Lucas is a Kubernetes and cloud native expert who has been serving the CNCF community in lead positions for 6 years. He was awarded Top CNCF Ambassador 2017 together with Sarah Novotny. Lucas was a co-lead for SIG Cluster Lifecycle, co-created kubeadm, Weave Ignite, and ported Kubernetes to... Read More →

Jimmy Zelinskie

Co-founder, authzed

Jimmy Zelinskie is a software engineer and product leader with a goal of democratizing software via open source development. He's currently CPO of authzed where he's focused on bringing hyperscaler best-practices in authorization to the industry at large. At CoreOS, he helped pioneer... Read More →

Expanding the Capabilities of Kubernetes Access Control pdf

Security

Content Experience Level Intermediate

4:30pm MST

Tutorial: Get the Most Out of Your GPUs on Kubernetes with the GPU Operator - Eduardo Arango Gutierrez, Tariq Ibrahim, Amanda Moran & Christopher Desiniotis, NVIDIA; David Porter, Google

Wednesday November 13, 2024 4:30pm - 6:00pm MST

Salt Palace | Level 1 | Grand Ballroom G

NVIDIA’s GPU Operator has become the de facto standard for managing GPUs in Kubernetes at scale. This tutorial provides in-depth, hands-on training on the various GPU sharing techniques that are possible with the GPU Operator. Participants will learn to deploy jobs utilizing these sharing techniques, as well as get hands-on experience with the installation and configuration of the NVIDIA GPU Operator itself. This includes an in-depth exploration of its two primary CRDs: ClusterPolicy and NVIDIADriver. These CRDs are essential for configuring GPU-accelerated nodes, enabling GPU sharing mechanisms, and performing GPU driver upgrades. The session will culminate with practical use cases, such as training an AI/ML model, giving participants firsthand experience in managing a GPU-accelerated Kubernetes cluster.

Speakers

Christopher Desiniotis

Senior Systems Software Engineer, NVIDIA

Christopher Desiniotis is a Senior Systems Software Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator, a widely used tool for managing GPUs in Kubernetes, and is focused on increasing... Read More →

David Porter

Staff Software Engineer, Google

David Porter is a Staff Software Engineer at Google on the Kubernetes node team. David’s focus is on the kubelet node agent and the resource management area. He is the primary maintainer of cAdvisor, a resource monitoring library widely used in Kubernetes, a reviewer in SIG Node, and... Read More →

Eduardo Arango Gutierrez

Senior Systems Software Engineer, NVIDIA

Eduardo is a Senior Systems Software Engineer at NVIDIA, working on the Cloud Native Technologies team. Eduardo has focused on enabling users to build and deploy containers on distributed environments.

Tariq Ibrahim

Senior Software Engineer, NVIDIA

Tariq Ibrahim is a Senior Cloud Platform Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator. He has also contributed to several cloud native OSS projects like kube-state-metrics, Istio... Read More →

Amanda Moran

NVIDIA

Amanda has been working in technology since graduating from SCU in 2012 with a Master of Science in CS. Prior to this she had graduated with a BS in Biology from UW. Amanda has worked the last 12 years as a Software Engineer, a Solutions Architect, and an Engineering Manager... Read More →

Tutorials, AI + ML

Content Experience Level Intermediate

5:25pm MST

Building Resilience for Large-Scale AI Training: GPU Management, Failure Detection, and Beyond - Ganeshkumar Ashokavardhanan, Microsoft & Ace Eldeib, Cohere

Wednesday November 13, 2024 5:25pm - 6:00pm MST

Salt Palace | Level 1 | 155 E

As AI training scales to thousands of GPUs across hundreds of machines, hardware failure becomes an expensive risk. From GPU faults to network performance degradation, undetected problems can sabotage training jobs, inflating costs and slowing development. This talk dives into failure and orchestration challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate GPU failure fallout. Drawing on experience from cloud providers and training large language models, we will share best practices for efficient identification, remediation, and prevention of GPU failures.

Speakers

Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft

Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this Kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine... Read More →

Ace Eldeib

Staff Software Engineer, Cohere

Ace is a Staff Software Engineer at Cohere working on training and serving infrastructure for large language models. Prior to that, he worked on Azure Kubernetes Service and ran self-managed Kubernetes for other Azure services.

Upload Kubecon Resilience for AI Training pdf

AI + ML

Content Experience Level Intermediate

5:25pm MST

Production AI at Scale: Cloudera’s Journey in Building a Robust Inference Platform - Zoram Thanga & Peter Ableda, Cloudera

Wednesday November 13, 2024 5:25pm - 6:00pm MST

Salt Palace | Level 2 | 255 E

In this session, we talk about Cloudera AI Inference Service, a secure, large-scale platform for generative AI and predictive inference workloads, built using state-of-the-art Kubernetes, CNCF, and Apache open source projects. We take the audience through our journey in building this platform and share the experiences we gained along the way. The platform is built using openness, security, scalability, performance, and standards compliance as guiding principles. We demonstrate that it is possible to be open and secure at the same time, and that organizations can incorporate production-grade AI inferencing into their Big Data environments. This session will cover the architecture of the platform and explain how we handle performance, scaling, authentication, fine-grained authorization, and audit logging, all of which are critical considerations for production inferencing.

Speakers

Peter Ableda

Director, Product Management, Cloudera

Peter Ableda is the Director of Product Management for Cloudera’s AI product suite, bringing over a decade of experience in data management and advanced analytics. Holding a Master of Science degree in Computer Science from the Budapest University of Technology, Peter has dedicated... Read More →

Zoram Thanga

Principal Engineer, Cloudera

Zoram is a Principal Engineer, Enterprise AI Platform at Cloudera. He has been working in the software industry for over 23 years, and has been involved in building clustering software, containers, file systems, analytical query engines, and ML/AI platforms. He is a committer in the... Read More →

production ai at scale.pptx pdf

AI + ML

Content Experience Level Intermediate

5:25pm MST

Creating Paved Paths for Platform Engineers - Ritesh Patel, Nirmata; Abby Bangser, Syntasso; Viktor Farcic, Upbound; Nicholas Morey, Akuity; Praseeda Sathaye, Amazon

Wednesday November 13, 2024 5:25pm - 6:00pm MST

Salt Palace | Level 1 | Grand Ballroom H

The platform engineering team's role has evolved into a pivotal one as the custodian of the internal developer platform. However, these teams often find themselves in a quagmire of identifying the right components to include in their platforms, particularly in the ever-expanding CNCF landscape. This panel session discusses these challenges by exploring the concept of 'Paved Paths' as a strategic approach to guide platform teams in their journey of building an internal developer platform (IDP). 'Paved Paths' offers a solution by providing platform engineering teams with proven reference architectures (e.g. CNOE and the BACK Stack). This approach prevents them from starting from scratch and getting lost in the vast CNCF landscape. By offering proven and opinionated reference architectures, platform teams can focus on enhancing developer experiences and optimizing higher-level workflows rather than grappling with the complexities of identifying foundational components for their IDP.

Speakers

Viktor Farcic

Developer Advocate, Upbound

Viktor Farcic is a lead rapscallion at Upbound, a member of the CNCF Ambassadors, Google Developer Experts, CDF Ambassadors, and GitHub Stars groups, and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox.

Ritesh Patel

Co-Founder & VP Product, Nirmata

Ritesh Patel is Co-founder and leads Products at Nirmata, the creators of Kyverno. At Nirmata, he is responsible for commercial products for security and operations (SecOps) automation powered by policy as code. He also leads key technology partnerships. Ritesh has 20+ years of experience... Read More →

Praseeda Sathaye

Principal Specialist Solution Architect, Amazon (AWS)

Praseeda Sathaye is a Principal Specialist SA for App Modernization and Containers at Amazon Web Services based in Bay Area California. She has been focused on helping customers speed their cloud-native adoption journey by modernizing their platform infrastructure, internal architecture... Read More →

Nicholas Morey

Senior Developer Advocate, Akuity

Nicholas Morey is a Platform Engineer with a passion for DevOps practices. He is on the team at Akuity as a Developer Advocate, working with the community on anything Argo and Kargo-related. He is an experienced Argo CD operator and a Certified Kubernetes Administrator.

Abby Bangser

Principal Engineer, Syntasso

Abby is a Principal Engineer at Syntasso delivering Kratix, an open-source cloud-native framework for building internal platforms on Kubernetes. Her keen interest in supporting internal development comes from over a decade of experience in consulting and product delivery roles across... Read More →

Platform Engineering

Content Experience Level Intermediate

5:25pm MST

Taming Your Application’s Environments - Marcos Lilljedahl, Dagger & Mauricio "Salaboy" Salatino, Diagrid

Wednesday November 13, 2024 5:25pm - 6:00pm MST

Salt Palace | Level 2 | 250 AD

How coupled are your application's code and pipelines to their target cloud or on-prem environment? Kubernetes helps us to abstract how we run our workloads. However, there are other aspects, like infrastructure dependencies, service configuration, build process, deployment descriptors, etc., which need to be considered to make an application portable across multiple environments. Focusing on these aspects makes a big difference when migrating apps to reduce costs, meet compliance requirements, or leverage a specific tech only available somewhere else. Join us to cover three techniques you can implement to level up your SDLC:
- Modularizing and enhancing our delivery pipelines to simplify complex environments (Crossplane and Dagger).
- Building consistent experiences around well-known interfaces (CloudEvents, Dapr, and OpenFeature) to minimize runtime drift.
- Designing with separation of concerns to enable fast feedback loops between development and operation teams (Argo CD, Knative).

Speakers

Marcos Lilljedahl

Software Engineer, Dagger

Dad, Docker Captain, OSS lover, helmsman and wine drinker. Father of a joyful kid and wannabe surfer. I like listening to jazz music and tinkering with some fun projects when possible. Avid open source contributor.

Mauricio Salatino

OSS Software Engineer, Diagrid

Mauricio works as an Open Source Software Engineer at @Diagrid, contributing to and driving initiatives for the Dapr OSS project. Mauricio also serves as a Steering Committee member for the Knative Project and co-leads the Knative Functions initiative. He published a book titled... Read More →

SDLC

Content Experience Level Intermediate

5:25pm MST

From Observability to Enforcement: Lessons Learned Implementing eBPF Runtime Security - Anna Kapuścińska & Kornilios Kourtis, Isovalent

Wednesday November 13, 2024 5:25pm - 6:00pm MST

Salt Palace | Level 1 | 151 G

eBPF is being widely adopted in cloud native runtime security tools like Falco, KubeArmor, and Tetragon. Using eBPF we can collect relevant security events right in the kernel and pass them to Security Engineers for retroactive attack detection and response. Having reliable and complete visibility is great, but wouldn't it be even better to proactively prevent attacks in progress? This talk covers the Tetragon team’s experience moving from security observability to enforcement and lessons learned along the way: from defining security models to hardening interactions between the local kernel and distributed Kubernetes systems. It will take a deep dive into how eBPF-based enforcement works, why it differs from observability, and the challenges of implementing it. The audience will walk away understanding the inner workings and common pitfalls of eBPF-based runtime security.

Speakers

Kornilios Kourtis

Software Engineer, Isovalent at Cisco

I am a software engineer at Isovalent, working on cloud-native networking, security, and observability using eBPF. Before that, I worked in industrial (IBM) and academic research (ETH Zurich, NTU Athens) in systems, including operating systems, storage and network stacks, and high-performance... Read More →

Anna Kapuścińska

Software Engineer, Isovalent at Cisco

Anna is a software engineer at Isovalent, focusing on eBPF-based observability and security. Her previous roles span the industry: she wore both developer and SRE hats, and worked in AdTech, FinTech, public healthcare, end-user SaaS company and a hosting provider. On good weather... Read More →

Kubecon 2024 NA From Observability to Enforcement Lessons Learned Implementing eBPF Runtime Security pdf

Security

Content Experience Level Intermediate

6:00pm MST

🪧 Poster Session (PS01): Climatik: Cloud Native Sustainable LLM via Power Capping - Chen Wang, IBM & Vincent Hou, Bloomberg L.P.

Wednesday November 13, 2024 6:00pm - 8:00pm MST

Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

As GenAI workloads grow, the need for advanced accelerators with higher power consumption is surging. NVIDIA GPU peak power has risen from 300W for V100 to 1000W for B100. However, current power infrastructure and cooling systems are not designed to handle rapid power increases, leading to challenges like limited accelerator deployment in some regions or overheating risks that could cause fire hazards. We propose Climatik, a dynamic power capping system that enables data center and cluster admins and developers to set power caps dynamically at the cluster, service namespace, and rack levels. Climatik leverages Kepler for observability and offers APIs for integration with Kubernetes control knobs, including autoscalers, schedulers, and queuing systems, to ensure power caps are maintained across all levels. We will demo how to use Climatik to configure power capping for a large language model (LLM) inference service on KServe and show how power capping influences KEDA on autoscaling.

Speakers

Chen Wang

Senior Research Scientist, IBM

Chen Wang is a Staff Research Scientist at the IBM T.J. Watson Research Center. Her interests lie in Kubernetes, Container Cloud Resource Management, Cloud Native AI systems, and applying AI in Cloud system management. She is an open-source advocate, a Kubernetes contributor, and... Read More →

Vincent Hou

Senior Software Engineer, Bloomberg L.P.

Vincent Hou is a Chinese software engineer who used to study in Belgium and is currently working in the US. He has been an active open source contributor since 2010. He used to be an active contributor to the Cinder project, the OpenStack block storage service, and a core committer of OpenWhisk... Read More →

🪧 Poster Sessions, AI + ML

Content Experience Level Intermediate

6:00pm MST

🪧 Poster Session (PS02): 0.0.0.0 Day: Exploiting Localhost APIs from the Browser - Mic McCully, Oligo

Wednesday November 13, 2024 6:00pm - 8:00pm MST

Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

Browser-based attacks are not new in the malicious landscape of attack patterns. Browsers remain a popular infiltration method for attackers. While seemingly local, services running on localhost are accessible to the browser using a flaw we found, exposing the ports on the localhost network interface, and leaving the floodgates ajar to remote network attacks. In this live demo and attack simulation we’ll unveil a zero-day vulnerability (still under responsible disclosure) in Chrome and other browsers, and how we use the 0-day to attack developers behind firewalls. We will demonstrate remote code execution on a wildly popular open-source platform serving millions in the data engineering ecosystem, that seems to run on localhost. In our talk, we will present novel attack techniques, targeting developers and employees within an organization, that are behind firewalls. This will be a first-ever deep dive into this newly discovered zero-day vulnerability.

Speakers

Mic McCully

Field CTO, Oligo Security

Poster slide Kubecon (Mic) pdf

🪧 Poster Sessions, Security

Content Experience Level Intermediate

6:00pm MST

🪧 Poster Session (PS03): Unleashing the Power of Init and Sidecar Containers in Kubernetes - Carlos Sanchez & Natalia Angulo, Adobe

Wednesday November 13, 2024 6:00pm - 8:00pm MST

Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

This session dives deep into the power of init and sidecar containers, the issues they solve, and why they are so useful when managing Kubernetes workloads. We will explore real-world use cases that show how these tools can:
* Simplify complex deployments: Break down intricate deployments into manageable steps.
* Enhance security: Isolate security-critical tasks and ongoing security measures within your pods.
* Facilitate rapid and isolated changes: When everyone is interested in updating the same service, separation of concerns is critical for rapid development.
* Boost application functionality: Utilize sidecar containers to inject essential functionalities like logging, monitoring, and networking capabilities without modifying your main application code.
Our goal is to share our experience and challenges managing thousands of environments in Kubernetes, how we manage init and sidecar containers, and what problems they solve for us.

Speakers

Natalia Angulo

Software Developer Engineer, Adobe

Natalia Angulo is a Software Development Engineer at Adobe Experience Manager, contributing to Site Reliability tasks and the development of new features inside AEM, and especially helping with their infrastructure management. She is passionate about maths, coding puzzles and teaching...

Carlos Sanchez

Principal Scientist, Adobe

Carlos Sanchez is a Principal Scientist at Adobe Experience Manager, specializing in software automation, from build tools to Continuous Delivery and Progressive Delivery. Involved in Open Source for over 20 years, he is the author of the Jenkins Kubernetes plugin and a member of...

Poster Unleashing the Power of Init and Sidecar Containers KubeCon NA 2024 pdf

🪧 Poster Sessions, Operations + Performance

Content Experience Level Intermediate

6:00pm MST

🪧 Poster Session (PS04): Optimizing Pod Affinity in Kubernetes: A Mathematical Approach to Workload Placement - Jack Xue, Microsoft

Wednesday November 13, 2024 6:00pm - 8:00pm MST

Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

A standout feature of Kubernetes is its sophisticated mechanism for pulling container images from repositories, aligning containers with the appropriate pods, and strategically deploying pods to nodes that meet their resource requirements—such as CPU, GPU, RAM, network, and storage. This process adheres to the defined affinity and anti-affinity specifications between pods and nodes. Despite these capabilities, the challenge of optimally arranging a multitude of workloads, each comprising several pods within a cluster, remains an ongoing endeavor. In our research, we illustrate that a set of YAML files, which detail a workload deployment request, can be systematically transformed into a Binary Integer Linear Programming (BILP) model. Depending on the specific optimization goals, the objective functions of the model can be tailored accordingly. With the imposition of broad conditions, it is feasible to derive an optimal solution that adheres to polynomial time complexity constraints.

Speakers

Jack Xue

Principal Cloud Solution Architect, Microsoft

PhD & MBA. Principal Cloud Solution Architect, Microsoft

🪧 Poster Sessions, Platform Engineering

Content Experience Level Intermediate

6:00pm MST

🪧 Poster Session (PS06): What's Happening with SPIFFE and WIMSE? - Daniel Feldman, Qusaic

Wednesday November 13, 2024 6:00pm - 8:00pm MST

Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

This session will be a very brief overview of what's going on with the SPIFFE and WIMSE identity standards projects. SPIFFE is a CNCF effort to standardize workload identity implementations. That is, a SPIFFE implementation can grant services unique identities and credentials. WIMSE is an IETF effort to build on the SPIFFE foundation. In particular, it adds a new, unique token format that allows securely recording multi-hop identity information. Implementors will be able to use this token format to build complete, end-to-end, cryptographically auditable identity records.

Speakers

Daniel Feldman

Founder, Qusaic

Daniel Feldman has worked with many companies, large and small, to deploy SPIFFE and SPIRE zero-trust identity.

🪧 Poster Sessions, Security

Content Experience Level Intermediate

6:00pm MST

🪧 Poster Session (PS07): Unleashing the Power of Prediction to Proactively Scale Control Plane Components - Anubhav Aeron & Ryan Tay, Intuit

Wednesday November 13, 2024 6:00pm - 8:00pm MST

Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

At Intuit, control plane components such as IstioD are responsible for hundreds of applications per cluster. IstioD configures the data plane and injects the istio-proxy container. As application traffic increases, the number of application pods increases, which in turn requires the control plane to scale up. For critical control planes such as IstioD, it is wise to scale proactively rather than in reaction to an increase in load. Traditional approaches to scaling in advance, such as tuning HPA thresholds, can pre-scale when it is not required because of outliers, which is wasteful. At Intuit, a novel deep learning forecasting model called N-HiTS was employed to solve this issue. This session will discuss and demo how we train N-HiTS, our most important model features, and how we deploy our service on a per-cluster basis to provide contextualized predictions for cost-effective and on-time autoscaling.

Speakers

Anubhav Aeron

Senior Staff SE, Coupang

Anubhav is a seasoned software engineer in the field of Cloud Native Technologies, and has been doing Kubernetes and Service Mesh since 2016. He developed Redis Cluster as a Service, and a Templating Engine while working at Yahoo! He is the lead maintainer of Admiral, which is an...

Ryan Tay

Software Engineer, Intuit Inc.

As a software engineer on the Service Mesh team at Intuit, Ryan works to support Intuit's extensive Istio deployment through contributions to projects like Admiral. He has previously worked to reduce costs of cloud development environments for the Intuit API Gateway team. His main...

Kubecon IstioD Predictor pdf

🪧 Poster Sessions, Platform Engineering

Content Experience Level Intermediate

6:00pm MST

🪧 Poster Session (PS09): Kubernetes as a Geographically Distributed System - Chris Friesen, Wind River Systems

Wednesday November 13, 2024 6:00pm - 8:00pm MST

Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

Kubernetes was designed to be the best container orchestration platform on top of a cloud infrastructure in one data center. What do you do when you want to take your deployment and grow it across various geographical locations, but still keep it as part of one system? You will have to face complexity and figure out infrastructure management on a massive scale, and neither of these is easy to tackle. However, you don't have to go back to the drawing board, because a platform that delivers on these requirements and expectations already exists, and it is called StarlingX. The StarlingX project is a fully integrated, open source cloud platform that is running in production at large telecom operators, who rely on its distributed cloud architecture along with next-level container orchestration support, which is provided by Kubernetes. This talk will introduce the StarlingX platform, share highlights from its latest release and show how it takes Kubernetes to the next level!

Speakers

Chris Friesen

Member of Technical Staff, Wind River Systems

🪧 Poster Sessions, Operations + Performance

Content Experience Level Intermediate

6:00pm MST

🪧 Poster Session (PS10): Accepting Mortality: Strategies for Ultra-Long Running Stateful Workloads in K8s - Sebastian Beyvers & Maria Hansen, Giessen University

Wednesday November 13, 2024 6:00pm - 8:00pm MST

Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

"Pods are mortal" is a well-known quote in the official Kubernetes documentation. For ultra-long running stateful workloads that take months to complete, this mortality comes with its own challenges. How do you react to hardware failures? What resource quotas are appropriate? What if the workload has no built-in persistence and does all its work in memory? For such workloads, failures can be fatal, potentially wiping out months of work. This session will show that despite all the obstacles, Kubernetes can still be a reasonable choice for running stateful workloads that take months to complete. Using real-world examples based on production workflows, we will show how we design, configure, run, and operate such workloads using K8s and Argo workflows. We will also show how intelligent checkpointing using CRIU can help us deal with failures and enables us to avoid some problems even before they occur.

Speakers

Sebastian Beyvers

Distributed Systems Researcher, Giessen University

Sebastian Beyvers is a distributed systems researcher in bioinformatics and a cloud-native Rust developer at Giessen University. Sebastian's current work focuses on cloud-native data storage and processing solutions that try to harmonize existing national and international data ecosystems...

Maria Hansen

Research Associate, Giessen University

Maria Hansen is a research assistant in the field of (bio)informatics at Justus Liebig University Giessen. She is currently working on a cloud-native data orchestration system that aims to unite existing national and international data ecosystems.

PS10 Accepting Mortality pdf

🪧 Poster Sessions, Emerging + Advanced

Content Experience Level Intermediate

11:00am MST

Harnessing the Power of Envoy Proxy for Building an LLM Gateway - Idit Levine, Solo.io

Thursday November 14, 2024 11:00am - 11:35am MST

Salt Palace | Level 1 | 155 E

As the demand for LLMs continues to soar, the need for secure, cost-conscious, and content-aware control over their usage is paramount. In this talk, we explore why Envoy Proxy is the optimal choice for building an LLM gateway, leveraging its unique architecture and capabilities. Unlike traditional proxies (e.g. NGINX), which rely on scripting languages for customization, Envoy Proxy stands out due to its extensibility features: filter architecture, callout architecture (ext-proc, ext-auth), and ability to dynamically load libraries. Combined with its high-performance, async core (C++), Envoy can run as an ingress, egress, and mesh gateway. We'll look at using Envoy Proxy for LLM credential management, prompt guarding/decorating, analyzing content safety, usage controls, context-aware failover, and observability. Ideal for developers, architects, and tech enthusiasts looking to solve challenges around LLM usage and picking the right technologies for their platform infrastructure.

Speakers

Idit Levine

Founder & CEO, Solo.io

Idit Levine is the founder and CEO of Solo.io, a company that creates open-source tools to assist enterprises in adopting and extending innovative cloud-native technologies while modernizing their existing IT investments. Solo.io is a top contributor to CNCF projects such as Envoy...

Connectivity

Content Experience Level Intermediate

11:00am MST

Cooperative Scheduling for Stateful Systems - Michael Youssef & Zhantong Shang, LinkedIn

Thursday November 14, 2024 11:00am - 11:35am MST

Salt Palace | Level 1 | Grand Ballroom A

At LinkedIn, we develop many stateful systems and run them on tens of thousands of machines in our datacenters. As we moved LinkedIn’s infrastructure to Kubernetes, we quickly realized that StatefulSet was not going to be enough to support running critical stateful systems and satisfy the safety and durability goals of the teams developing them. We've built first-class support for running stateful workloads on bare metal, where the stateful systems can coordinate with Kubernetes to stay available and ensure durability. With our design, we support planned and unplanned maintenance and hardware swaps, and allow stateful systems to customize their rollout policies natively on Kubernetes. This talk covers:
- Our LiStatefulSet API.
- How we allow apps to customize safety checks and deployment policies via an ApplicationClusterManager, our pluggable policy engine.
- The ApplicationClusterManager protocol that allows coordination of the lifecycle of workloads with Kubernetes.

Speakers

Zhantong Shang

Sr. Software Engineer, LinkedIn

Michael Youssef

Staff Software Engineer, LinkedIn

Michael is a Staff Software Engineer at LinkedIn, currently making management and deployment of sharded systems a touch less painful on Kubernetes. In his free time he enjoys spending time with his cat, inhaling chocolate, and playing tennis.

LI Kubecon Cooperative Scheduling for Stateful Systems.pptx pdf

Data Processing + Storage

Content Experience Level Intermediate

11:00am MST

Kubernetes Workspaces: Enhancing Multi-Tenancy with Intelligent Apiserver Proxying - James Munnelly & Andrea Tosatto, Apple

Thursday November 14, 2024 11:00am - 11:35am MST

Salt Palace | Level 2 | 255 B

Multi-tenancy in Kubernetes means sacrificing essential features like cluster-scoped list/watches and multi-namespace/cluster-scoped RBAC. This often leads to additional complexity when configuring operators and forces discrepancies and friction with cluster-as-a-service type offerings. In this talk we will go through a demonstration of an intelligent Kubernetes apiserver proxy that introduces the concept of a ‘workspace’. Borrowing the name from the KCP project, a Workspace is a virtual apiserver endpoint that provides a ‘cluster-scoped’ view over a group of namespaces in a remote cluster. We’ll then go on to discuss optimisations and changes that we’d like to make within Kubernetes to better support apiserver proxying for multi-tiered caching, routing and scoping purposes.

Speakers

James Munnelly

Staff Field Engineer, Apple

James Munnelly is a Field Engineer at Apple, helping customers adopt and adapt Kubernetes, and driving adoption of OSS cloud native technologies. James is also the founder of the cert-manager project, a Kubernetes extension for managing x509 certificates. He's an active member of...

Andrea Tosatto

Site Reliability Engineer, Apple

Andrea works at Apple as a Site Reliability Engineer. His day-to-day job consists of managing the lifecycle and ensuring the reliability of a multi-tenant compute platform built on top of Kubernetes. He is deeply passionate about multi-tenancy and any related topic, ranging from runtime...

Emerging + Advanced

Content Experience Level Intermediate

11:00am MST

Navigating the Cgroup Transition: Bridging the Gap Between Kubernetes and User Expectations - Sohan Kunkerkar, Red Hat Inc

Thursday November 14, 2024 11:00am - 11:35am MST

Salt Palace | Level 1 | 155 B

As Kubernetes and container technologies evolve, shifting from cgroup v1 to cgroup v2 has become a pivotal development. With cgroup v2 available in Kubernetes since v1.25, we're at a crossroads where many users and organizations must decide when and how to transition fully to this new system. Despite the benefits of cgroup v2, including better resource management and enhanced capabilities, users frequently encounter unexpected challenges signaling a gap in readiness and understanding. This talk will address the practical implications of moving to cgroup v2, discuss the coordinated efforts to deprecate cgroup v1, and propose actionable strategies to bridge the gap between the Kubernetes community, system administrators, and developers. By focusing on real-world experiences and providing clear guidance, this session aims to equip you with the knowledge and tools to navigate this significant change confidently.

Speakers

Sohan Kunkerkar

Senior Software Engineer, Red Hat Inc

Sohan Kunkerkar is a Senior Software Engineer at Red Hat, bringing expertise in distributed systems, backend engineering, and containers. His active contributions extend to CRI-O, a container runtime engine, and various sub-projects within the Kubernetes SIG Node community. Sohan...

Navigating the Cgroup Transition pdf

Operations + Performance

Content Experience Level Intermediate

11:00am MST

How We Made OpenTelemetry Be Our Fitness Tracker for Your CI/CD Pipelines! - Nicolas Woerner, Clario & Andreas Grabner, Dynatrace

Thursday November 14, 2024 11:00am - 11:35am MST

Salt Palace | Level 2 | 250 AD

CI/CD pipelines are the heartbeat of modern cloud-native software delivery. Healthy pipelines ensure rapid and continuous deployments every time code gets committed to the Git repositories! Every new repository and commit puts more load on the CI/CD tool making it more challenging to keep this crucial heartbeat healthy! In this session, engineers from Clario will demonstrate how they leverage OpenTelemetry to observe, validate, report and optimize their CI/CD pipelines, keeping their deployments healthy despite increased scale and unlocking the full potential of modern software delivery on Kubernetes with GitLab.

Speakers

Andi Grabner

CNCF Ambassador and DevRel, Dynatrace

Andreas Grabner (@grabnerandi) has 20+ years of experience as a software developer, tester and architect and is an advocate for high-performing cloud scale applications. He is a CNCF ambassador, contributor to the CNCF project Keptn and a DevRel for Dynatrace. Andreas is also a regular...

Nicolas Woerner

Associate DevOps Engineer, Clario

Nicolas Wörner works in the Platform Engineering Team at Clario. With a background in software and DevOps engineering he focuses on continuously enhancing the software delivery workflow at Clario. Nicolas is passionate about leveraging CNCF software to drive efficiency and reliability...

Kubecon NA 2024 Gitlab CICD Pipelines Otel pdf

SDLC

Content Experience Level Intermediate

11:55am MST

How to Move from Ingress to Gateway API with Minimal Hassle - Keith Mattix, Microsoft

Thursday November 14, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | 155 E

For many, the Ingress resource was one of the first Kubernetes APIs they used, adding HTTP routing rules and SSL certs for cluster-external traffic. These APIs are used for production in clusters across the world today, configuring ingress gateways serving hundreds of thousands of connections per second. As of October 2023, the Ingress API has been superseded by the Gateway API, a new set of Kubernetes resources with over 20 implementations that enforces security best practices by design. However, migrating networking APIs is an intimidating task, and doing so safely is every company’s primary concern. Join this session to learn how to make this migration safe by identifying the best migration path, implementing Gateway API best practices, and utilizing community-supported migration tools such as ingress2gateway.

Speakers

Keith Mattix

Senior Software Engineering Lead, Microsoft

Keith Mattix is an Engineering Lead at Microsoft focused on Istio, Gateway API, and other networking projects.

How to Move from Ingress to Gateway API with Minimal Hassle pptx

Connectivity

Content Experience Level Intermediate

11:55am MST

Database DevOps: CD for Stateful Applications - Stephen Atwell, Harness.io & Christopher Crow, Pure Storage

Thursday November 14, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | Grand Ballroom A

Running stateful applications on Kubernetes can provide many of the same advantages as stateless applications. In this talk, Stephen and Chris will share some thoughts on managing stateful applications as part of a CD pipeline so that applications, and their data, can be versioned and deployed safely and repeatedly. This talk will discuss managing persistent data within Kubernetes, as well as managing structural changes to a database as part of a CD process. With Kubernetes and Liquibase, we can provide something better than before: a more testable, repeatable, and open way to deploy stateful applications. This talk features a practical demo of how CD tooling can empower users to automate data migrations within Kubernetes.

Speakers

Christopher Crow

Technical Marketing Engineer, Pure Storage

Chris Crow works as a cloud architect at Portworx. He has worked previously as an education, systems administrator. He is a lifelong open-source enthusiast.

Stephen Atwell

Principal Product Manager, Harness.io

With over 26 years of technology experience, Stephen focuses on solving problems encountered in his previous roles. Currently he is focused on database DevOps at Harness. He has worn hats ranging from network administrator, to database administrator, to software engineer, to product...

Harness Portworx Kubecon 2024 pdf

Data Processing + Storage

Content Experience Level Intermediate

11:55am MST

Multi-Zone Clusters Inside and Out - Tom Dean & Phil Henderson, Buoyant

Thursday November 14, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | 155 B

Multi-zone clusters are a great tool for improving application reliability — and also a great way to spend a ton of cash. Why? What really happens when you set these things up? How do you use them effectively without bankrupting your whole organization? In this session, we'll dig into the nuts and bolts of what goes on under the hood of a multi-zone cluster, including what a zone is, what Kubernetes understands about zones, how zones affect routing, and why multi-zone clusters can drive costs up. We'll spend some time on Kubernetes' Topology Aware Routing, covering its advantages as well as its very real limitations. Finally, we'll dive into how you can influence Kubernetes' choices to take advantage of multi-zone clusters' reliability while containing costs. Join us for learning and live demos!

Speakers

Phil Henderson

Customer Success Engineer, Buoyant

Tom Dean

Field Engineer, Buoyant

Tom Dean started programming BASIC on Apple IIs over 40 years ago, and has been hooked on tech since then. A long-time user of Linux and Open Source, he has been expanding his Cloud, Cloud Native and adjacent subject matter knowledge to become a more well-rounded technologist, and...

Operations + Performance

Content Experience Level Intermediate

11:55am MST

From Chaos to Calm: Building a Unified and Scalable CI/CD Pipeline at Akamai - Tomer Patel, Akamai Technologies Inc.

Thursday November 14, 2024 11:55am - 12:30pm MST

Salt Palace | Level 2 | 250 AD

Are you struggling with a chaotic development process? Join Akamai's talk and discover how we built a unified and scalable CI/CD pipeline, saving 40% of our QA, Performance, Dev, and Ops daily work, and how you can do that in your organization! This session dives into the architecture, key features, and its impact on development efficiency. You will learn how to:
- Conquer cloud-native deployments by adding the right tools, such as Argo Rollouts and Backstage.
- Integrate CI/CD tools (ArgoCD, Jenkins, DevSpace, Grafana, Prometheus, Thanos) for a smoother workflow.
- Leverage best-of-breed, cost-efficient open-source solutions.

Speakers

Tomer Patel

Senior Engineering Manager, Akamai Technologies Inc.

Tomer currently works as Senior Engineering Manager at Akamai Technologies, where he leads a group of Data engineers, Software developers and DevOps at scale. Previously Tomer worked as Team Lead at Clarizen (Now Planview).

From Chaos to Calm Building a Unified and Scalable CI CD Pipeline at Akamai pdf

SDLC

Content Experience Level Intermediate

11:55am MST

What Agent to Trust with Your K8s: Falco, Tetragon or KubeArmor? - Henrik Rexed, Dynatrace

Thursday November 14, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | 151 G

In the CNCF landscape we have plenty of eBPF-based security solutions that help us protect our K8s clusters from runtime vulnerabilities. On paper, though, Falco, Tetragon, and KubeArmor look very similar, and eventually you have to choose the one that best fits your needs. Join this session for additional insights to help you decide. We have run extensive benchmarks against these three solutions and will answer the following questions that came out of our testing:
- What are the different feature sets?
- What is the performance impact of each agent?
- Which privileges does each solution need?
- What are the pros and cons across the three options?

Speakers

Henrik Rexed

Cloud Native Advocate, Dynatrace

Henrik is a Cloud Native Advocate at Dynatrace, the leading Observability platform. Prior to Dynatrace, Henrik worked for more than 15 years as a Performance Engineer. He is also one of the organizers of the WOPR and KCD Austria conferences and the owner of the YouTube...

falcovstetragonvskubearmor pdf

Security

Content Experience Level Intermediate

2:30pm MST

Unlocking Potential of Large Models in Production - Yuan Tang, Red Hat & Adam Tetelman, NVIDIA

Thursday November 14, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 2 | 255 E

The recent paradigm shift from traditional ML to GenAI and LLMs has brought with it a new set of non-trivial LLMOps challenges around deployment, scaling, and operations that make building an inference platform to meet all business requirements an unsolved problem. This talk highlights these new challenges along with best practices and solutions for building out large, scalable, and reliable inference platforms on top of cloud native technologies such as Kubernetes, Kubeflow, KServe, and Knative. Which tools help effectively benchmark and assess the quality of an LLM? What type of storage and caching solutions enable quick auto-scaling and model downloads? How can you ensure your model is optimized for the specialized accelerators running in your cluster? How can A/B testing or rolling upgrades be accomplished with limited compute? What exactly do you monitor in an LLM? In this session we will use KServe as a case study to answer these questions and more.

Speakers

Yuan Tang

Principal Software Engineer, Red Hat

Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he has led AI infrastructure and platform teams at various companies. He holds leadership positions in open source projects, including Argo, Kubeflow, and Kubernetes. He's also a maintainer and...

Adam Tetelman

Principal Product Architect, NVIDIA

Adam Tetelman is a principal architect at NVIDIA leading cloud native initiatives and CNCF engagements across the company; building inference platforms for NVIDIA AI Enterprise and DGX Cloud. He has degrees in computational robotics, computer & systems engineering, and cognitive science...

KubeCon NA 2024 Yuan and Adam Unlocking Potential of Large Models in Production.pptx 1 pdf

AI + ML

Content Experience Level Intermediate

2:30pm MST

How the Tables Have Turned: Kubernetes Says Goodbye to Iptables - Casey Davenport, Tigera & Dan Winship, Red Hat

Thursday November 14, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | 155 E

For decades, iptables has been the preferred packet filtering system in the Linux kernel. Used extensively across the Kubernetes networking ecosystem, iptables is now on the way out and is expected to be removed from the next generation of Linux distributions. With iptables past its prime, where does that leave Kubernetes? The successor to iptables -- nftables -- is ready to carry the torch instead, with a newly released beta kube-proxy implementation in v1.31 and network policy using Calico’s nftables backend. In this talk, Dan and Casey will share what they have learned building Kubernetes Service and NetworkPolicy implementations using nftables. They will cover the history and current status of iptables usage in Kubernetes, the capabilities and performance characteristics of Kubernetes networks running on nftables, and why eBPF may not be the right tool for the job.

Speakers

Casey Davenport

Casey Davenport, Tigera

Casey is a core developer on Calico and has been building Kubernetes networking systems since 2016.

Dan Winship

Senior Principal Software Engineer, Red Hat

Dan is a Tech Lead for Kubernetes SIG Network and has been working on Kubernetes and OpenShift networking at Red Hat since 2016.

how the tables have turned.pptx pdf

Connectivity

Content Experience Level Intermediate

2:30pm MST

Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster - Yuichiro Ueno & Toru Komatsu, Preferred Networks, Inc.

Thursday November 14, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | Grand Ballroom A

Today, storage technologies play a fundamental role in AI/ML. Read performance is essential for swiftly moving datasets from storage to AI accelerators. However, the rapid improvement in AI accelerator performance often outpaces I/O, which bottlenecks training. Because Kubernetes schedules pods across multiple nodes, using node-local storage effectively is a challenge. To address this, we introduce a distributed cache system built atop node-local storage and designed for AI/ML workloads. This cache system has been successfully deployed on our on-premises Kubernetes cluster of 1024+ GPUs in a multi-tenant environment. Over two years of operating this cache system, we have overcome numerous hurdles across several components, including the I/O library, load balancers, and the storage backend. We will share the challenges and the solutions we implemented, leading to a system delivering 50+ GB/s throughput and less than 2ms latency.

Speakers

Toru Komatsu

Engineer, Preferred Networks, Inc.

Toru is a machine learning platform engineer at Preferred Networks in Japan. He is the creator and lead developer of youki, an OCI Runtime in Rust, and a maintainer of the OCI Runtime Specification. Additionally, he serves as a reviewer for runwasi and is involved in developing a world that utilizes containers and Wasm. He is also a member of the Kubernetes org and is especially interested in...

Yuichiro Ueno

Engineer, Preferred Networks, Inc.

He is currently a machine learning platform engineer at Preferred Networks in Japan. His research and engineering interests include a range of high-performance computing (distributed deep learning, networking/RDMA, and storage technologies), performance engineering, and Kubernete...

KubeCon NA 2024 Distributed Cache Empowers AI ML Workloads on Kubernetes Cluster pdf

Data Processing + Storage

Content Experience Level Intermediate

2:30pm MST

Low-Overhead, Zero-Instrumentation, Continuous Profiling for OpenTelemetry - Christos Kalkanis, Elastic

Thursday November 14, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | Grand Ballroom B

Elastic has recently donated its whole-system continuous profiling agent to OpenTelemetry. After a thorough community review process, the donation was enthusiastically accepted. Leveraging eBPF, the profiling agent provides unprecedented visibility into the runtime behavior of all applications: it builds stacktraces that go from the kernel to userspace native code, all the way into code running in higher-level runtimes, enabling users to identify performance regressions, reduce wasteful computations, and debug complex issues faster. This session will explore:
- Benefits of eBPF-based continuous profiling compared to conventional approaches that rely on application instrumentation
- How the agent builds profiles that seamlessly span kernel, native code, and the most widely used application runtimes
- Integration with the rest of OpenTelemetry: OTLP and Collector

Speakers

Christos Kalkanis

Principal Software Engineer, Elastic

Christos is a principal engineer at Elastic, a maintainer for the OpenTelemetry Profiling SIG and a co-author of the donated OpenTelemetry profiling agent previously known as the Elastic Universal Profiling agent. After more than a decade of focusing on cybersecurity offense he moved...

Continuous Profiling OTel Christos Kalkanis pdf

Observability

Content Experience Level Intermediate

2:30pm MST

One Inventory to Rule Them All: Standardizing Multicluster Management - Corentin Debains, Google & Ryan Zhang, Microsoft

Thursday November 14, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | 155 B

Most Kubernetes users run more than one cluster, and some run hundreds or more. Crossing cluster boundaries has always been a challenge, because most Kubernetes APIs, tools, and operators are cluster-centric. In fact, there’s a remarkable lack of standard tools and patterns for multi-cluster. Over time users have found ways to stitch clusters together, but the community has been asking for standardization. To share multi-cluster tools, Kubernetes sig-multicluster has introduced the “ClusterProfile” API, a critical building block for multi-cluster capabilities. This API provides a canonical way for multicluster controllers and users to iterate over clusters, and to install or manage multi-cluster features. In this talk, we will look at some of the problems inherent to multi-clustering, explain the concepts introduced by this new API and look at implementations and consumers of it. We dive into real-life examples of patterns and usage, with products such as Kueue, Argo CD, and Argo Workflows.

Speakers

Ryan Zhang

Principal Software Engineering Manager, Microsoft

Dr. Ryan Zhang is a Principal Software Engineering Manager working in Azure Kubernetes Service at Microsoft. He received his Ph.D. from Rice University, specializing in Grid computing. With over 15 years of experience in software engineering, he has managed teams of software engineers...

Corentin Debains

Software Engineer, Google

Corentin Debains is a software engineer at Google working on the GKE Fleet (multicluster platform). He is an active member of Kubernetes’ special interest group sig-multicluster.

KubeconNA24 One Inventory to Rule Them All.pptx pdf

Operations + Performance

Content Experience Level Intermediate

2:30pm MST

Mastering Cell-Based Architecture: Practical Solutions and Best Practices - Shweta Vohra, Booking.com & Asanka Abeysinghe, WSO2

Thursday November 14, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 2 | 250 AD

Are you struggling to validate your cell boundaries or facing challenges with greenfield versus brownfield cell-based architectures (CBA)? Do you find it difficult to define enterprise-wide cell boundaries or wish there were best practices to guide you? If these pain points sound familiar, this session is tailored for you. In this talk, we will first guide you through the process of defining an enterprise-wide cell-based architecture for your organization or context. Then we will explore best practices for greenfield, brownfield, and hybrid cell implementations using CBA. By translating common user challenges into actionable implementation references, we aim to elevate your understanding of CBA with real-world use cases and best practices. This session will also cover best practices for the data, security, application, and infrastructure layers, ensuring a comprehensive approach to CBA implementation. Join us to take your knowledge of CBA to the next level!

Speakers

Shweta Vohra

Lead Architect, Booking.com

Shweta Vohra is an Architect, Author, and Inventor with over 20 years of experience in the software industry. Her expertise spans from complex embedded systems design to hybrid cloud-native solutions, and most recently, the creation of data and machine learning platforms. She is the...

Asanka Abeysinghe

CTO, WSO2

Asanka, WSO2's CTO, is a technology visionary with over 20 years of experience designing and implementing scalable distributed systems, microservices, and business integration solutions. He advances WSO2's corporate reference architecture, collaborates with customers and industry...

CBA KubeConNA 2024 V1.0 pdf

SDLC

Content Experience Level Intermediate

2:30pm MST

From Standards to Practice: The Journey to Container Maturity - Carmen Chow & Thomas Robinson, Yelp

Thursday November 14, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | 151 G

Yelp runs tens of thousands of Docker containers in Kubernetes. How do we track their vulnerabilities, baseline their security needs, and prioritize our most critical findings? Security standards change constantly, so we need a robust model of container maturity to guide our adoption of these standards in a way that addresses Yelp’s specific needs and risk tolerance. Finally, to maximize our model’s value, over 1,000 engineers must understand its practical guidance well enough to apply it to their daily work. This talk covers designing and incorporating a container maturity model into Yelp’s development lifecycle, along with our strategy for proactively improving our security posture. We believe our experiences will assist others in creating similar models that work for their organizations, help evaluate and assess risks to their own containers, and drive next steps towards future risk evaluation platforms.

Speakers

Carmen Chow

Software Engineer, Yelp

Carmen Chow is a Software Engineer on Yelp’s Infrastructure Security team, where she has worked on cost modeling, data lifecycle tools, and Kubernetes observability. Previously, she was an infrastructure developer responsible for containerizing services and migrating them to Kubernetes...

Thomas Robinson

Software Engineer, Yelp

Tom is a software engineer living near Seattle, Washington. Having previously worked in security research and antivirus software, he's spent the last decade helping keep Yelp secure.

Journey to Container Maturity pdf

Security

Content Experience Level Intermediate

3:25pm MST

Kubernetes Multi-Cluster Networking 101 - Niranjan Shankar, Microsoft & Ram Vennam, Solo.io

Thursday November 14, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 1 | 155 E

You’ve (somewhat) grasped the networking model of a single Kubernetes cluster. But how do you enable Pods to communicate across clusters? How do service discovery and DNS work for a multi-cluster setup? How do you secure inter-cluster traffic and manage certificates? Not sure? Don’t worry - this session will have the answers. We’ll start by outlining the core requirements for workloads to communicate across clusters. You’ll then learn some common multi-cluster networking topologies, like flat and multi-network setups, and how inter-cluster connectivity and IP address management differ for each of them. Finally, we’ll cover some popular tools for managing and securing traffic between clusters, like service meshes, CNIs, and gateways, and discuss their use cases. You’ll leave this session with a solid understanding of fundamental terms and concepts - like virtual network peering, external DNS, trust domains, etc. - needed for navigating the multi-cluster networking landscape.

Speakers

Ram Vennam

Solutions Engineer, Solo.io

Ram Vennam is the Director of Solutions Engineering at Solo.io where he helps companies design and build highly scalable, resilient, distributed systems with the latest cloud-native technology. Previously, he was at IBM where he was a Technical Product Manager and Developer Advocate...

Niranjan Shankar

Senior Software Engineer, Microsoft

Niranjan Shankar is a senior software engineer at Microsoft working on the Istio-based service mesh add-on for Azure Kubernetes Service (AKS). He has experience with multi-cluster operations, edge traffic management and security, GitOps-based patterns, and policy enforcement with...

Kubernetes Multi Cluster Networking 101 pdf

Connectivity

Content Experience Level Intermediate

3:25pm MST

Elastic Data Streaming: Autoscaling Apache Kafka - Jakub Scholz, Red Hat

Thursday November 14, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 1 | Grand Ballroom A

Autoscaling is an important part of modern cloud-native architecture. It allows applications to handle a big load at peak times while helping to optimize costs and make deployments more green and sustainable at the same time. Apache Kafka is well known for its scalability. It can grow with your project from a small cluster up to hundreds of brokers. But it was not very elastic for a long time and using dynamic autoscaling with it was very hard. This talk will guide the attendees through the main challenges of auto-scaling Apache Kafka on Kubernetes. It will show how these challenges can be solved with the help of new features added recently in Strimzi and Apache Kafka projects such as auto-rebalancing, node pools, or tiered storage. And it will help the users get started with the auto-scaling of Apache Kafka.

Speakers

Jakub Scholz

Senior Principal Software Engineer, Red Hat

Jakub works at Red Hat as Senior Principal Software Engineer. He has long-term experience with messaging and currently focuses mainly on Apache Kafka and its integration with Kubernetes. He is one of the maintainers of the Strimzi project which provides tooling for running Apache...

Elastic Data Streaming Autoscaling Apache Kafka pdf

Data Processing + Storage

Content Experience Level Intermediate

3:25pm MST

Load-Aware GPU Fractioning for LLM Inference on Kubernetes - Olivier Tardieu & Yue Zhu, IBM

Thursday November 14, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 2 | 255 B

As the popularity of Large Language Models (LLMs) grows, LLM serving systems face challenges in efficiently utilizing GPUs on Kubernetes. In many cases, dedicating an entire GPU to a small or unpopular model is wasteful; however, understanding the relationship between request load and resource requirements has been difficult. This talk will study GPU compute and memory requirements for LLM inference servers, like vLLM, revealing an analytical relationship between key configuration parameters and performance metrics such as throughput and latency. This novel understanding makes it possible to decide at deployment time an optimal GPU fraction based on the model's characteristics and estimated load. We will demo an open-source controller capable of intercepting inference runtime deployments on Kubernetes to automatically replace requests for whole GPUs with fractional requests using MIG (Multi-Instance GPU) slices, increasing density and hence LLM sustainability without sacrificing SLOs.

Speakers

Olivier Tardieu

Principal Research Scientist, Manager, IBM

Dr. Olivier Tardieu is a Principal Research Scientist and Manager at IBM T.J. Watson, NY, USA. He joined IBM Research in 2007. His current research focuses on cloud-related technologies, including Serverless Computing and Kubernetes, as well as their application to Machine Learning...

Yue Zhu

Staff Research Scientist, IBM Research

Dr. Yue Zhu is a Staff Research Scientist at IBM Research specializing in foundation model systems and distributed storage systems. Yue obtained a Ph.D. in Computer Science from Florida State University in 2021 and has consistently contributed to sustainability for foundation models...

AutoFit pdf

Emerging + Advanced

Content Experience Level Intermediate

3:25pm MST

Measuring All the Costs with OpenCost Plugins - Alex Meijer, Stackwatch

Thursday November 14, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 1 | Grand Ballroom B

The CNCF OpenCost project is approaching 5,000 stars on GitHub and has become one of the most popular cost monitoring systems in use. Originally focused on cloud provider and Kubernetes cost monitoring, OpenCost expanded its scope in May 2024 by launching OpenCost Plugins with Datadog as the first reference implementation. These plugins allow users to measure and visualize virtually any cost in OpenCost, without writing a single line of OpenCost code. Alex Meijer, OpenCost and OpenCost Plugins maintainer, will speak on how the OpenCost Plugins ecosystem works and will dive into the use of the open-source FOCUS spec in OpenCost, which is the key to being able to measure nearly any cost. A plugin-enabled OpenCost deployment will be demoed, with an external cost (Datadog) visualized alongside the traditional Kubernetes and cloud provider costs. Alex will also share how to get started with plugins so that users can start analyzing the costs of whatever matters to their unique use case!

Speakers

Alex Meijer

Staff Software Engineer, Stackwatch

Alex Meijer has been working with Kubernetes for his entire career, at various times as a user, as an operator, and currently as someone working to help others use Kubernetes better. He has served in startups ranging in size from 5 to 90 people. Alex contributes to the OpenCost project...

Measuring All The Costs w OC Plugins pdf

Observability

Content Experience Level Intermediate

3:25pm MST

From Chaos to Harmony, Transforming ML Engineering: A Kubernetes Adoption Journey - Paris Nakita Kejser, JP Politikens Hus

Thursday November 14, 2024 3:25pm - 4:00pm MST

Salt Palace | Level 1 | Grand Ballroom H

How Ekstra Bladet’s Data Science team went from a small team of ML engineers, who needed to deliver quickly without deep technical infrastructure knowledge, to a rigid and proprietary ML pipeline built from AWS components and triggered by a large and chaotic Infrastructure as Code project. This made it difficult to achieve freedom and required a lot of work to implement and debug. One of the key reasons for adopting Kubernetes for our ML team emerged when we realized that we should serve all stakeholders across the JP/Politikens Hus organization, not just Ekstra Bladet. We then chose Kubernetes as our container infrastructure, which transformed the ML team into a dynamic ML ecosystem with great freedom under responsibility.

Initially, we focused on building robust frameworks for training ML models and deploying them as API services. Today, our ML team operates at the forefront of innovation, where we embrace GitOps principles to streamline our machine learning platform. Through careful adoption of advanced techniques such as autoscaling, scheduling, event triggers, and dynamic service deployment, we ensure seamless integration of new ML models into our infrastructure. This evolution has allowed us to effectively meet our diverse needs, while maintaining agility and scalability in our ML operations.

Speakers

Paris Nakita Kejser

Cloud Engineer, JP | Politiken Media Group

As a certified Cloud Engineer specializing in AWS and Kubernetes, I'm integral to Ekstra Bladet’s Data Science team. My focus lies in optimizing cloud infrastructure, integrating AWS and Kubernetes setups, and driving technological advancements. I contribute to Ekstra Bladet's digital...

From Chaos to Harmony, Transforming ML Engineering A Kubernetes Adoption Journey pdf

Platform Engineering

Content Experience Level Intermediate

4:30pm MST

Microsegment Your Network Like Mastercard with AdminNetworkPolicy - John Zaiss & Daniel Ruggeri, Mastercard & Surya Seetharaman, Red Hat

Thursday November 14, 2024 4:30pm - 5:05pm MST

Salt Palace | Level 1 | 155 E

Do you manage Kubernetes clusters and need to enforce airtight workload security on a cluster-wide level? This is vital in the Financial Services industry to comply with the PCI Data Security Standard. Mastercard was looking for a built-in Kubernetes solution enabling admins to govern network access between workloads at scale. While exploring different options, they found namespace-scoped NetworkPolicies but wanted to avoid duplicating policies for each namespace. When Kubernetes SIG-Network added AdminNetworkPolicies in v1.25, Mastercard found what they needed! In this session, we will introduce AdminNetworkPolicy and demonstrate applying granular, non-overridable network controls on a live cluster for multi-tenant isolation. Join us to learn how Mastercard is securing microservices in production based on the principle of least privilege and zero trust. We will also share our operational challenges and lessons learnt. Attendees will gain actionable strategies to secure clusters.

Speakers

Daniel Ruggeri

Distinguished Engineer, Mastercard

Daniel is a Distinguished Software Engineer at Mastercard and an Open Source evangelist. Responsible for setting the direction of Mastercard regarding the Web, Cloud, and infrastructure automation space, he spends his days and nights playing with infrastructure and the code that powers...

John Zaiss

Principal Software Engineer, Mastercard

As a Principal Engineer, John brings extensive expertise in Kubernetes, automation, cloud identity architecture, server architecture, VMware ESX, mobile device management, and IT strategy. He is a seasoned information technology professional with a BS in Cybersecurity and an MS in...

Surya Seetharaman

Principal Software Engineer, Red Hat Inc.

Microsegment Your Network Like Mastercard with AdminNetworkPolicy pdf

Connectivity

Content Experience Level Intermediate

4:30pm MST

Per-Node Api-Server Proxy: Expand the Cluster's Scale and Stability - Weizhou Lan & Iceber Gu, DaoCloud

Thursday November 14, 2024 4:30pm - 5:05pm MST

Salt Palace | Level 1 | 155 B

In many CNCF projects, DaemonSet pods on every node synchronize data from the API server simultaneously. Especially in large-scale clusters, this creates significant pressure on the API server, burdens the network, and can even affect the stability of the cluster. Some projects have implemented optimizations to address this. For instance, Cilium aggregates endpoint information into the CRD CiliumEndpointSlice before distributing it to its daemonset. However, many projects have not yet adopted such data aggregation optimizations, and there is currently no project that helps improve communication between all components and the API server. Clusterpedia supports launching per-node API server proxies that serve all local pods and uses eBPF to resolve the API server's clusterIP to the local proxy, which transparently redirects API server access on demand. In large-scale clusters, this can significantly improve the stability of all of the cluster's services.

Speakers

Iceber Gu

Software Engineer, DaoCloud

Senior open source enthusiast, focused on cloud runtime, multi-cloud and WASM. I am a CNCF Ambassador and founded Clusterpedia and promoted it as a CNCF Sandbox project. I also created KasmCloud to promote the integration of WASM with Kubernetes and contribute it to the WasmCloud...

Weizhou Lan

Senior Tech Lead, Daocloud

Weizhou Lan has 13+ years of engineering experience and has been engaged in Kubernetes since 2018. He is a senior tech lead at DaoCloud focusing on private cloud, a speaker at KubeCon NA/EU and KCD China, a Program Committee Member for KubeCon, the initiator and maintainer of the CNCF sandbox project...

Per Node Api Server Proxy Expand The Cluster Scale And Stability pdf

Operations + Performance

Content Experience Level Intermediate

4:30pm MST

Mish-Mesh: Abusing the Service Mesh to Compromise Kubernetes Environments - Hillai Ben-Sasson & Nir Ohfeld, Wiz

Thursday November 14, 2024 4:30pm - 5:05pm MST

Salt Palace | Level 1 | 151 G

Service mesh solutions are common components in almost every large Kubernetes environment. Many engineers and security teams have adopted solutions like Linkerd and Istio to better segment and isolate their Kubernetes networks. In this talk, we will demonstrate how we were able to exploit common misconfigurations and insecure features in popular service mesh solutions, to escalate low-severity vulnerabilities to critical service takeovers. Our real-life examples include several major cloud service providers, where these vulnerabilities allowed us to gain unauthorized access to internal systems and sensitive secrets. This talk will help engineers understand whether their service mesh deployment acts as a proper security barrier, and how to make sure that it does. Security teams – both attackers and defenders – will learn new techniques for hacking Kubernetes environments, and how to properly defend against them.

Speakers

Hillai Ben-Sasson

Nir Ohfeld

Security Researcher, Wiz

Nir Ohfeld is a 25-year-old senior security researcher at Wiz. Ohfeld focuses on cloud-related security research and specializes in research and exploitation of cloud service providers, web applications, application security, and in finding vulnerabilities in complex high-level systems...

Security

Content Experience Level Intermediate

11:00am MST

Better Together! GPU, TPU and NIC Topological Alignment with DRA - John Belamaric, Google & Patrick Ohly, Intel

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 2 | 250 AD

AI/ML workloads on Kubernetes demand ultra-high performance. If your training or multi-GPU inference job spans nodes, your GPUs will use the network, talking through a NIC over local PCIe. But not all NICs are equal! To get the best performance, you need a NIC which is as "close" to the GPU as possible. Unfortunately, the Kubernetes extended resources API does not have enough information and does not give you control over which specific devices are assigned. Dynamic Resource Allocation, the successor API, gives you this power. Come to this session to learn about DRA, how it is improving overall device support in K8s, and how to use it to allocate multiple GPUs, NICs, and TPUs to get the maximum performance out of your infrastructure.

Speakers

Patrick Ohly

Principal Engineer, Intel

Patrick Ohly is a software engineer at Intel GmbH, Germany. In the past he has worked on performance analysis software for HPC clusters ("Intel Trace Analyzer and Collector") and cluster technology in general (PTP and hardware time stamping). Since January 2009 he has worked for Intel...

John Belamaric

Senior Staff Software Engineer, Google

John is a Sr Staff SWE, co-chair of K8s SIG Architecture and of K8s WG Device Management, helping lead efforts to improve how GPUs, TPUs, NICs and other devices are selected, shared, and configured in Kubernetes. He is also co-founder of Nephio, an LF project for K8s-based automation...

[PUBLIC] 2024 KubeCon NA Better Together! GPU, TPU and NIC Topological Alignment with DRA pdf

AI + ML

Content Experience Level Intermediate

11:00am MST

Securing Outgoing Traffic: Building a Powerful Internet Egress Gateway for Reliable Connectivity - Edie Yang & Akshita Agarwal, Airbnb

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 1 | 155 E

Concerned about secure and reliable outgoing traffic from your organization's mesh network? With the increasing demand to use external vendor APIs for LLMs, along with vulnerabilities like Log4j, the need for preventing data exfiltration and maintaining strong safeguards is critical. But managing access to multiple external domains within the service mesh can be daunting. Discover the secrets behind building a powerful Internet Egress gateway using Istio and Envoy. This talk presents a way to define fine-grained access policies to monitor and audit outgoing traffic from your mesh network. It also demonstrates how to build a generic multi-tenant gateway that can be used across heterogeneous services and save years of repeated engineering work. By the end of the talk, attendees will gain an understanding of what an Internet Egress Gateway is, why it is necessary, and how they can configure it for their own services using the open-source Istio/Envoy based solution.

Speakers

Akshita

Senior Software Engineer, Airbnb

Akshita is a Senior Software Engineer at Airbnb working in the Service Mesh team, which handles interservice networking at scale. She is currently focused on designing a secure network edge solution at Airbnb. Previously she worked at Microsoft developing the Nginx Load Balancer...

Edie Yang

Senior Software Engineer, Airbnb

Edie is a Senior Software Engineer at Airbnb on the Cloud Infrastructure team which develops the Service Mesh system that powers the entire Airbnb stack. Edie has been working on developing service mesh API, service migration automation, Google IAP-based ingress gateway and internet...

Connectivity

Content Experience Level Intermediate

11:00am MST

How We Scale a Distributed SQL Database to 1 PB - Jinpeng Zhang, PingCAP

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 1 | Grand Ballroom A

TiDB is a distributed SQL database that we built to solve the scalability problems of traditional SQL databases such as MySQL and PostgreSQL. Using TiDB, users do not need to shard their data across multiple MySQL or PostgreSQL database instances, nor do they need to sacrifice key database features such as JOIN and transactions. Users only need to add storage nodes and computing nodes to the cluster as needed. However, we also encountered many scalability challenges when building TiKV, the stateful storage layer of TiDB. These include workload skew that makes it difficult to scale performance, the management of millions of dynamic data partitions, latency impact during scaling, and interference between workloads when multiple workloads are consolidated into the same cluster. In this talk, I will provide an in-depth look at these challenges and our solutions.

Speakers

Jinpeng Zhang

Director of Engineering, PingCAP

Director of Engineering at PingCAP, TiKV maintainer and committer, RocksDB contributor, the author of "MariaDB Principles and Implementation". Mainly engaged in the design and development of cloud-native large-scale distributed storage systems, data platforms, 10+ years of experience...

Data Processing + Storage

Content Experience Level Intermediate

11:00am MST

Upgrade Safely: Avoid the Pitfalls of Kubernetes Versioning - Rob Scott, Google

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 2 | 254 B

Have you ever upgraded a cluster or controller only to realize everything was broken due to some kind of versioning mismatch? Do you remember the pain of upgrading to a new Kubernetes API version like Ingress v1? Do you get a little twinge any time you see a feature or API deprecated in release notes? This is the talk for you. Kubernetes versioning is surprisingly complex and widely misunderstood. This talk will cover all the relevant versioning concepts, from storage versions to feature gates. It will show how they interact with each other, and how you can use this information to safely and confidently upgrade your clusters and controllers. This talk will provide real examples of how versioning mixups can lead to broken clusters and downtime. You’ll learn exactly how you can avoid each of these potential failure modes, and gain some insights into how API and Controller authors are trying to minimize the impact of these kinds of changes in the future.

Speakers

Rob Scott

Software Engineer, Google

Rob is an open source enthusiast currently working on Kubernetes Networking at Google. He's been a maintainer of Gateway API since the very early days of the project and led the development of other Kubernetes networking APIs like EndpointSlices.

KubeCon SLC Upgrade Safely Avoid the Pitfalls of Kubernetes Versioning pdf

Operations + Performance

Content Experience Level Intermediate

11:00am MST

Share the Ride: Robust Multi-Tenancy in Kubernetes at Uber - Sashank Appireddy & Apoorva Jindal, Uber

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 2 | 251 AD

Multi-tenancy in Kubernetes involves the coexistence of multiple users or teams (tenants) on a single Kubernetes cluster while ensuring isolation, security, and performance. Our use cases at Uber span from scenarios with disruptive neighbors to those with large container sizes, specialized hardware, sticky placement preferences, and dynamic resource scaling demands, necessitating robust isolation measures. In this talk, we present a comprehensive exploration of multi-tenancy in Kubernetes, covering strategies, the challenges we have faced, and the effective solutions implemented to overcome them at Uber. We will also dive deep into the key aspects of building and managing multi-tenant Kubernetes clusters by establishing strong tenant boundaries that build on node pools and integrate tightly with namespaces.

Speakers

Apoorva Jindal

Senior Staff Software Engineer, Uber Inc

Apoorva Jindal is working as Senior Staff Software Engineer at Uber. At Uber, he leads the Compute platform which powers all stateless and batch containerized workloads at Uber.

Sashank Reddy

Staff Software Engineer, Uber Technologies Inc

I am a software engineer with over a decade of experience specializing in containerization and distributed systems. As a Staff Software Engineer in the container platform team at Uber Technologies Inc, I lead the design, development and deployment of scalable multi-tenant architecture... Read More →

KubeCon2024 MultiTenancy pdf

Platform Engineering

Content Experience Level Intermediate

11:55am MST

Improving Service Availability: Scaling Ahead with Machine Learning for HPA Optimization - Avni Sharma & Estela Ramirez, Intuit

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 2 | 255 E

In this talk, we will explore employing machine learning (ML) algorithms to enhance Kubernetes autoscaling capabilities beyond the traditional, reactive Horizontal Pod Autoscaler (HPA). Attendees will learn how to use recommendation algorithms to predict future load and usage patterns, allowing for smarter, proactive scaling decisions. This approach not only ensures high availability and responsiveness of applications but also offers a pathway to substantial cost optimizations by preventing over-provisioning and minimizing resource wastage.

Speakers

Avni Sharma

Product Manager, Intuit

Avni is a Product Manager at Intuit, working on Intuit’s Modern SaaS Kubernetes platform. She also worked on Argo CD as a PM. Avni is passionate about developer tooling and strives to make developers' lives easier by delivering delightful experiences. She is also an Open Source... Read More →

Estela Ramirez

Software Engineer, Intuit Kubernetes Service, Intuit

Estela is a Software Engineer at Intuit focusing on Intuit Kubernetes Developer Platform. She works on abstracting the autoscaling for developers.

Improving Service Availability Scaling ahead pdf

AI + ML

Content Experience Level Intermediate
Presentation Slides Attached Yes

11:55am MST

Seeing Double? Implementing Multicast with eBPF and Cilium - Louis DeLosSantos, Isovalent at Cisco

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | 155 E

Multicast is a popular networking technology used in finance, telecommunications, and media CDNs, among others, to efficiently replicate and deliver data streams to multiple clients. However, this advantage can be overshadowed by the complexity involved in configuring the necessary infrastructure, leaving the overworked platform team, rather than the end users, seeing double. To combat this complexity, Cilium explored using eBPF to implement pod-to-pod multicast delivery within a Kubernetes cluster. This talk will provide both a high- and low-level understanding of how eBPF can be used to implement multicast delivery. It will discuss how Cilium’s multicast works and the hurdles faced by the project along the way. By the end of this talk, the audience will have a better understanding of how multicast functions, how eBPF can be used in place of traditional multicast infrastructure, and how Cilium can be used as a multicast-enabled CNI, letting your audience, and not you, see double.

Speakers

Louis DeLosSantos

Louis DeLosSantos, Isovalent at Cisco

Louis DeLosSantos is a multi-disciplined technologist who has worn network, systems, and software engineer hats at various times. Presently he works at Isovalent at Cisco where he focuses on Linux Kernel networking and implementing eBPF datapath networking solutions.

Seeing Double? Implementing Multicast with eBPF and Cilium pdf

Connectivity

Content Experience Level Intermediate

11:55am MST

Kubernetes on Multisites – A Story About Stateful App, Hybrid Clouds, and High Availability - Florian Coulombel, Dell Technologies & Jan Šafránek, Red Hat

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | Grand Ballroom A

The day has come! Kubernetes has won the hearts and minds of your leadership and entire organization, and everyone wants to benefit. Projects are launched to migrate legacy apps, run proprietary systems, and even use virtual machines in your Kubernetes infrastructure! But wait a minute. VMs and good ol' RDBMSs are not microservices developed with the 12 factors in mind, where data is either hosted on an external service or replicated by the application. How are we going to guarantee the availability of these applications and systems? Do I need to back these things up? What if my business is fragmented across edge, on-prem, and public clouds? Members from SIG Storage will guide you through the available options, including the latest CSI features, Kubernetes architecture design, and even hardware solutions. We will evaluate the benefits to consider and the pitfalls to avoid when implementing stateful workloads in Kubernetes on multiple sites.

Speakers

Jan Šafránek

Software Engineer, Red Hat

Jan is a Senior Principal Software Engineer at Red Hat working on storage aspects of Kubernetes. He started developing Kubernetes more than 8 years ago, and is one of the founding members of SIG-Storage. He’s the author of PersistentVolume controller, dynamic provisioning and StorageClass... Read More →

Florian Coulombel

Senior Software Engineer, Dell Technologies

Father of 2, living in France. Nerd since 1996 when Quake alpha version leaked, Linux user since 2001, Kubernetes enthusiast since 2016, member of Kubernetes SIG Storage since 2023.

KCNA24 k8s on multisites pdf

Data Processing + Storage

Content Experience Level Intermediate

11:55am MST

Love thy (Noisy) Neighbor: Strategies for Mitigating Performance Interference in Cloud-Native Systems - Jonathan Perry, PerfPod

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | 155 B

In cloud-native environments, application performance often degrades due to contention over shared resources such as CPU caches and memory bandwidth. Current container technologies lack mechanisms to isolate these resources, which compels operators to maintain low utilization by scaling out their deployments. This session explores strategies used by hyperscalers like Google, Microsoft, Facebook, and Alibaba to mitigate such performance interference. We will review their published methodologies, extracting key principles that could guide the development of a Kubernetes-native performance isolator. Participants will gain insights into the design trade-offs and operational impacts of these tools. Additionally, we will discuss integration strategies for deploying such isolators in existing Kubernetes environments, aiming to optimize resource utilization while preserving application performance.

Speakers

Jonathan Perry

Founder & CEO, PerfPod

Jonathan Perry is a maintainer of the OpenTelemetry eBPF network collector. His PhD research at MIT CSAIL focused on performance isolation in datacenter and cloud networks, aiming to enhance network efficiency and reduce latency. Jonathan founded Flowmill, where he developed eBPF-based... Read More →

Slides Kubecon NA'24 Love thy (Noisy) Neighbor pdf

Transcript and Slides Love thy (Noisy) Neighbor pdf

Operations + Performance

Content Experience Level Intermediate

11:55am MST

What Containerd 2.0 Means for You - Samuel Karp, Google

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 2 | 254 B

containerd 2.0 is the first major new version of containerd since 1.0.0 was released in 2017. This new version of containerd introduces new features, new extension points, and new backends for image operations and CRI with the goal of increased flexibility and better efficiency for certain types of workloads. containerd 2.0 also removes some previously-deprecated features in favor of modern replacements. This talk will discuss how to prepare for containerd 2.0 in your production environments, including strategies for incorporating containerd 2.0's new functionality and detecting/remediating any impact of removed features prior to upgrading.

Speakers

Samuel Karp

Staff Software Engineer, Google

Samuel Karp is a containerd maintainer and a Staff Software Engineer at Google, focused on the container runtime for Google Kubernetes Engine. Sam has been involved in the container ecosystem since 2014 and serves as the Chair of the Open Container Initiative's Technical Oversight... Read More →

[Export] What containerd 2.0 means for you pdf

Operations + Performance

Content Experience Level Intermediate

11:55am MST

Still Don't Do What Charlie Don't Does - Making CRD Changes Safer - Nick Young, Isovalent

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 2 | 251 AD

Many Kubernetes installations use controllers that include Custom Resource Definitions (CRDs) to extend their capabilities. However, because CRDs can only have one version installed in a cluster at any one time, version and change management can be very difficult. This talk will benefit both controller implementers and users. For implementers, I have tips on how to make API changes to their CRDs more safely, and for CRD users, some tips on what to look out for when installing CRD updates. All of this is based on experience from projects like Contour, Gateway API, and Cilium, among others. Learn things like:
- Different CRD version management strategies - what’s worked and what hasn’t
- How to make schema changes like pluralizing a field or changing field validation in a safe way
- How not to make the same mistakes I did
Expect to come away from this talk having learned from my painful experiences handling CRD changes badly, but also having heard a bunch of Simpsons references.

Speakers

Nick Young

Senior Software Engineer, Isovalent at Cisco

Nick has been working to prevent the entropic downfall of systems for 25 years, across datacenters, clouds, networking, and others. He's a Staff Engineer at Isovalent, and a maintainer on the Kubernetes Gateway API project, where he works on improving the ingress and mesh experiences... Read More →

Still Don't Do What Charlie Don't Does.pptx pdf

Platform Engineering

Content Experience Level Intermediate

11:55am MST

Rogue No More: Securing Kubernetes with Node-Specific Restrictions - Anish Ramasekar, Microsoft & James Munnelly, Apple

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | 151 G

Did you know that a component running across multiple nodes, such as in a daemonset, intended to perform node-specific actions, can pose a significant security risk? If any node the component is running on goes rogue, it can lead to attacks on the cluster, or even worse, a complete takeover of it. What if we could restrict the component's ability to write resources only to those belonging to the node it is running on to prevent such escalation attacks? In this talk, Anish and James will introduce new Kubernetes security enhancements to bound service account tokens, which can be used with validating admission policies to enforce per-node restrictions on service accounts. This session will provide you with practical implementation guidelines and show you how these enhancements can mitigate risks and protect your infrastructure with robust node isolation.

Speakers

James Munnelly

Staff Field Engineer, Apple

Anish Ramasekar

Principal Software Engineer, Microsoft

Rogue No More Securing Kubernetes with Node Specific Restrictions pdf

Security

Content Experience Level Intermediate

2:00pm MST

Bloomberg’s Journey to Improve Resource Utilization in a Multi-Cluster Platform - Yao Weng & Leon Zhou, Bloomberg

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 2 | 250 AD

Bloomberg provides an on-premises Data Science Platform (DSP) using cloud-native software to support internal AI model training. It runs on Kubernetes clusters spanning multiple data centers and featuring a diverse range of GPU types. However, managing such a large-scale and heterogeneous GPU environment poses many challenges, such as improving resource utilization, reducing operational costs, and scheduling workloads across different GPU types. In collaboration with the Karmada community, Bloomberg's DSP team has aimed to tackle these challenges by addressing multi-cluster batch job management problems. This talk will delve into the approaches the team has adopted, including:
- Intelligently scheduling GPU workloads across multiple clusters
- Using Karmada's resource interpreter to support Kubernetes Custom Resource Definitions (CRDs) on top of a multi-cluster architecture
- Building a highly available Karmada control plane
- Establishing a consistent training job submission interface

Speakers

Leon Zhou

Software Engineer, Bloomberg

Leon Zhou is a software engineer on the Data Science Platform engineering team at Bloomberg. With prior NLP experience, he is now building ML platforms to facilitate machine learning development. He is interested in ML infrastructure to enable large-scale training and complex pipelines... Read More →

Yao Weng

Senior Software Engineer, Bloomberg

Yao Weng is a Senior Software Engineer on Bloomberg’s Data Science Platform engineering team. She has contributed extensively to optimizing the company’s Kubernetes environment for high performance compute, model inference, and workflow orchestration. Yao Weng obtained her Ph.D... Read More →

Kubecon NA 2024 Slides Bloomberg's Journey to Improve Resource Utilization in a Multi Cluster Platform pptx

AI + ML

Content Experience Level Intermediate

2:00pm MST

Testing Kubernetes Without Kubernetes: A Networking Deep Dive - John Howard, Solo.io

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 1 | 155 E

There are few things more tedious than waiting for a long end-to-end test to run: waiting for a new cluster to spin up and for images to build and push, not to mention debugging or running on slow internet connections. Unfortunately, these complex setups are hard to avoid, especially if we are testing things deeply integrated into Kubernetes networking, such as CNIs, kube-proxy, service meshes, and more. It doesn't have to be this way! In this talk, I will give a deep dive on how we built out our testing strategy for our Kubernetes networking proxy to not really depend on Kubernetes (or Docker, or root). In doing so, I will not only offer a glimpse behind the scenes of Istio development, but also give viewers a deeper understanding of how the fundamentals of Kubernetes (Linux primitives like namespaces) work, and how they can be effectively used to improve tests in the Istio ecosystem and beyond.

Speakers

John Howard

John Howard, Solo.io

John Howard is a Senior Architect at Solo.io and Istio Technical Oversight Committee member.

KubeCon NA 24 Testing Kubernetes without Kubernetes pdf

Connectivity

Content Experience Level Intermediate

2:00pm MST

Object Storage Is All You Need - Justin Cormack, Docker

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 1 | Grand Ballroom A

When Jeff Bezos commissioned Amazon S3 he called it "malloc for the web"; since then many people have considered cloud object storage to be a weird kind of non-POSIX filesystem, but also a great backing store for websites or for storing lots of data. Recently, more and more applications are being built with object storage as the entire persistence layer. This started with analytics databases such as Snowflake and Databricks, and the open source Delta Lake and Apache Iceberg projects. More recently, the use is spreading to even more applications, from observability to streaming data and more. In this talk we look at why it is becoming so popular, the benefits, downsides and performance characteristics, and how and when to use it effectively.

Speakers

Justin Cormack

CTO, Docker

Justin is the CTO of Docker, recently a member of the CNCF TOC, and has been working in the container ecosystem and in supply chain security for many years.

Data Processing + Storage

Content Experience Level Intermediate

2:00pm MST

Faster Containerized LLM Serving via Knowledge Sharing - Junchen Jiang, University of Chicago & Zhou Sun, Mooncake Labs

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 2 | 255 B

Imagine if, once an LLM learns something from a document, that knowledge could be instantly shared with other LLMs. Unfortunately, today, LLMs must read the same document multiple times, causing a significant slowdown. This session will introduce a new knowledge-sharing system that enables LLMs to share their digested knowledge, in the form of KV caches, so only one LLM needs to process each document. The key challenge is how to store the KV caches cheaply and serve them quickly. Instead of keeping the KV caches of all reusable chunks in GPU/CPU memory, we show in a demo that, with careful implementation on Kubernetes, storing them on cheaper devices is not only economically superior but also delivers significant reductions in LLM serving delay, especially the time to first token.
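
As a rough illustration of the idea in this abstract (not the speakers' system), the sketch below keys each document chunk by a content hash so that the expensive prefill step runs once and later requests reuse the stored result. `SharedKVStore`, `get_or_compute`, and `fake_prefill` are hypothetical names; the in-memory dict stands in for the cheaper storage tier the abstract mentions, and the byte string stands in for a real KV cache.

```python
import hashlib
from typing import Callable, Dict


class SharedKVStore:
    """Hypothetical content-addressed store for precomputed KV caches."""

    def __init__(self) -> None:
        # Assumption: a dict stands in for a cheaper, slower storage device.
        self._blobs: Dict[str, bytes] = {}
        self.prefill_calls = 0

    @staticmethod
    def key_for(chunk: str) -> str:
        # Identical chunk text always maps to the same key.
        return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

    def get_or_compute(self, chunk: str, prefill: Callable[[str], bytes]) -> bytes:
        key = self.key_for(chunk)
        if key not in self._blobs:
            # Cache miss: pay the prefill cost once and keep the result.
            self._blobs[key] = prefill(chunk)
            self.prefill_calls += 1
        return self._blobs[key]


def fake_prefill(chunk: str) -> bytes:
    # Placeholder for the model's prefill pass; not a real KV cache.
    return chunk.upper().encode("utf-8")


store = SharedKVStore()
first = store.get_or_compute("quarterly report text", fake_prefill)
second = store.get_or_compute("quarterly report text", fake_prefill)
```

In this toy run the second call is served from the store, so `prefill_calls` stays at one. A real deployment would also have to weigh load time from storage against recomputation time, which is the trade-off the abstract highlights.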

Speakers

Junchen Jiang

Professor, University of Chicago

Junchen Jiang is an Assistant Professor of Computer Science at the University of Chicago. He works at the intersections between networked systems and machine learning. He received his Ph.D. from CMU in 2017 and his bachelor’s degree from Tsinghua in 2011. He has received a Google... Read More →

Zhou Sun

CEO, Mooncake Labs

Mooncake Labs is working on the next generation of stateless data architecture, bringing database performance and functionality to structured and unstructured data in datalakes and raw datasets. Previously I led the query team at SingleStore (cloud-native distributed HTAP database... Read More →

KCNA24 Talk Slides pdf

Emerging + Advanced

Content Experience Level Intermediate

2:00pm MST

Supercharge Your Kubernetes Autoscaling with Custom Metrics - Vamshi Krishna Samudrala & Sravan Akinapally, American Airlines

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 1 | 155 B

Out of the box, Kubernetes provides native horizontal scaling capabilities driven by conventional resource consumption signals like CPU and memory utilization. However, in the real world, numerous applications demand dynamic scaling orchestrated by custom business telemetry such as queue depths, throughput volumes, or other domain-specific indicators. This session will unravel the secrets of extending Kubernetes' Horizontal Pod Autoscaler (HPA) to leverage custom metrics as scaling triggers, unlocking unprecedented scaling autonomy. Attendees will witness live demos showcasing:
- Deploying a custom metrics provider to expose application-centric metrics to the Kubernetes control plane
- Configuring the HPA to consume these custom metrics for intelligent scaling decisions
- A sample application dynamically scaling based on a custom metric like queue length or requests per second
- Best practices for crafting bespoke scaling policies tailored to custom metrics
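
For context on how a custom metric turns into a replica count, the Kubernetes documentation describes the HPA's core calculation as the current replica count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces only that arithmetic; `desired_replicas` is a hypothetical helper, and the real controller additionally applies min/max bounds, stabilization windows, and handling for unready pods.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
) -> int:
    """Approximate the documented HPA formula for a single metric.

    current_value and target_value are per-pod averages, for example
    messages waiting in a queue per pod (an assumed custom metric).
    """
    ratio = current_value / target_value
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 pods, each seeing 30 queued messages against a target of 10 per pod.
scale_up = desired_replicas(4, 30, 10)
# 4 pods, each seeing 5 queued messages against a target of 10 per pod.
scale_down = desired_replicas(4, 5, 10)
```

With these inputs the first call asks for 12 replicas and the second for 2, which shows why the target value chosen for a custom metric matters as much as the metric itself.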

Speakers

Vamshi Krishna Samudrala

Enterprise Cloud Architect, American Airlines

Enterprise Architect with a distinguished career spanning 14 years in the fields of DevOps and Cloud Architecture. Focused on automation, configuration management and innovation with cutting-edge technologies. Worked extensively with leading cloud service providers, including Amazon... Read More →

Operations + Performance

Content Experience Level Intermediate

2:00pm MST

Micro-Segmentation and Multi-Tenancy: The Brown M&Ms of Platform Engineering - Jim Bugwadia, Nirmata & Rachael Wonnacott, Fidelity International

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 1 | Grand Ballroom H

A key requirement for internal developer platforms is that they serve multiple workloads. The reality of platform engineering is that while it seeks to lower the barrier to entry for teams to deliver applications, it must also balance cost and ensure appropriate levels of security. It’s therefore essential to consider how application components running on shared infrastructure are allowed to communicate with each other and weigh up the cost of each architecture. In industry, we have seen differing approaches to deploying Kubernetes to achieve these goals, from multiple single-tenant clusters through to shared clusters that deliver namespaces-as-a-service. Rachael and Jim will define the concepts of multi-tenancy and micro-segmentation for cloud native systems and explain why they are critical to success with platform engineering. They will also show real-world examples of how they can be implemented, and demonstrate full automation using best practices like GitOps and Policy as Code.

Speakers

Jim Bugwadia

Co-founder and CEO, Nirmata

Jim Bugwadia is a co-founder and the CEO of Nirmata, the Kubernetes policy and governance company. Jim is an active contributor in the cloud native community and currently serves as co-chair of the Kubernetes Policy and Multi-Tenancy Working Groups. Jim is also a co-creator and maintainer... Read More →

Rachael Wonnacott

Technical Product Owner, Kubernetes Platform, Fidelity International

Rachael has spent the last decade focused on platform engineering. She places a conscious emphasis on improving flow and is on the quest to smooth the application lifecycle for developers in the enterprise. With a background in astrophysics, Rachael brings her scientific approach... Read More →

KCNA24 Brown M&Ms of Platform Engineering Nov 7 2024 pdf

Platform Engineering

Content Experience Level Intermediate

2:00pm MST

The Missing Talk About API Versioning & Evolution in Your Developer Platform - Stefan Schimanski, Upbound & Sergiusz Urbaniak, Independent

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 2 | 251 AD

In the realm of developer platforms, individuals without extensive experience in the cloud-native ecosystem are now venturing into the creation of Kubernetes-based APIs. Tools like Crossplane are transforming every platform engineer into an API designer. Ten years in, the ecosystem still offers little guidance on Kubernetes versioning and API evolution in practice. A naive understanding is not helpful, and many have been burned by relying on intuition. This talk will provide deep, yet applicable knowledge, starting from the first principles of the invariants to maintain when changing APIs in Kubernetes. It will cover tools like schemas, conversion, validation, and admission, and present very concrete and directly applicable API Evolution Patterns. These patterns will help navigate the life cycle of CRD-based projects. This talk aims to educate on how to evolve APIs effectively and safely without inadvertently breaking users.

Speakers

Sergiusz Urbaniak

Team Lead - Kubernetes, https://mongodb.com

Sergiusz is a Kubernetes Team Lead at MongoDB. He is enthusiastic about modern infrastructure software while still enjoying minimalistic networking techniques like morse code. He worked on Mesos, container runtimes, Prometheus Operator, Thanos, upstream Kubernetes, Operators, and... Read More →

Stefan Schimanski

Senior Principal Software Engineer, Upbound

Stefan is a Senior Principal Engineer at Upbound working on control planes, Kubernetes, kcp, and as a tech lead in SIG API Machinery. He contributed a major part of the CRD feature set. Stefan is a second-time Google Summer of Code mentor with CNCF, loves to teach and help people to learn... Read More →

Platform Engineering

Content Experience Level Intermediate

2:00pm MST

The Policy Engines Showdown - Gabriel L. Manor, Permit.io; Andres Aguiar, Okta; Omri Gazitt, Aserto; Pauline Jamin, Agicap; Tyler Schade, Geico; Joy Scharmen, StrongDM

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 2 | 254 B

OPA, Cedar, OpenFGA, Topaz, OPAL, OSO, should I continue? Policy engines, languages, and standards are everywhere, making it increasingly difficult to choose a good decision engine. In this panel, I'll host four talented engineers, each from a different policy engine's core team, for a friendly showdown. We will assist the audience in making the most important decision: choosing a suitable decision engine for their specific use case. We will also delve into the nuances of running multiple engines together and learn how to scale them properly.

Speakers

Pauline Jamin

Staff Software Engineer, Agicap

Staff software engineer with a love for Domain-Driven Design (DDD) and back-end development. Skilled in leading teams and embracing the Site Reliability Engineering (SRE) philosophy. When not crafting code, you'll find me exploring the great outdoors with my loyal dog. Catch me sharing... Read More →

Tyler Schade

Distinguished Engineer, GEICO

Living in Miami, Florida, I'm an engineering lead at GEICO working on service mesh and traffic management. Prior to joining GEICO, I was at Solo.io, working on multi-cluster service mesh and API gateways. I love learning more about networking and distributed systems and sharing what... Read More →

Joy Scharmen

Senior Director, Infrastructure Engineering, StrongDM

Passionate about infrastructure, and I love learning. Tell me about the great ideas you have for building scalable sustainable humane systems!

Gabriel Manor

Director of DevRel, Permit.io

Gabriel is a senior full-stack developer who blends his passion for technical leadership, security, authorization, and devtools into his current role as the Head of Growth and DevRel at Permit.io. Before joining Permit.io, Gabriel worked as a technical leader and principal engineer... Read More →

Omri Gazitt

Co-founder & CEO, Aserto

Omri is the co-founder/CEO of Aserto, an authorization startup, and his third entrepreneurial venture. He's spent the majority of his 30-year career working on developer and infrastructure technology, most recently as the CPO of Puppet. Previously he was the VP and GM of HP's Cloud... Read More →

Andres Aguiar

Product Manager, Okta

Andres has spent his 20+ year career building tools for developers, wearing different hats. He’s been working on the identity space for the last 6 years, and is currently the Product Manager for OpenFGA.

talk pdf

Security

Content Experience Level Intermediate

2:00pm MST

Tutorial: Simplify and Optimize Your YAML with YAMLScript - Ingy döt Net, YAML LLC

Friday November 15, 2024 2:00pm - 3:30pm MST

Salt Palace | Level 1 | Grand Ballroom G

Nobody likes YAML (or anything, for that matter) when it's a giant and repetitive mess. Of course, there are already existing technologies like Helm and Kustomize that help make YAML nicer for Kubernetes. The new kid on the block is YAMLScript. Being a complete programming language (built over a vast and mature ecosystem), its capabilities are effectively limitless. That said, its primary focus is on refactoring and improving existing and new large YAML configurations. YAMLScript can help you make the most of YAML in any domain, even those that already make great use of Helm and Kustomize. Having been created by an original inventor and current lead maintainer of the YAML data language (Ingy döt Net), you can count on it meshing well with the YAML you already know. In this hands-on interactive tutorial, Ingy will teach you how to make the most of YAML and YAMLScript.

Speakers

Ingy döt Net

Ingy döt Net, YAML LLC

Ingy döt Net is one of the original inventors of the YAML data language, and its primary maintainer. He has continuously contributed to Open Source efforts since before it was called Open Source. His passion is creating software libraries that work in as many programming languages... Read More →

Tutorials, SDLC (Software Development Lifecycle)

Content Experience Level Intermediate

2:55pm MST

Thousands of Gamers, One Kubernetes Network - Surya Seetharaman, Red Hat & Girish Moodalbail, NVIDIA Inc

Friday November 15, 2024 2:55pm - 3:30pm MST

Salt Palace | Level 1 | 155 E

Uninterrupted gameplay with minimal network latency and jitter and maximum throughput is crucial for a great gamer experience. But how do we maintain consistent network quality in cloud gaming production environments at NVIDIA when 2K+ players (pods) share the same physical network for game storage and streaming? When a new player joins and a pod starts downloading large contextual game data, it is vital to shield other players on the same node from this 'noisy neighbor'. Kubernetes provides limited pod-level traffic shaping, but we needed more than that. In this talk we will show how we achieved true Quality of Service and wire-speed networking on Kubernetes clusters using Differentiated Services Code Point (RFC 7657) markings on pod traffic. Through a live demo that will involve a noisy pod and a victim pod, attendees will gain actionable insights and best practices around packet-parameter-tuned traffic shaping using simple Kubernetes Custom Resources to optimize network performance.
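
For readers unfamiliar with DSCP: the code point occupies the upper six bits of the IPv4 TOS byte (the IPv6 Traffic Class byte), so the value written to the header is the DSCP shifted left by two bits. The sketch below shows that arithmetic and how an application could mark its own socket. It is an illustration only, not the custom-resource-based mechanism presented in the talk; `dscp_to_tos` and `mark_socket` are hypothetical helpers, and whether the mark is honored depends on the platform and the network.

```python
import socket

# Well-known code point for Expedited Forwarding (low-latency traffic).
DSCP_EF = 46


def dscp_to_tos(dscp: int) -> int:
    """Convert a 6-bit DSCP value to the 8-bit TOS/Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits (0-63)")
    # The low two bits of the byte are reserved for ECN, hence the shift.
    return dscp << 2


def mark_socket(sock: socket.socket, dscp: int) -> None:
    """Ask the OS to stamp outgoing IPv4 packets on this socket with dscp."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


ef_tos = dscp_to_tos(DSCP_EF)
```

Here `ef_tos` is 184 (0xB8), the byte value a packet capture would show for Expedited Forwarding traffic. Marking in the application is only one option; marking at the node or CNI level keeps the policy out of workload code.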

Speakers

Girish Moodalbail

Distinguished Engineer, NVIDIA Inc

Girish Moodalbail, a Distinguished Engineer at Nvidia Inc., builds Kubernetes-based GPU compute for gaming, AI training, and inferencing with low-latency, high-throughput, reliable, scalable, and secure networking using OSS (OVS, OVN, OVN-K8s CNI) and NVIDIA hardware. With over 22... Read More →

Surya Seetharaman

Principal Software Engineer, Red Hat Inc.

Connectivity

Content Experience Level Intermediate

2:55pm MST

Object, Block, or File Storage? Choosing the Right Cloud Storage to Integrate Into Kubernetes - Mitch Becker & Tom McDonald, Amazon Web Services (AWS)

Friday November 15, 2024 2:55pm - 3:30pm MST

Salt Palace | Level 1 | Grand Ballroom A

This presentation simplifies the container storage landscape to help K8s users make educated cloud storage choices based on their workload requirements and data strategy. You already know K8s is an open-source platform that orchestrates containerized applications. But what type of cloud storage should one deploy for stateless and stateful applications to ensure persistent data across various operational scenarios? Different storage types cater to specific use cases within K8s environments. Organizations often require persistent storage to run K8s for stateful use cases such as Large-Scale Application Deployment, High-Performance Computing (HPC), AI/ML, Microservices Management, CI/CD Pipelines, and Big Data Processing. Because Block, File, and Object Storage are used in varying ways for containerized workloads, this talk will explain use cases for each storage type and educate the attendees so their selection of storage supports their applications and overall data strategy.

Speakers

Tom McDonald

Sr. Storage Specialist SA, AWS

Tom McDonald is a Senior Workload Storage Specialist at AWS. Starting with an Atari 400 and re-programming tapes, Tom began a long interest in increasing performance on any storage service. With 20 years of experience in the Upstream Energy domain, file systems and High-Performance...

Mitch Becker

Sr. Storage Specialist, Amazon Web Services (AWS)

Accomplished cloud professional transforming and modernizing IT environments: Cloud Computing, Cloud Storage, HPC, AI, Containers, DevOps, & Cloud Adoption/Migration/Transformation. • CNCF Storage Technical Advisory Group Member • AWS --- Certified Cloud Practitioner, Industry...

Data Processing + Storage

Content Experience Level Intermediate

2:55pm MST

This Platform Goes to 11: Boost Developer Productivity with Lessons from Salesforce - Joe Kutner, Salesforce

Friday November 15, 2024 2:55pm - 3:30pm MST

Salt Palace | Level 2 | 251 AD

Internal platforms play an essential role in boosting the productivity of developers who use cloud native technologies. That’s why Salesforce, a global leader in the cloud for more than two decades, evolved its existing collection of managed services and capabilities into a cohesive platform that delights developers. In this talk, you’ll learn how Salesforce's platform removes friction, unifies interfaces, and meets developers where they are with industry standard tooling. As you design and build your own platforms, you’ll be able to use the same principles that guided Salesforce to accelerate day-1 onboarding of new apps, increase the speed of the developer inner-loop and testing cycles, and reduce the time it takes to deliver new code to production. Our lessons learned will help you avoid missteps. Finally, you’ll learn how to measure developer satisfaction, performance, activity, collaboration, and efficiency to ensure that your platform delivers the most value for your developers.

Speakers

Joe Kutner

Software Architect, Salesforce

Joe is co-founder of the Cloud Native Buildpacks project, which aims to make containerization more secure and more developer friendly. He started the project in 2018 while working as DX Architect at Salesforce Heroku, and today is the DX Architect for Salesforce’s Hyperforce platform...

Platform Engineering

Content Experience Level Intermediate

4:00pm MST

Divide and Conquer: Master GPU Partitioning and Visualize Savings with OpenCost - Kaysie Yu & Ally Ford, Microsoft

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 2 | 255 E

Kubernetes is the ideal platform for running AI and ML workloads, such as LLMs. GPU nodes are often used for their parallel processing capabilities and higher performance benefits; however, they are known to be costly. Many factors impact the cost of running AI/ML workloads, such as GPU utilization, GPU VM size, and idle time. These costs are often ignored and considered inherent in running GPU workloads. But when workloads run at scale and are left unoptimized, costs quickly spin out of control. In this talk, we leverage NVIDIA DCGM exporter with Prometheus for GPU metrics monitoring alongside OpenCost to measure the Kubernetes spend of our GPU workloads. We will provide an overview of OpenCost, highlighting its role in bridging the gap between the developer and platform teams through visibility and accountability of spend. We will demonstrate how to use the NVIDIA GPU Operator and how techniques such as partitioning can lead to significant cost savings.

Speakers

Ally Ford

Product Manager, Microsoft

Ally is a Product Manager on the Azure Kubernetes Service (AKS) team at Microsoft Azure. She spends her days collaborating with customers to design features that improve the end to end operator experience for both Linux and Windows users. Formerly she was a UX designer and project...

Kaysie Yu

Product Manager, Microsoft

Kaysie Yu is a Product Manager on the Azure Kubernetes Service team at Microsoft. She works on cost management and optimization and is passionate about the convergence of FinOps and GreenOps, advocating for best practices that help organizations achieve cost efficiency while contributing...

AI + ML

Content Experience Level Intermediate

4:00pm MST

Topology Aware Routing: Understanding the Tradeoffs - Rob Scott, Google

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 1 | 155 E

In Kubernetes 1.31, a new TrafficDistribution field on Services graduated to beta. This is effectively our third attempt at solving Topology Aware Routing in Kubernetes. This talk will tell the story of how we got here and what we learned along the way, outlining what exactly has made this problem so surprisingly complex. With that context, we’ll dive into exactly how Traffic Distribution works today, and when you should configure it. You’ll learn about how it’s implemented today, and how better implementations may be written in the future. We'll walk through some examples to show how it can work well, and when it may not. Finally, we’ll cover how this concept will interact with autoscaling, load balancers, Ingresses, Gateways, and Multi-Cluster Services. You should leave this talk with a clear understanding of how Topology Aware Routing works in Kubernetes, when to use it, and a broad awareness of the work that’s still in progress in this space.

Speakers

Rob Scott

Software Engineer, Google

Connectivity

Content Experience Level Intermediate

4:00pm MST

The Node Tetris Rabbit Hole: Why Your Binpacking Might Be Underperforming - Hannah Taub, Adobe Inc.

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 1 | 155 B

Have you ever looked at your Kubernetes cluster and thought “I have a perfectly good autoscaler! Why are all my nodes at less than 50% capacity?” When a team moves to the scale of hundreds of clusters with thousands of nodes, efficient binpacking changes from a side task to a financial necessity. From inefficient client apps to long-buried cluster configs, follow the Adobe Ethos team as they track down leads on what’s causing cluster underutilization and how to fix it. You will also learn some tips for designing your clusters to avoid these issues in the first place.

Speakers

Hannah Taub

Ms., Adobe Inc.

As a senior software engineer, Hannah has been working with Adobe’s Cloud Cost Efficiency team for the past several years. After graduating from the University of Edinburgh, she went from writing content APIs at Viacom (now Paramount) to building out Adobe’s Ethos Kubernetes CI/CD...

Operations + Performance

Content Experience Level Intermediate

4:00pm MST

Migratory Patterns: Making Architectural Transitions with Confidence and Grace - Pete Hodgson, PartnerSlate

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 2 | 255 B

Big technical migrations - like switching databases - can feel like you're swapping out the engine of a bus while continuing to drive down the freeway (with all your users screaming in the back). However, there are ways to make these transitions safe, incremental, and low-stress. In this talk we'll walk through a real-world case study of switching a production system from one database to another with no downtime, and no tears, using techniques like Expand/Contract, Dark Launch and Parallel Run. We'll also see hands-on examples of using CNCF open standards like OpenFeature and OpenTelemetry to manage this migration.

Speakers

Pete Hodgson

CTO, PartnerSlate

Pete Hodgson is an independent software delivery consultant. He helps engineering teams to level up and tackle their thorniest challenges, with a focus on agile engineering practices, architectural evolution, and lean process management. Prior to going independent he spent several...

SDLC

Content Experience Level Intermediate

4:00pm MST

SPIFFE the Easy Way: Universal X509 and JWT Identities Using cert-manager - Tim Ramlot & Ashley Davis, Venafi

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 1 | Grand Ballroom B

SPIFFE is incredible. Each workload is assigned its own universal identity, simplifying the security and management of communications in distributed systems. While SPIRE (the reference SPIFFE implementation) is exceptionally powerful, it is also quite complex. Deploying SPIRE on Kubernetes requires StatefulSets, which can be challenging and frustrating. Many cloud vendors are starting to offer turnkey SPIFFE solutions, but that comes with the risk of vendor lock-in. In this talk, we will demonstrate how to use the Cloud Native cert-manager solution to implement SPIFFE (X.509 and JWT) with low operational overhead for all Kubernetes workloads. The session includes all you need to know to issue X.509 SVIDs, use them and validate them. Additionally, we will introduce an experimental solution to convert X.509 SVIDs into JWT SVIDs. The demo will highlight how to authenticate to third-party APIs (such as AWS, GCP, Azure, and others) using these JWT SVIDs.

Speakers

Ashley Davis

Staff Software Engineer, Venafi

As a teenager, Ash taught himself to program after wondering how exactly video games were made. That led to adventures trawling through open source codebases, sparking an interest in computers spanning from bare-metal machine code right up to scalable distributed platforms like Kubernetes...

Tim Ramlot

Senior Software Engineer - cert-manager maintainer, Venafi

Tim started working at Venafi as a software engineer after his graduation as computer science engineer at Ghent University. He learned about cert-manager and Venafi through a Google Summer of Code internship. His mission at Venafi is to advance his problem solving skills, whilst contributing...

Security

Content Experience Level Intermediate

4:00pm MST

Why Perfect Compliance Is the Enemy of Good Kubernetes Security - Michele Chubirka, Google

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 2 | 254 B

Technology organizations often struggle over who should manage the security of their Kubernetes environment. This task usually falls to platform or cloud engineering teams, but they often feel abandoned by their security counterparts, uncertain of which requirements will deliver real security value. While published benchmarks and security guides for Kubernetes are helpful, not all recommendations work for every use-case. They may require Kubernetes alpha or beta features which could cause issues with platform stability. Our desire to prioritize “perfect” security over having a functional platform that addresses relevant risks can leave us with nothing, frustrating everyone. Kubernetes is meant to increase application delivery velocity, but when overly strict compliance prevents a team from moving forward, they will subvert security requirements. Let’s stop obsessing over the red in our security and compliance dashboards and focus on what adds real value by reducing risk.

Speakers

Michele Chubirka

Cloud Security Advocate, Google

Michele Chubirka is a recovering Unix and network engineer currently working as a cloud security advocate for Google. She has been an architect, podcaster and freelance writer for various B2B publications such as Network Computing, Dark Reading and TechTarget. She likes long walks...

Security

Content Experience Level Intermediate

4:55pm MST

Best of Both Worlds: Integrating Slurm with Kubernetes in a Kubernetes Native Way - Eduardo Arango Gutierrez, NVIDIA & Angel Beltre, Sandia National Laboratories

Friday November 15, 2024 4:55pm - 5:30pm MST

Salt Palace | Level 2 | 250 AD

It's not always clear which container orchestration system is best suited for a given use case. Slurm, for example, is often preferred over Kubernetes when running large-scale distributed workloads. As a result, organizations are often faced with a hard choice: do they deploy Slurm or Kubernetes to service the rising demands of their AI/ML workloads? In this talk, we introduce K-Foundry, an open-source custom controller for KCP that translates Kubernetes jobs to Slurm jobs and exposes Slurm nodes and cluster info as Kubernetes Custom Resource Definitions (CRDs). This integration combines Slurm’s robust job scheduling with Kubernetes' dynamic orchestration and API-driven ecosystem, easing the administration of both clusters through a common API. This session will end with a live demo, where attendees will see how this integration bridges the gap between cloud and HPC, facilitating resource management and optimizing performance for large-scale AI and LLM tasks.

Speakers

Eduardo Arango Gutierrez

Senior systems software engineer, NVIDIA

Angel Beltre

Senior Member of Technical Staff, Sandia National Laboratories

Angel Beltre serves as a senior member of the technical staff within the Scalable System Software department at Sandia National Laboratories. He is a contributor to the CSSE Computing-as-a-Service (CaaS) initiative, aimed at streamlining the deployment of modeling and simulation tools...

AI + ML

Content Experience Level Intermediate

4:55pm MST

Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh & Rupeng Liu, Google

Friday November 15, 2024 4:55pm - 5:30pm MST

Salt Palace | Level 2 | 255 E

Large Language Models have shown remarkable capabilities in various tasks, from text generation to code writing. However, the inference process for these models presents significant challenges. LLMs are computationally intensive, often requiring specialized hardware like TPUs or GPUs to achieve reasonable response times. In some cases their substantial size can strain the resources of a single machine. Specifically, models such as Gemini, Claude, and GPT4 are too large to fit on any single GPU or TPU device, let alone on any single multi-accelerator machine, necessitating what we refer to as multi-node server deployment where a single model server “backend” runs as a distributed process on multiple nodes to harness enough accelerator memory to fit and run the model. This talk presents LeaderWorkerSet, a new k8s API that enables multi-node model inference. We demonstrate its capabilities by orchestrating state of the art model servers such as vLLM and JetStream on both GPUs and TPUs.

Speakers

Abdullah Gharaibeh

Staff Software Engineer, Google

Abdullah is a staff software engineer at Google and sig-scheduling and working group batch co-chair. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.

Rupeng Liu

Software engineer, Google

Rupeng Liu is a software engineer on Google's Kubernetes inference team.

AI + ML

Content Experience Level Intermediate

4:55pm MST

Service Profiling Based Management and Scheduling in K8s - Jia Deng, Cong Xu & Mingmeng Luo, Bytedance

Friday November 15, 2024 4:55pm - 5:30pm MST

Salt Palace | Level 1 | 155 B

We present an open-source solution for the efficient management of resources and scheduling strategies in K8s. Our solution constructs workload-specific resource profiles based on their historical utilization patterns. This approach ensures that workloads receive adequate resources while optimizing overall resource utilization. To accomplish this objective, we employ a custom resource, Service Profiling Description (SPD), facilitating a direct correlation between workloads (such as Deployments and StatefulSets) and their resource usage. Resource utilization metrics, including CPU, disk I/O, and network I/O, are meticulously collected and aggregated. These usage indicators play a pivotal role in informing the scheduler's decisions regarding workload allocation. This solution has been deployed within large-scale K8s clusters, addressing diverse workload demands, ranging from those requiring dedicated NUMA nodes to those capable of resource sharing among themselves.

Speakers

Mingmeng Luo

Software Engineer, Bytedance

Mingmeng Luo is a software engineer in the Infrastructure Department at ByteDance, where he specializes in the design and development of precision resource management technologies for large-scale Kubernetes clusters. His work focuses on optimizing resource allocation and efficiency...

Cong Xu

Senior Software Engineer, Bytedance

Cong Xu is a Tech Lead and Senior Software Engineer at ByteDance, where he focuses on building and optimizing the container-based cloud platform that hosts internal products such as Douyin and TikTok. From 2016 to 2022, he served as a Staff Research Member at IBM Research, contributing...

Jia Deng

Software Engineer, Bytedance

The speaker currently works for the ByteDance K8s orchestration team. Before that, the speaker worked for Amazon EKSA and VMware Tanzu Mission Control.

Operations + Performance

Content Experience Level Intermediate

4:55pm MST

Zero Downtime Upgrades at Scale: How Okta Manages Hundreds of Clusters Daily - Jérémy Albuixech & Kahou Lei, Okta

Friday November 15, 2024 4:55pm - 5:30pm MST

Salt Palace | Level 2 | 251 AD

How do you upgrade your K8s clusters? Perhaps a rolling update of nodes, with services moving around? Can you guarantee a zero-downtime upgrade? Will this method scale and support the velocity of production environments? Likely not. But fear not - you are not alone! At Okta, we maintain hundreds of clusters, each hosting >130 services, with node counts ranging from 20-400, and we are updating them daily. How do we do it? Without an out-of-the-box solution, we had to build our own, and we want to share what we learned with all of you! In this talk Kahou and Jeremy will go over the challenges and successes, highlighting how their deployment method provides the foundational blocks to build extra features while reducing the blast radius when something goes wrong - thanks to quick rollbacks and canary rollouts. In this session attendees will learn how we leverage open source technologies to tackle three main problems: how to scale, how to secure and how to upgrade clusters with no downtime.

Speakers

Jérémy Albuixech

Staff Software Engineer, Okta

Jeremy is a Staff Software Engineer at Okta. Starting as a full stack programmer with a good foundation in JavaScript, he then gravitated towards a DevOps role and later became a member of the SRE team at Cisco, picking up an IaC, observability and Kubernetes skillset. With the Okta...

Kahou Lei

Principal Software Engineer, Okta

Kahou Lei is a Principal Software Engineer with a strong background in Cloud infrastructure and Kubernetes. With 20 years of industry experience, he has held significant positions at renowned companies such as Okta and Cisco. Kahou leads critical software engineering initiatives...

Platform Engineering

Content Experience Level Intermediate