In-person
November 12-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Intermediate
Wednesday, November 13
 

11:15am MST

Architecting Tomorrow: The Heterogeneous Compute Resources for New Types of Workloads - Alexander Kanevskiy, Intel Finland
Wednesday November 13, 2024 11:15am - 11:50am MST
Imagine managing a set of diverse workloads on a Kubernetes node, operating across dozens of CPU cores and several memory zones. But do you truly comprehend the difference between one CPU core and another? Are you aware of the impact that different memory zones might have on your workload's efficiency? Will optimisations for one type of workload be helpful for another? Do you think your ML workload will behave the same way as, say, Redis? This presentation delves deep into CPU internals, memory types (DRAM, HBM, CXL), and diverse cache/core types and layouts. Explore recent hardware advancements and their impact on workloads. We'll examine native compute resource allocation strategies from a hardware point of view, crucial for enhancing workload performance and optimising energy usage and cost efficiency. Join us and learn the details of modern hardware architecture that give you a framework for making more informed choices on hardware resource optimisation for your infrastructure.
Speakers
Alexander Kanevskiy

Principal Engineer, Cloud Orchestration Software, Intel Finland
Alexander is currently employed by Intel as Principal Engineer, Cloud Software, focusing on various aspects of Kubernetes: Resource Management, Device plugins for hardware accelerators, Cluster Lifecycle, and Cluster APIs. Alexander has 25+ years of experience in areas of Linux...
Wednesday November 13, 2024 11:15am - 11:50am MST
Salt Palace | Level 2 | 254 B
  Emerging + Advanced

12:10pm MST

Operationalizing High-Performance GPU Clusters in Kubernetes: Lessons Learned from Training Databricks DBRX - Will Gleich & Wai Wu, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Training large language models (LLMs) on GPUs within Kubernetes environments involves significant configuration and complexity, often leading to unique failure scenarios. This presentation will cover the lessons learned from training DBRX, a state-of-the-art LLM that we developed on a 400-node cluster with a primary workload utilizing 3,072 GPUs, and the tooling needed to measure and maintain a healthy fleet of nodes and the underlying interconnect fabric.
This will include:
  1. How we implemented GPU health detection leveraging Prometheus and DCGM Exporter 
  2. How we monitor GPU Direct Remote Direct Memory Access (GDRDMA) and the challenges of monitoring components that bypass CPU 
  3. Discussion of failure scenarios during training, and how they were addressed.
Databricks Mosaic AI Training leverages GPU clusters across many cloud providers to maximize availability; we will also discuss the variations we see across providers and how we had to engineer around them.
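The health-detection step above can be illustrated with a minimal sketch. The metric names mirror DCGM exporter fields, but the thresholds and helper functions are hypothetical, not what Databricks actually runs:

```python
# Illustrative sketch: classify GPU health from DCGM-style metrics.
# Metric names mirror DCGM exporter fields; thresholds are hypothetical.

def classify_gpu(metrics: dict) -> str:
    """Return 'healthy', 'degraded', or 'unhealthy' for one GPU."""
    if metrics.get("DCGM_FI_DEV_XID_ERRORS", 0) > 0:
        return "unhealthy"          # XID errors usually indicate a fault
    if metrics.get("DCGM_FI_DEV_ECC_DBE_VOL_TOTAL", 0) > 0:
        return "unhealthy"          # double-bit ECC errors
    if metrics.get("DCGM_FI_DEV_GPU_TEMP", 0) >= 90:
        return "degraded"           # thermal-throttling territory
    return "healthy"

def unhealthy_nodes(fleet: dict) -> list:
    """fleet maps node name -> list of per-GPU metric dicts."""
    return sorted(
        node for node, gpus in fleet.items()
        if any(classify_gpu(g) == "unhealthy" for g in gpus)
    )
```

A real pipeline would scrape these fields from the DCGM exporter's Prometheus endpoint and cordon or drain the nodes flagged unhealthy.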
Speakers

Wai Wu

Software Engineer, Databricks
Hey everyone, I was born to deploy Kubernetes ꈍ ω ꈍ

Will Gleich

Sr. DevOps Engineer, Databricks
Will Gleich is a Sr. DevOps engineer at Databricks specializing in MLOps and Site Reliability Engineering.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

12:10pm MST

Can Your Kubernetes Network Handle the Heat? Building Resilience with AI Chaos - Lior Lieberman, Google & Surya Seetharaman, Red Hat
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Kubernetes networking is complex, with many APIs, numerous configurations, and potential failure points. In the rapidly evolving world of cloud-native applications, ensuring your Kubernetes network can withstand unexpected failures is not just an advantage: it is a necessity. In this talk, Surya and Lior, who hold leadership roles in the Gateway API and NetworkPolicy API projects, will demonstrate how you can leverage AI-powered Chaos Engineering to stress test Gateways, NetworkPolicies, and Services on a live cluster! They will share their experiences and lessons learned from using Litmus and enhancing K8sGPT to design and execute AI Chaos experiments, focusing on how you can proactively find gaps and bottlenecks in the network infrastructure. This is a great opportunity to learn from real-world disruption scenarios and participate in a collaborative discussion on how we can leverage AI to build robust Kubernetes networks.
Speakers
Surya Seetharaman

Principal Software Engineer, Red Hat Inc.
Surya is an Open Source advocate and contributor, active in the Kubernetes SIG-Network working group. She works as a Principal Software Engineer at Red Hat on the OpenShift Networking team. Her areas of interest include Cloud Infrastructure and Networked Services and Systems...
Lior Lieberman

Site Reliability Engineer, Google
Lior is a site reliability engineer at Google working on Google Compute Engine. He is a leading maintainer of ingress2gateway and an active contributor to Kubernetes SIG Network, focused on Gateway API.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | 155 E
  Connectivity

12:10pm MST

Automated Multi-Cloud, Multi-Flavor Kubernetes Cluster Upgrades Using Operators - Ziyuan Chen, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Databricks manages over a thousand k8s clusters across three major cloud providers, running critical workloads in cloud regions around the world. This talk describes the system we built to upgrade nodes' operating systems, k8s versions, and other configs monthly, supporting EKS, AKS, GKE, and self-managed k8s. Our system is built on k8s operators and performs zero-downtime blue-green rolling updates; it respects contracts with services through features like PDBs, maintenance windows, deferred node draining, and custom workload-handling plugins. It enables easy rollbacks, has good observability, and incurs minimal human operational cost. This has allowed us to patch vulnerabilities and release infrastructure changes quickly and reliably across the fleet. We will also share lessons learned on building several operators that work together using the controller-runtime framework, designing the declarative interfaces between them, and achieving consistent behavior across three clouds.
Speakers

Ziyuan Chen

Staff Software Engineer, Databricks
Ziyuan Chen is a software engineer at Databricks. He has worked on Databricks' compute and OS infrastructure areas.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | 155 B
  Operations + Performance

12:10pm MST

Automated Multi-Cloud Large Scale K8s Cluster Lifecycle Management - Sourav Khandelwal, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
I will present the system developed for cluster rotations across Databricks’ fleet of over a thousand cloud-managed k8s clusters on AWS, Azure, and GCP. Blue-green cluster rotations, or cluster swaps (upgrading by creating a new k8s cluster with a new version/configuration & shifting workloads from the old cluster), allow us to implement major infrastructure changes and upgrade k8s versions with low risk through staged rollouts, seamless rollbacks, zero downtime, and minimal operator intervention. Our system includes a k8s-style continuous reconciliation mechanism to manage cluster swap lifecycles, a fast and reliable cluster state change discovery system, and a k8s workload migration system. We will share methodologies and experiences in constructing this loosely coupled system that orchestrates product workloads and cloud provider APIs for automated cluster swaps. This session will explore the challenges faced, and the benefits of automating large-scale, multi-cloud k8s upgrades.
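The continuous-reconciliation idea behind cluster swaps can be reduced to a toy controller step. The state keys and action names below are invented for illustration, not Databricks' actual interfaces:

```python
# Toy reconcile step for a blue-green cluster swap: compare observed
# state to the desired end state and emit the next action. All state
# keys and action names here are made up for illustration.

def next_action(state: dict) -> str:
    """Given observed swap state, return the next step to take."""
    if not state.get("green_created"):
        return "create-green-cluster"
    if not state.get("green_healthy"):
        return "wait-for-health"
    if state.get("workloads_on", "blue") == "blue":
        return "migrate-workloads"
    if not state.get("blue_deleted"):
        return "delete-blue-cluster"
    return "done"

def reconcile_until_done(state: dict) -> list:
    """Drive the swap to completion, recording each action taken."""
    effects = {
        "create-green-cluster": ("green_created", True),
        "wait-for-health": ("green_healthy", True),
        "migrate-workloads": ("workloads_on", "green"),
        "delete-blue-cluster": ("blue_deleted", True),
    }
    actions = []
    while (action := next_action(state)) != "done":
        actions.append(action)
        key, value = effects[action]
        state[key] = value  # pretend each action succeeds immediately
    return actions
```

Rollback falls out naturally: reconciling back toward the old desired state reverses the swap.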
Speakers
Sourav Khandelwal

Sr. Software Engineer, Databricks
I am a seasoned software engineer with over 10 years of experience in designing and managing large-scale platforms in cloud-native environments. At Databricks, my significant contributions have been pivotal in launching our next-generation cloud infrastructure that helped to transition...
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | Grand Ballroom H
  Platform Engineering

12:10pm MST

The Hard Truth About GitOps and Database Rollbacks - Rotem Tamir, Ariga
Wednesday November 13, 2024 12:10pm - 12:45pm MST
For two decades now, the common practice for handling rollbacks of database schema migrations has been pre-planned "down migration scripts". A closer examination of this widely accepted truth reveals critical gaps that result in teams relying on risky, manual operations to roll back schema migrations in times of crisis. In this talk, we show why our existing tools and practices cannot deliver on the GitOps promise of "declarative" and "continuously reconciled" workflows and how we can use the Operator Pattern to build a new solution for robust and safe schema rollbacks.
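The declarative alternative the talk argues for can be sketched in a few lines: instead of executing a pre-planned down script, a reconciler diffs the desired schema against the live one and converges. The table names here are made up:

```python
# Declarative schema reconciliation sketch (table names are invented):
# compute the operations needed to converge live state toward desired.

def diff_schema(current: set, desired: set) -> dict:
    """Return the operations a reconciler would apply."""
    return {
        "create": sorted(desired - current),
        "drop": sorted(current - desired),
    }

# Rolling back is just reconciling toward the previous desired state:
v2 = {"users", "orders", "audit_log"}
v1 = {"users", "orders"}
live = set(v2)  # cluster is currently at v2

rollback_plan = diff_schema(live, v1)
```

No hand-written down migration is involved; the rollback plan is derived from state, which is the continuously-reconciled behavior GitOps promises.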
Speakers

Rotem Tamir

CTO, Ariga
Rotem Tamir (39), father of two. Co-founder and CTO of Ariga, co-maintainer of Atlas and Ent. Ex-data platform architect at Nexar, infrastructure team lead at ironSource.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 2 | 250 AD
  SDLC

2:30pm MST

Optimizing LLM Performance in Kubernetes with OpenTelemetry - Ashok Chandrasekar, Google & Liudmila Molkova, Microsoft
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Large Language Models are increasing in popularity, and their deployments on Kubernetes have steadily increased. LLM applications bring new usage patterns that the industry does not yet have expertise in. At the same time, there is a lack of observability in these deployments, which makes it difficult to debug performance issues. We will present an end-to-end walkthrough of how you can leverage client- and server-side LLM observability using OpenTelemetry, based on recent efforts in the Kubernetes and OpenTelemetry communities to standardize these signals across LLM clients and model servers. We will also demonstrate how to troubleshoot a real-world performance issue in your LLM deployment and how to optimize your LLM server setup for better performance on Kubernetes. We'll show how to use Kubernetes autoscaling based on custom model server metrics and demonstrate how they offer a superior alternative to GPU utilization metrics for such deployments.
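For context, autoscaling on a custom model-server metric (such as queue depth) ultimately feeds the stock HPA scaling rule documented by Kubernetes, desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). A small sketch, with illustrative replica bounds:

```python
import math

# The stock Kubernetes HPA scaling rule applied to a custom metric,
# e.g. a model server's request-queue depth:
#   desired = ceil(current_replicas * metric_value / target_value)
# The min/max bounds below are illustrative defaults.

def desired_replicas(current_replicas: int,
                     metric_value: float,
                     target_value: float,
                     min_r: int = 1, max_r: int = 50) -> int:
    desired = math.ceil(current_replicas * metric_value / target_value)
    return max(min_r, min(max_r, desired))
```

With a queue-depth target of 10 per replica, 4 replicas observing a depth of 30 scale to 12, regardless of what GPU utilization happens to read at that moment.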
Speakers
Liudmila Molkova

Principal Software Engineer, Microsoft
Liudmila Molkova is a Principal Software Engineer at Microsoft working on observability and Azure client libraries. She is a co-author of distributed tracing implementations across the .NET ecosystem, including HTTP client instrumentation and Azure Functions. Liudmila is an active...
Ashok Chandrasekar

Senior Software Engineer, Google
Ashok Chandrasekar is a Senior Software Engineer at Google working on the AI/ML experience for Google Kubernetes Engine. Previously he was a Staff Engineer at VMware, where he led the cluster lifecycle management area for Tanzu Mission Control. He has 7 years of Kubernetes experience working...
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

2:30pm MST

AIStore as a Fast Tier Storage Solution: Enhancing Petascale Deep Learning Across Cloud Backends - Abhishek Gaikwad & Aaron Wilson, NVIDIA
Wednesday November 13, 2024 2:30pm - 3:05pm MST
As deep learning continues to evolve, the demand for handling petascale datasets efficiently becomes paramount. Current cloud storage solutions often struggle with the speed (throughput) and cost-effectiveness required for these massive datasets, particularly due to the random access needs of machine learning workloads. This talk introduces AIStore (AIS) as a fast-tier storage solution designed to overcome these challenges by offering a fast, scalable, cost-effective tier for deep learning data. AIS features linear scalability with each added storage node - in fact, with each added drive. In this presentation, we will explore the architecture and benefits of AIStore, focusing on its linear scalability and high performance. This session will feature detailed benchmarks and use cases comparing the performance of accessing cloud datasets with and without AIStore, highlighting AIS's ability to deliver high per-GPU throughput and stable latencies.
Speakers
Aaron Wilson

Senior Software Engineer, NVIDIA
I'm a developer on the AIStore project, focused mainly on our Kubernetes deployments and Python SDK for training workflows. Most recently, I've been working on the AIS K8s Operator and Helm charts, improving AIStore deployment and lifecycle management. My past experience has been...
Abhishek Gaikwad

Software Engineer, NVIDIA
Abhishek Gaikwad is a Software Engineer at NVIDIA with a Master of Science degree in Computer Science from San Jose State University. As a key developer and maintainer of AIStore, Abhishek has played a crucial role in its design, development, and management. His contributions include...
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom A
  Data Processing + Storage

2:30pm MST

Unifying Observability: Correlating Metrics, Traces, and Logs with Exemplars and OpenTelemetry - Kruthika Prasanna Simha & Charlie Le, Apple
Wednesday November 13, 2024 2:30pm - 3:05pm MST
In modern distributed systems, observability is key to understanding application performance and behavior. While metrics, traces, and logs each provide valuable insights, their true power is realized when they are correlated. This talk will dive into the practical benefits and implementation of correlating these signals with exemplars using the OpenTelemetry SDK and Collector, and showcase the results in Grafana. Attendees will learn how to leverage OpenTelemetry to create exemplars which will allow them to navigate from either logs or metrics to their traces.
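The correlation idea can be shown with a hand-rolled sketch (deliberately not the OpenTelemetry SDK API): a histogram observation remembers the trace ID that produced it, so a latency spike in a metric can be followed straight to a trace:

```python
# Hand-rolled illustration of the exemplar concept; the real mechanism
# is provided by the OpenTelemetry SDK and Collector. The trace ID and
# bucket bounds below are made-up example values.

BUCKETS = [0.05, 0.1, 0.5, 1.0, float("inf")]  # seconds

def observe(hist: dict, value: float, trace_id: str) -> None:
    """Record a latency sample and attach the current trace as exemplar."""
    for bound in BUCKETS:
        if value <= bound:
            slot = hist.setdefault(bound, {"count": 0, "exemplar": None})
            slot["count"] += 1
            slot["exemplar"] = trace_id  # latest trace seen in this bucket
            break

hist = {}
observe(hist, 0.07, trace_id="4bf92f3577b34da6a3ce929d0e0e4736")
```

A Grafana-style drill-down then navigates from the slow bucket directly to the stored trace ID.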
Speakers
Kruthika Prasanna Simha

Senior Software Engineer, Apple
Kruthika is a software engineer at Apple specializing in building ML-enabled observability solutions. She holds a Masters in Computer Engineering and has specialized in Machine Learning. In her free time, she likes to dabble with Jupyter Notebooks for running experiments with data...
Charlie Le

Senior Software Engineer, Apple
Charlie is a software engineer at Apple, specializing in building and scaling cloud native observability solutions and infrastructure. Deeply inspired by the collaborative spirit of open source, he actively contributes to projects like Cortex and OpenTelemetry, shaping the future...
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom B
  Observability

2:30pm MST

Does My K8s Application Need CPR? Performance Evaluation of a Multi-Cluster Workload Management App - Braulio Dumba & Ezra Silvera, IBM
Wednesday November 13, 2024 2:30pm - 3:05pm MST
KubeStellar (KS) is an open-source Kubernetes multi-cluster workload configuration management system that can be used to manage AI workloads in multi-cluster environments. Hence, understanding KS performance is crucial, especially when managing resource-intensive AI workloads. In this talk, we will present our experience analyzing the performance metrics of KS across several dimensions of scalability (e.g., number of bindingPolicies, workload description spaces, and number of managed remote clusters) and the challenges that arise when conducting performance experiments in a multi-cluster environment. Our insights will demonstrate the utility of benchmarking the performance of a multi-cluster Kubernetes workload management application. Additionally, we will demonstrate the usefulness of several open-source tools, such as clusterloader2, kube-burner, and kwok, for evaluating the performance of multi-cluster Kubernetes management applications.
Speakers
Ezra Silvera

Senior Technical Staff Member, IBM
Ezra Silvera is a Senior Technical Staff Member at IBM Research. His interests include distributed systems, cloud management, and cloud infrastructure. Ezra is passionate about open-source technologies and has been involved in several notable open source projects such as Docker, KubeVirt...
Braulio Dumba

Staff Research Scientist, IBM
Dr. Braulio Dumba is a Staff Research Scientist at IBM Research. In 2018, he joined IBM under the Hybrid Cloud organization. His current research is focused on edge computing and hybrid cloud computing. Dr. Dumba earned a Ph.D. in Computer Science from the University of Minnesota, Twin...
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | 155 B
  Operations + Performance

2:30pm MST

Better Pod Availability: A Survey of the Many Ways to Manage Workload Disruptions - Zach Loafman, Google
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Kubernetes Pods are ephemeral, but some are more ephemeral than others. Kubernetes provides a dizzying array of options to manage and handle Pod disruption. From PodDisruptionBudgets, to "safe-to-evict" annotations, GracefulTermination timeouts and more, it can be incredibly hard to determine the optimal solution for handling Pod disruption and how to manage gracefully terminating your application. Thankfully, due to the extensible nature of Kubernetes we can build CRDs and controllers that can simplify these complex topics for end users. In this talk, we'll present an in-depth analysis of the built-in options and how they work (or don't). While this problem is not unique to game-serving, we'll deep-dive and explain how Agones (an open-source session orchestration system layered on Kubernetes) solves this problem with a simple abstraction to hide the complexity!
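As one concrete example of the options surveyed, the core check behind a PodDisruptionBudget with minAvailable is simple to state (the real eviction API also handles maxUnavailable and percentage values):

```python
# The arithmetic a PodDisruptionBudget implies for a voluntary
# eviction: it is admitted only if the healthy pod count stays at or
# above minAvailable afterwards. Simplified to the minAvailable case.

def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    return healthy_pods - 1 >= min_available
```

With 3 healthy pods and minAvailable=2 an eviction proceeds; with only 2 healthy pods it is refused, which is exactly what makes node drains block.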
Speakers

Zach Loafman

Staff Software Engineer, Google
Zach leads Google’s GKE Games team. He was previously lead of the Kubernetes Control Plane team for GKE, lead of the GKE Cluster Lifecycle team, worked on Kubernetes prior to GA, and was one of the founding members of the Google Kubernetes Engine team.
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom H
  Platform Engineering

2:30pm MST

Tutorial: Confidential Containers 101: A Hands-on Workshop - Archana Choudhary & Suraj Deshmukh, Microsoft
Wednesday November 13, 2024 2:30pm - 4:00pm MST
Here is how you can be prepared for the workshop, to follow along: https://github.com/surajssd/kubecon-na24-workshop/blob/main/preparation.md

As traditional enterprises with stringent data protection requirements become cloud-native and migrate to Kubernetes on public clouds, they are wondering: “Is my data secure on this shared hardware? Can someone with host access snoop on my data?” Especially with the upcoming Digital Operational Resilience Act (DORA) in Europe mandating protection of data in use, it's crucial for users to familiarize themselves with solutions like Confidential Containers (CoCo), a CNCF sandbox project. In this first-of-its-kind hands-on workshop, we'll dive deep into using CoCo with k8s. We'll explore real-world challenges, such as ensuring data confidentiality from platform owners (cloud providers), and show you how to overcome them. Through practical exercises, you'll learn to set up CoCo and secure your containerized workloads, turning theory into practice. Attendees will discover streamlined practices, find robust protection mechanisms, and gain strategic insights into adopting CoCo.
Speakers
Suraj Deshmukh

Senior Software Engineer, Microsoft
Suraj has worked with Kubernetes since version 1.3. He organized the Kubernetes Bangalore meetup and helped bring Kubernetes to the masses. To make Kubernetes easier, he has worked on projects like Kompose, which converted docker-compose files to Kubernetes artifacts. He has spoken...

Archana Choudhary

Ms, Microsoft
A software engineer who has been exploring cloud-native technologies, particularly focusing on confidential containers over the past several months.
Wednesday November 13, 2024 2:30pm - 4:00pm MST
Salt Palace | Level 1 | Grand Ballroom G
  Tutorials, Security

3:25pm MST

A Tale of 2 Drivers: GPU Configuration on the Fly Using DRA - Alay Patel & Varun Ramachandra Sekar US, Nvidia
Wednesday November 13, 2024 3:25pm - 4:00pm MST
NVIDIA’s GeForceNow is a cloud gaming service that allows users to stream video games from NVIDIA's servers to a wide range of devices, including PCs, Macs, Android devices, iOS devices, and smart TVs. Under the hood, it is powered by Kubernetes running Kubevirt VMs. For a seamless user experience, GeForceNow dynamically switches GPU drivers to accommodate either passing through an entire GPU or slicing it into multiple virtual GPUs, all while keeping utilization close to 100% across the datacenter. This poses significant challenges when using the traditional device plugin API provided by Kubernetes. In this talk, we explore GeForce Now’s journey to transition away from the traditional device plugin API in favor of Dynamic Resource Allocation (DRA). We'll share valuable insights for anyone looking to perform a similar migration of their own. Join us to learn about the challenges, solutions, and best practices to help optimize your GPU-accelerated workloads in the cloud.
Speakers

Alay Patel

Senior Software Engineer, Nvidia
Alay is a Senior Software Engineer at Nvidia where he works on scaling up the workloads running on GPU Compute infrastructure. He is passionate about open source with a focus on Kubernetes and platform engineering.

Varun Ramachandra Sekar US

Senior Software Engineer, Nvidia
Developer by day, Dog whisperer by night.
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 2 | 255 B
  AI + ML

3:25pm MST

Using OpenTelemetry for Deep Observability Within Messaging Queues - Shivanshu Raj Shrivastava & Ekansh Gupta, SigNoz
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Recent changes in OpenTelemetry have introduced new semantic conventions and agent updates to better monitor messaging queues such as Kafka, RabbitMQ, and Amazon SQS. In this session, we'll discuss how these semantic conventions standardize the telemetry collected from producers, consumers, and the messaging queues themselves, and how in-depth observability can be achieved by correlating producer-to-consumer spans with the metrics collected from Kafka. Additionally, we will demonstrate how Kafka Java client instrumentation, combined with JMX metrics collected from Kafka, enables OpenTelemetry-based metrics-to-trace and trace-to-metrics correlation and helps spot the reasons for anomalies like increased consumer lag, partition failures, and time spent in the messaging queue. Surfacing the corresponding traces in time helps end users delve deeper into their infrastructure and optimize their asynchronous applications.
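As a small illustration of one anomaly signal discussed above, consumer lag is just the gap between each partition's log-end offset and the consumer group's committed offset, summed across partitions:

```python
# Consumer lag: messages produced but not yet consumed by a group.
# Partition names and offsets below are example values.

def consumer_lag(end_offsets: dict, committed: dict) -> int:
    """Sum of (log-end offset - committed offset) over partitions."""
    return sum(end - committed.get(p, 0)
               for p, end in end_offsets.items())
```

When this number grows while producer throughput is flat, the correlated producer-to-consumer spans show where the time is going.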
Speakers
Ekansh Gupta

SDE, SigNoz
Ekansh is a Software Development Engineer at SigNoz, with active involvement in various open-source and cloud native communities for upwards of two years now. He was previously an SDE Intern at SteamLabs. He has also given talks at PyCon, KubeCon, and MozFest. Ekansh...
Shivanshu Raj Shrivastava

Founding Engineer, SigNoz
Shivanshu is a Founding Engineer at SigNoz, working on building an OTel-native observability product. He has a keen interest in deep tech and OSS. He is a CNCF ambassador and a member of CNCF projects like OTel, k8s, and Istio. He has had the opportunity to mentor contributors in...
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | Grand Ballroom B
  Observability

3:25pm MST

Setting New Standards for Reliability in Cloud Native Multi-Region Applications - Trey Caliva, Global Payments
Wednesday November 13, 2024 3:25pm - 4:00pm MST
As a multinational FinTech provider, processing over 32 billion card transactions for 816 million accounts, Global Payments requires globally available architectures with quick disaster recovery while maintaining subsecond latencies. In addition, these workloads require strict adherence to compliance standards. This session will explore the high-level architectural decisions implemented in a cloud-native redesign and cloud migration of a mission critical legacy .NET application. Key cloud native tools leveraged include Kubernetes on GCP, and the use of CockroachDB as a cloud native database solution. We will explore how leveraging these cloud native technologies achieved extreme fault tolerance in a multi-region deployment, setting new standards for performance and reliability.
Speakers

Jim Hatcher

Solutions Engineer, Cockroach Labs
Trey Caliva

Principal Cloud Architect, Global Payments
Trey Caliva is an architect and engineer with 10+ years of hands-on experience planning, developing, managing, and securing deployments in Google Cloud and AWS. He is currently Principal Cloud Architect at Global Payments, a Fortune 500 company and a member of the S&P 500, focused...
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | 155 B
  Operations + Performance

3:25pm MST

Scale Job Triggering with a Distributed Scheduler - Cassie Coyle & Artur Souza, Diagrid
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Imagine scheduling thousands or millions of jobs that are persisted, triggered on time, and resilient to downtime. Some jobs might be triggered every second, while others need to be reliably triggered on the first day of the month. Achieving high throughput and reliability is critical for the performance and operational efficiency of modern distributed systems. How can traditional cron job scheduling be extended? How can distributed systems handle job scheduling with minimal downtime? What challenges arise when scaling job scheduling to thousands or millions of jobs? In this session, Artur and Cassie will delve into the design of Dapr's distributed Scheduler and how users can start using it today. You will gain a comprehensive understanding of how Dapr's Scheduler unblocks the scalability of actors and workflows while also enabling new capabilities, like delayed pubsub and the schedule job API.
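The core of timely triggering can be sketched with a min-heap keyed by next trigger time; persistence, sharding, and failover, the hard parts the session covers, are omitted, and the job names are made up:

```python
import heapq

# Minimal in-memory sketch of a trigger loop: a min-heap of
# (next_fire_time, job_name, repeat_interval) tuples. A None interval
# means a one-shot job. Job names and times are illustrative.

def pop_due(heap: list, now: float) -> list:
    """Return names of all jobs due at or before `now`,
    re-queueing recurring jobs at their next interval."""
    due = []
    while heap and heap[0][0] <= now:
        when, name, interval = heapq.heappop(heap)
        due.append(name)
        if interval:  # recurring job: schedule the next run
            heapq.heappush(heap, (when + interval, name, interval))
    return due

jobs = []
heapq.heappush(jobs, (1.0, "every-second", 1.0))
heapq.heappush(jobs, (60.0, "monthly-report", None))
```

A distributed scheduler must additionally persist this state and partition it across replicas so triggers survive restarts, which is where the design discussion begins.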
Speakers
Artur Souza

Head of Engineering, Diagrid
I have been a maintainer of Dapr since 2019, helped the project reach its 1.0 stable version, and have kept up frequent releases since then. I am currently Head of Engineering at Diagrid, leading the engineering teams building Conductor and the next generation of managed cloud native APIs via Dapr...
Cassie Coyle

Software Engineer, Diagrid
Cassie, a devoted software engineer at Diagrid, actively contributes to Dapr, focusing on Go backend development to simplify the creation of resilient, event-driven, and microservices-based apps. She is a member of the Dapr Day and AppDeveloperCon 2024 program committees. Her work...
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 2 | 250 AD
  SDLC

3:25pm MST

CEL-Ebrating Simplicity: Mastering Kubernetes Policy Enforcement - Kevin Conner, Getup Cloud & Anish Ramasekar, Microsoft
Wednesday November 13, 2024 3:25pm - 4:00pm MST
As Kubernetes deployments grow increasingly complex, robust policy enforcement is crucial. The Common Expression Language (CEL) provides a powerful solution, enabling the creation of sophisticated, human-readable expressions for Kubernetes policies. This session explores CEL's integration with Kubernetes, simplifying policy definition and enforcement. Key takeaways:
  - Fundamentals of CEL and its Kubernetes integration.
  - Practical use cases for CEL in admission control, resource management, and security.
  - Enhancing policy expressiveness and flexibility with CEL.
  - Introduction to CEL Playground for testing and validating CEL expressions.
Through live demos, learn to leverage CEL and CEL Playground for streamlined policy management in Kubernetes. Ideal for administrators, developers, and DevOps professionals, this session equips you to enhance your Kubernetes policies using CEL. Join us to discover how CEL and CEL Playground can transform your Kubernetes policy management.
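Since CEL expressions are evaluated by the API server itself, here is the logic of a typical (made-up) admission expression, `object.spec.replicas <= 5 && has(object.metadata.labels.team)`, re-expressed in Python purely to illustrate what such a policy checks:

```python
# Python stand-in for a made-up CEL admission expression:
#   object.spec.replicas <= 5 && has(object.metadata.labels.team)
# In a real cluster this string lives in a ValidatingAdmissionPolicy
# and is evaluated by the API server, not by user code.

def admit(obj: dict) -> bool:
    spec = obj.get("spec", {})
    labels = obj.get("metadata", {}).get("labels", {})
    return spec.get("replicas", 0) <= 5 and "team" in labels
```

CEL Playground lets you test the expression form directly against sample objects before deploying it.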
Speakers

Anish Ramasekar

Principal Software Engineer, Microsoft
Anish Ramasekar is a software engineer at Microsoft. He is on the Azure Container Upstream team building features for Kubernetes upstream and various CNCF projects that are part of the Azure Kubernetes Service. Anish is a maintainer of the Secrets Store CSI Driver project.
Kevin Conner

Chief Engineer, Getup Cloud
Kevin Conner is the Chief Engineer at Getup Cloud, a startup focused on Kubernetes and DevSecOps. He has worked at startups like Integrated Micro Products, Arjuna Technologies, JBoss, and Aviatrix, as well as Sun Microsystems and Red Hat, where he led teams for Cloud Enablement, Service...
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | 151 G
  Security

4:30pm MST

Making Kubernetes Simpler for Accelerated Workloads - Susan Wu, Google; Lucy Sweet, Uber; Mitch McKenzie, Weave; Aditya Shanker, Crusoe; Rebecca Weekly, Geico
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Kubernetes and the open-source ecosystem for AI frameworks have been great for LLM innovation, empowering developers to build applications that use natural language as the interface to data. Yet many developers and cluster operators struggle to put these frameworks into production use. In this session, hear from several platform engineers responsible for designing core infrastructure supporting accelerated workloads, services, and large language model training and inference pipelines. You can expect to come away with guidance, hear of pitfalls to watch out for, and learn how they successfully abstracted away the infrastructure complexity to improve their research users' experience and velocity. Panelists include: Lucy Sweet, Senior Software Engineer (Infrastructure), Uber; Mitch McKenzie, Site Reliability Engineer - Machine Learning Operations, Weave; and Susan Wu, Outbound Product Manager, Google.
Speakers
Susan Wu

Outbound Product Manager, Google
Susan is an Outbound Product Manager for Google Cloud, focusing on GKE Networking and Network Security. She previously led product and technical marketing roles at VMware, Sun/Oracle, Canonical, Docker, Citrix, and Midokura (part of Sony Group). She is a frequent speaker at conferences...
Lucy Sweet

Senior Software Engineer, Uber
Lucy is a Senior Software Engineer at Uber Denmark who works on software infrastructure.
Aditya Shanker

Senior PM, Crusoe
Aditya is a Senior PM at Crusoe Cloud, responsible for building out managed orchestration and platform services.
Salt Palace | Level 2 | 255 B
  AI + ML

4:30pm MST

Platform Performance Optimization for AI - a Resource Management Perspective - Antti Kervinen, Intel & Dixita Narang, Google
Wednesday November 13, 2024 4:30pm - 5:05pm MST
How much can node resource management affect AI workload performance? What options are there? What is the trade-off between total throughput and low latencies? In this talk, we take a systematic approach to Platform Performance Optimization. We walk through the whole path from goal setting through data gathering, analysis, and visualization to conclusions. At each stop along the path we share our practical experiences from a case of LLM inference optimization. You will find many considerations, findings, and practical tricks to take away: for instance, how to instrument PyTorch without touching the source or a container image, how to change what we are measuring without new expensive benchmark reruns, and how much more we can learn from visualizations compared to numeric averages and percentiles. Finally, we share real results from our case: how resource management increased total token throughput per worker node by more than 3.5x over the baseline.
Speakers
Antti Kervinen

Cloud Orchestration Software Engineer, Intel
Antti Kervinen is a Cloud Orchestration Software Engineer working at Intel, whose interest in Linux and distributed systems has led him from academic research of concurrency to the world of Kubernetes. When unplugged, Antti spends his time outdoors discovering wonders of nature.
Dixita Narang

Software Engineer, Google
Dixita Narang is a Software Engineer at Google on the Kubernetes Node team. With a primary focus on resource management within Kubernetes, Dixita is deeply involved in the development and advancement of the Memory QoS feature, which is currently in the alpha stage. She is a new contributor... Read More →
Salt Palace | Level 2 | 255 E
  AI + ML

4:30pm MST

From Observability to Performance - Nadia Pinaeva, Red Hat & Antonio Ojea, Google
Wednesday November 13, 2024 4:30pm - 5:05pm MST
No matter how fast the Services on your Kubernetes cluster are, users would love them to be faster. But how do you get from a huge pile of metrics across a distributed system to real user experience improvements? There is a way, and with the right tools and the right approach, you can better understand and evaluate Service performance. In this talk, you'll learn how to identify the performance parameters that directly translate to user experience. We will explore how to collect performance metrics from running Kubernetes clusters without disrupting normal operations using tools like Prometheus, Grafana, kube-burner, and custom instrumentation. We will discuss how to translate the collected metrics and analysis into concrete actions and how to identify bottlenecks and implement optimizations to enhance Service performance. This talk is ideal for k8s networking developers, administrators, SREs, DevOps engineers, and anyone responsible for managing or optimizing Kubernetes networking.
Speakers
Antonio Ojea

Software Engineer, Google
Antonio Ojea is a Software Engineer at Google, where he works on Kubernetes. He is one of the top contributors to the Kubernetes project, with a strong presence in the areas of networking and reliability. He has vast experience in Open Source, networking and distributed systems... Read More →
Nadia Pinaeva

Senior Software Engineer, Red Hat
Nadia Pinaeva is a Senior Software Engineer at Red Hat working on OpenShift Networking. She collaborates with SIG Network Policy to improve network security for Kubernetes clusters, and works on the ovn-kubernetes network plugin.
Salt Palace | Level 1 | 155 E
  Connectivity

4:30pm MST

Experience in Designing & Implementing a Cloud Native Framework for Farm Data Analytics - Braulio Dumba, IBM & Gloire Rubambiza, Cornell University
Wednesday November 13, 2024 4:30pm - 5:05pm MST
This work is based on 17 months of experience managing a digital agriculture platform that has aggregated and processed tens of gigabytes of data on 1,500 cows on a commercial dairy farm. Significant challenges tied to multi-cluster management, fault tolerance, and privacy surfaced as the number of applications and farm management models grew. To bridge this gap, we designed and implemented a cloud native networked system for multi-cluster configuration and management of farm data analytics that leverages KubeStellar and the Software-Defined Farm paradigm. Our experience from designing, implementing, and deploying this framework showcases how Kubernetes can enable farmers and agribusinesses to leverage the power of containerization and cloud-native computing to optimize workflows and streamline agricultural operations. This work presents progress toward cloud-native, scalable, and fault-tolerant data analytics in digital farming with potential environmental, financial, and societal impacts.
Speakers
Braulio Dumba

Staff Research Scientist, IBM
Dr. Braulio Dumba is a Staff Research Scientist at IBM Research. In 2018, he joined IBM under the Hybrid Cloud organization. His current research focuses on edge computing and hybrid cloud computing. Dr. Dumba earned a Ph.D. in Computer Science from the University of Minnesota, Twin... Read More →
Gloire Rubambiza

Ph.D. Candidate, Cornell University
Dr. Gloire Rubambiza is a postdoctoral associate in CS at Cornell University, where he conducts research in hybrid cloud computing for digital agriculture with an emphasis on societal impact. At Cornell, he was a University Fellow, a fellow of NSF National Research Traineeship in... Read More →
Salt Palace | Level 2 | 254 B
  Emerging + Advanced

4:30pm MST

Perform Laser Focused Deployments by Deciding in Advance the Blast Radius - Kostis Kapelonis, Octopus Deploy
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Progressive Delivery is an advanced deployment method that allows for zero-downtime application releases. Argo Rollouts is a Kubernetes controller that lets you adopt progressive delivery in the form of blue/green and canary deployments. We see a lot of teams choose an arbitrary number of clients that access the new version of a canary. Yes, it is very easy to send only 10% of the traffic to the new version of a Kubernetes Deployment. But sometimes you want to choose WHICH 10% sees the new version. In this talk we will look at several approaches to pinning specific clients to the old or new version, and advanced scenarios for sending canary traffic only to a specific subset of users, such as internal employees or customers who have expressed interest in seeing brand-new releases as soon as possible.
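As a concrete sketch of the "which 10%" idea (hedged: the names here are invented for illustration, and setHeaderRoute assumes a supported traffic router such as Istio plus a matching managedRoutes entry), an Argo Rollouts canary can first pin a specific client group to the new version via a request header, then widen to a blind traffic percentage:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-rollout
spec:
  strategy:
    canary:
      canaryService: demo-canary      # Services split by the traffic router
      stableService: demo-stable
      trafficRouting:
        managedRoutes:
        - name: internal-users        # route reserved for header-based pinning
        istio:
          virtualService:
            name: demo-vsvc
      steps:
      - setHeaderRoute:               # pin a chosen client group, not a blind 10%
          name: internal-users
          match:
          - headerName: x-internal-user
            headerValue:
              exact: "true"
      - pause: {}                     # hold while the pinned group validates
      - setWeight: 10                 # then widen to 10% of all traffic
      - pause: {duration: 1h}
```

With this shape, only requests carrying the header reach the canary during the first step; everyone else stays on the stable version until the weight-based steps begin.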
Speakers
Kostis Kapelonis

Developer Advocate, Codefresh by Octopus Deploy
Kostis is a software engineer/technical-writer dual class character. He lives and breathes automation, good testing practices and stress-free deployments with GitOps.
Salt Palace | Level 2 | 250 AD
  SDLC

4:30pm MST

Expanding the Capabilities of Kubernetes Access Control - Jimmy Zelinskie, authzed & Lucas Käldström, Upbound
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Kubernetes RBAC is an effective way of managing ACLs in one cluster. However, there are many other effective paradigms out there, such as Attribute- and Relation-based Access Control. In this talk, we'll demystify how these differ and when to use each paradigm, giving context and guidance. We'll highlight how Kubernetes access control has recently evolved toward supporting many different use cases. We take this opportunity to cover multiple perspectives: security within a single cluster (zooming in) and security within real-life production environments with external services and multiple clusters (zooming out). Just as containers became ubiquitous thanks to excellent tools like Docker, we believe the same can and will happen for access control, yielding uniform, interoperable, and understandable authorization. Finally, we'll propose future work that could be done to supercharge Kubernetes and ensure it keeps up with the ever-increasing security requirements in our industry.
Speakers
Lucas Käldström

Senior Software Engineer, Upbound
Lucas is a Kubernetes and cloud native expert who has been serving the CNCF community in lead positions for six years. He was awarded Top CNCF Ambassador 2017 alongside Sarah Novotny. Lucas was a co-lead for SIG Cluster Lifecycle, co-created kubeadm and Weave Ignite, and ported Kubernetes to... Read More →
Jimmy Zelinskie

Co-founder, authzed
Jimmy Zelinskie is a software engineer and product leader with a goal of democratizing software via open source development. He's currently CPO of authzed where he's focused on bringing hyperscaler best-practices in authorization to the industry at large. At CoreOS, he helped pioneer... Read More →
Salt Palace | Level 1 | 151 G
  Security

4:30pm MST

Tutorial: Get the Most Out of Your GPUs on Kubernetes with the GPU Operator - Eduardo Arango Gutierrez, Tariq Ibrahim, Amanda Moran & Christopher Desiniotis, NVIDIA; David Porter, Google
Wednesday November 13, 2024 4:30pm - 6:00pm MST
NVIDIA's GPU Operator has become the de facto standard for managing GPUs in Kubernetes at scale. This tutorial provides in-depth, hands-on training on the various GPU sharing techniques that are possible with the GPU Operator. Participants will learn to deploy jobs utilizing these sharing techniques and get hands-on experience with the installation and configuration of the NVIDIA GPU Operator itself. This includes an in-depth exploration of its two primary CRDs: ClusterPolicy and NVIDIADriver. These CRDs are essential for configuring GPU-accelerated nodes, enabling GPU sharing mechanisms, and performing GPU driver upgrades. The session will culminate with practical use cases, such as training an AI/ML model, giving participants firsthand experience in managing a GPU-accelerated Kubernetes cluster.
Speakers
Christopher Desiniotis

Senior Systems Software Engineer, NVIDIA
Christopher Desiniotis is a Senior Systems Software Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator, a widely used tool for managing GPUs in Kubernetes, and is focused on increasing... Read More →
David Porter

Staff Software Engineer, Google
David Porter is a Staff Software Engineer at Google on the Kubernetes node team. David's focus is on the kubelet node agent and the resource management area. He is the primary maintainer of cAdvisor, a resource monitoring library widely used in Kubernetes, a reviewer for SIG Node, and... Read More →
Eduardo Arango Gutierrez

Senior Systems Software Engineer, NVIDIA
Eduardo is a Senior Systems Software Engineer at NVIDIA, working on the Cloud Native Technologies team. Eduardo has focused on enabling users to build and deploy containers on distributed environments.
Tariq Ibrahim

Senior Software Engineer, NVIDIA
Tariq Ibrahim is a Senior Cloud Platform Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator. He has also contributed to several cloud native OSS projects like kube-state-metrics, Istio... Read More →
Amanda Moran

NVIDIA
Amanda has been working in technology since graduating from SCU in 2012 with a Master of Science in CS. Prior to this, she graduated with a BS in Biology from UW. Amanda has worked the last 12 years as a Software Engineer, a Solutions Architect, and an Engineering Manager... Read More →
Salt Palace | Level 1 | Grand Ballroom G
  Tutorials, AI + ML

5:25pm MST

Building Resilience for Large-Scale AI Training: GPU Management, Failure Detection, and Beyond - Ganeshkumar Ashokavardhanan, Microsoft & Ace Eldeib, Cohere
Wednesday November 13, 2024 5:25pm - 6:00pm MST
As AI training scales to thousands of GPUs across hundreds of machines, hardware failure becomes an expensive risk. From GPU faults to network performance degradation, undetected problems can sabotage training jobs, inflate costs, and slow development. This talk dives into failure and orchestration challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate the fallout of GPU failures. Drawing on experience from cloud providers and training large language models, we will share best practices for efficient identification, remediation, and prevention of GPU failures.
Speakers
Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this Kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine... Read More →
Ace Eldeib

Staff Software Engineer, Cohere
Ace is a Staff Software Engineer at Cohere working on training and serving infrastructure for large language models. Prior to that, he worked on Azure Kubernetes service and ran self-managed Kubernetes for other Azure services.
Salt Palace | Level 1 | 155 E
  AI + ML

5:25pm MST

Production AI at Scale: Cloudera’s Journey in Building a Robust Inference Platform - Zoram Thanga & Peter Ableda, Cloudera
Wednesday November 13, 2024 5:25pm - 6:00pm MST
In this session, we talk about Cloudera AI Inference Service, a secure, large-scale platform for generative AI and predictive inference workloads, built using state-of-the-art Kubernetes, CNCF, and Apache open source projects. We take the audience through our journey in building this platform and share the experiences we gained along the way. The platform is built with openness, security, scalability, performance, and standards compliance as guiding principles. We demonstrate that it is possible to be open and secure at the same time, and that organizations can incorporate production-grade AI inferencing into their Big Data environments. This session will cover the architecture of the platform and explain how we handle performance, scaling, authentication, fine-grained authorization, and audit logging, all of which are critical considerations for production inferencing.
Speakers
Peter Ableda

Director, Product Management, Cloudera
Peter Ableda is the Director of Product Management for Cloudera’s AI product suite, bringing over a decade of experience in data management and advanced analytics. Holding a Master of Science degree in Computer Science from the Budapest University of Technology, Peter has dedicated... Read More →
Zoram Thanga

Principal Engineer, Cloudera
Zoram is a Principal Engineer, Enterprise AI Platform, at Cloudera. He has been working in the software industry for over 23 years and has been involved in building clustering software, containers, file systems, analytical query engines, and ML/AI platforms. He is a committer in the... Read More →
Salt Palace | Level 2 | 255 E
  AI + ML

5:25pm MST

Creating Paved Paths for Platform Engineers - Ritesh Patel, Nirmata; Abby Bangser, Syntasso; Viktor Farcic, Upbound; Nicholas Morey, Akuity; Praseeda Sathaye, Amazon
Wednesday November 13, 2024 5:25pm - 6:00pm MST
The platform engineering team's role has evolved into a pivotal one as the custodian of the internal developer platform. However, these teams often find themselves in a quagmire of identifying the right components to include in their platforms, particularly in the ever-expanding CNCF landscape. This panel session discusses these challenges by exploring the concept of 'Paved Paths' as a strategic approach to guide platform teams in their journey of building an internal developer platform (IDP). 'Paved Paths' offers a solution by providing platform engineering teams with proven reference architectures (e.g. CNOE and the BACK Stack). This approach prevents them from starting from scratch and getting lost in the vast CNCF landscape. By offering proven and opinionated reference architectures, platform teams can focus on enhancing developer experiences and optimizing higher-level workflows rather than grappling with the complexities of identifying foundational components for their IDP.
Speakers
Viktor Farcic

Developer Advocate, Upbound
Viktor Farcic is a lead rapscallion at Upbound, a member of the CNCF Ambassadors, Google Developer Experts, CDF Ambassadors, and GitHub Stars groups, and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox.
Ritesh Patel

Co-Founder & VP Product, Nirmata
Ritesh Patel is Co-founder and leads Products at Nirmata, the creators of Kyverno. At Nirmata, he is responsible for commercial products for security and operations (SecOps) automation powered by policy as code. He also leads key technology partnerships. Ritesh has 20+ years of experience... Read More →
Praseeda Sathaye

Principal Specialist Solution Architect, Amazon (AWS)
Praseeda Sathaye is a Principal Specialist SA for App Modernization and Containers at Amazon Web Services based in Bay Area California. She has been focused on helping customers speed their cloud-native adoption journey by modernizing their platform infrastructure, internal architecture... Read More →
Nicholas Morey

Senior Developer Advocate, Akuity
Nicholas Morey is a Platform Engineer with a passion for DevOps practices. He is on the team at Akuity as a Developer Advocate, working with the community on anything Argo and Kargo-related. He is an experienced Argo CD operator and a Certified Kubernetes Administrator.
Abby Bangser

Principal Engineer, Syntasso
Abby is a Principal Engineer at Syntasso delivering Kratix, an open-source cloud-native framework for building internal platforms on Kubernetes. Her keen interest in supporting internal development comes from over a decade of experience in consulting and product delivery roles across... Read More →
Salt Palace | Level 1 | Grand Ballroom H
  Platform Engineering

5:25pm MST

Taming Your Application’s Environments - Marcos Lilljedahl, Dagger & Mauricio "Salaboy" Salatino, Diagrid
Wednesday November 13, 2024 5:25pm - 6:00pm MST
How coupled are your application's code and pipelines to its target cloud or on-prem environment? Kubernetes helps us abstract how we run our workloads. However, there are other aspects, like infrastructure dependencies, service configuration, the build process, and deployment descriptors, which need to be considered to make an application portable across multiple environments. Focusing on these aspects makes a big difference when migrating apps to reduce costs, meet compliance requirements, or leverage specific tech only available somewhere else. Join us to cover three techniques you can implement to level up your SDLC:
- Modularizing and enhancing your delivery pipelines to simplify complex environments (Crossplane and Dagger)
- Building consistent experiences around well-known interfaces (CloudEvents, Dapr, and OpenFeature) to minimize runtime drift
- Designing with separation of concerns to enable fast feedback loops between development and operations teams (Argo CD, Knative)
Speakers
Marcos Lilljedahl

Software Engineer, Dagger
Dad, Docker Captain, OSS lover, helmsman and wine drinker. Father of a joyful kid and wannabe surfer. I like listening to jazz music and tinker with some fun projects when possible. Avid open source contributor.
Mauricio Salatino

OSS Software Engineer, Diagrid
Mauricio works as an Open Source Software Engineer at @Diagrid, contributing to and driving initiatives for the Dapr OSS project. Mauricio also serves as a Steering Committee member for the Knative project and co-leads the Knative Functions initiative. He published a book titled... Read More →
Salt Palace | Level 2 | 250 AD
  SDLC

5:25pm MST

From Observability to Enforcement: Lessons Learned Implementing eBPF Runtime Security - Anna Kapuścińska & Kornilios Kourtis, Isovalent
Wednesday November 13, 2024 5:25pm - 6:00pm MST
eBPF is getting widely adopted in cloud native runtime security tools like Falco, KubeArmor, and Tetragon. Using eBPF we can collect relevant security events right in the kernel and pass them to Security Engineers for retroactive attack detection and response. Having reliable and complete visibility is great, but wouldn't it be even better to proactively prevent attacks in progress? This talk covers the Tetragon team’s experience moving from security observability to enforcement and lessons learned along the way: from defining security models to hardening interactions between the local kernel and distributed Kubernetes systems. It will deep dive into how eBPF-based enforcement works, why it differs from observability, and the challenges of implementing it. The audience will walk away understanding the inner workings and common pitfalls of eBPF-based runtime security.
Speakers
Kornilios Kourtis

Software Engineer, Isovalent at Cisco
I am a software engineer at Isovalent, working on cloud-native networking, security, and observability using eBPF. Before that, I worked in industrial (IBM) and academic research (ETH Zurich, NTU Athens) in systems, including operating systems, storage and network stacks, and high-performance... Read More →
Anna Kapuscinska

Software Engineer, Isovalent at Cisco
Anna is a software engineer at Isovalent, focusing on eBPF-based observability and security. Her previous roles span the industry: she wore both developer and SRE hats, and worked in AdTech, FinTech, public healthcare, end-user SaaS company and a hosting provider. On good weather... Read More →
Salt Palace | Level 1 | 151 G
  Security

6:00pm MST

🪧 Poster Session (PS01): Climatik: Cloud Native Sustainable LLM via Power Capping - Chen Wang, IBM & Vincent Hou, Bloomberg L.P.
Wednesday November 13, 2024 6:00pm - 8:00pm MST
As GenAI workloads grow, the need for advanced accelerators with higher power consumption is surging. NVIDIA GPU peak power has risen from 300W for the V100 to 1000W for the B100. However, current power infrastructure and cooling systems are not designed to handle rapid power increases, leading to challenges like limited accelerator deployment in some regions or overheating risks that could cause fire hazards. We propose Climatik, a dynamic power capping system that enables data center and cluster admins and developers to set power caps dynamically at the cluster, service namespace, and rack levels. Climatik leverages Kepler for observability and offers APIs for integration with Kubernetes control knobs, including autoscalers, schedulers, and queuing systems, to ensure power caps are maintained across all levels. We will demo how to use Climatik to configure power capping for a large language model (LLM) inference service on KServe and show how power capping influences KEDA autoscaling.
Speakers
Chen Wang

Senior Research Scientist, IBM
Chen Wang is a Staff Research Scientist at the IBM T.J. Watson Research Center. Her interests lie in Kubernetes, Container Cloud Resource Management, Cloud Native AI systems, and applying AI in Cloud system management. She is an open-source advocate, a Kubernetes contributor, and... Read More →
Vincent Hou

Senior Software Engineer, Bloomberg L.P.
Vincent Hou is a Chinese software engineer, who used to study in Belgium and is currently working in US. He has been an active open source contributor, since 2010. He used to be an active contributor to Cinder project, OpenStack block storage service, and a core committer of OpenWhisk... Read More →
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
  🪧 Poster Sessions, AI + ML

6:00pm MST

🪧 Poster Session (PS02): 0.0.0.0 Day: Exploiting Localhost APIs from the Browser - Mic McCully, Oligo
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Browser-based attacks are not new in the malicious landscape of attack patterns, and browsers remain a popular infiltration method for attackers. While seemingly local, services running on localhost are accessible to the browser through a flaw we found, exposing the ports on the localhost network interface and leaving the floodgates ajar to remote network attacks. In this live demo and attack simulation we'll unveil a zero-day vulnerability (still under responsible disclosure) in Chrome and other browsers, and show how we use the 0-day to attack developers behind firewalls. We will demonstrate remote code execution on a wildly popular open-source platform serving millions in the data engineering ecosystem that appears to run only on localhost. In our talk, we will present novel attack techniques targeting developers and employees within an organization who are behind firewalls. This will be a first-ever deep dive into this newly discovered zero-day vulnerability.
Speakers
Mic McCully

Field CTO, Oligo Security
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
  🪧 Poster Sessions, Security

6:00pm MST

🪧 Poster Session (PS03): Unleashing the Power of Init and Sidecar Containers in Kubernetes - Carlos Sanchez & Natalia Angulo, Adobe
Wednesday November 13, 2024 6:00pm - 8:00pm MST
This session dives deep into the power of init and sidecar containers, the issues they solve, and why they are so useful when managing Kubernetes workloads. We will explore real-world use cases that show how these tools can:
* Simplify complex deployments: break down intricate deployments into manageable steps.
* Enhance security: isolate security-critical tasks within your pods and ongoing security measures.
* Facilitate rapid and isolated changes: when everyone is interested in updating the same service, separation of concerns is critical for rapid development.
* Boost application functionality: utilize sidecar containers to inject essential functionalities like logging, monitoring, and networking capabilities without modifying your main application code.
Our goal is to share our experience and challenges managing thousands of environments in Kubernetes, how we manage init and sidecar containers, and what problems they solve for us.
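The pattern above can be sketched in a single Pod spec (a minimal illustration with invented names and images, not Adobe's actual setup): an init container performs one-off setup before the app starts, while a native sidecar, expressed as an init container with restartPolicy: Always (a beta feature since Kubernetes 1.29), runs for the Pod's whole lifetime to ship logs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-helpers
spec:
  initContainers:
  - name: fetch-config              # runs to completion before the app starts
    image: busybox:1.36
    command: ["sh", "-c", "wget -O /config/app.conf http://config-server/app.conf"]
    volumeMounts:
    - {name: config, mountPath: /config}
  - name: log-shipper               # native sidecar: restarts for the Pod's lifetime
    image: fluent/fluent-bit:3.0
    restartPolicy: Always
    volumeMounts:
    - {name: logs, mountPath: /logs}
  containers:
  - name: app
    image: example.com/app:1.0      # hypothetical application image
    volumeMounts:
    - {name: config, mountPath: /etc/app}
    - {name: logs, mountPath: /var/log/app}
  volumes:
  - {name: config, emptyDir: {}}
  - {name: logs, emptyDir: {}}
```

The separation of concerns is visible in the spec itself: the config-fetching and log-shipping teams can change their containers without touching the main application image.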
Speakers
Natalia Angulo

Software Developer Engineer, Adobe
Natalia Angulo is a Software Development Engineer at Adobe Experience Manager, contributing to Site Reliability tasks and the development of new features inside AEM, and specially helping with their infrastructure management. She is passionate about maths, coding puzzles and teaching... Read More →
Carlos Sanchez

Principal Scientist, Adobe
Carlos Sanchez is a Principal Scientist at Adobe Experience Manager, specializing in software automation, from build tools to Continuous Delivery and Progressive Delivery. Involved in Open Source for over 20 years, he is the author of the Jenkins Kubernetes plugin and a member of... Read More →
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session (PS04): Optimizing Pod Affinity in Kubernetes: A Mathematical Approach to Workload Placement - Jack Xue, Microsoft
Wednesday November 13, 2024 6:00pm - 8:00pm MST
A standout feature of Kubernetes is its sophisticated mechanism for pulling container images from repositories, aligning containers with the appropriate pods, and strategically deploying pods to nodes that meet their resource requirements, such as CPU, GPU, RAM, network, and storage. This process adheres to the defined affinity and anti-affinity specifications between pods and nodes. Despite these capabilities, optimally arranging a multitude of workloads, each comprising several pods within a cluster, remains an ongoing endeavor. In our research, we illustrate that a set of YAML files detailing a workload deployment request can be systematically transformed into a Binary Integer Linear Programming (BILP) model. Depending on the specific optimization goals, the objective function of the model can be tailored accordingly. With the imposition of broad conditions, it is feasible to derive an optimal solution that adheres to polynomial time complexity constraints.
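The transformation the poster describes can be illustrated with a toy model (a hedged sketch: the pod sizes, node capacities, and helper names are invented for illustration, not the authors' formulation). Each binary variable x[p][n] is 1 if pod p lands on node n; constraints enforce per-node capacity and anti-affinity, and a brute-force search over the binary assignments stands in for a real BILP solver:

```python
from itertools import product

# Toy pod-placement model (illustrative only; a real deployment would hand the
# same variables and constraints to a BILP solver rather than enumerating).
pods = {"web": 2, "db": 4, "cache": 1}      # CPU request per pod
nodes = {"n1": 4, "n2": 4}                  # CPU capacity per node
anti_affinity = [("web", "db")]             # pod pairs that must not share a node

def feasible(assign):
    # capacity constraint: total requests on each node stay within capacity
    for n, cap in nodes.items():
        if sum(pods[p] for p, a in assign.items() if a == n) > cap:
            return False
    # anti-affinity constraint: listed pairs must land on different nodes
    for a, b in anti_affinity:
        if assign[a] == assign[b]:
            return False
    return True

def best_placement():
    # objective: minimize the CPU imbalance between nodes
    best, best_cost = None, float("inf")
    for choice in product(nodes, repeat=len(pods)):
        assign = dict(zip(pods, choice))
        if not feasible(assign):
            continue
        used = [sum(pods[p] for p in assign if assign[p] == n) for n in nodes]
        cost = max(used) - min(used)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best

print(best_placement())
```

The objective function here balances load, but, as the abstract notes, it could equally be tailored to minimize node count or maximize headroom.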
Speakers
Jack Xue

Principal Cloud Solution Architect, Microsoft
PhD & MBA. Principal Cloud Solution Architect, Microsoft
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session (PS06): What's Happening with SPIFFE and WIMSE? - Daniel Feldman, Qusaic
Wednesday November 13, 2024 6:00pm - 8:00pm MST
This session will be a very brief overview of what's going on with the SPIFFE and WIMSE identity standards projects. SPIFFE is a CNCF effort to standardize workload identity implementations. That is, a SPIFFE implementation can grant services unique identities and credentials. WIMSE is an IETF effort to build on the SPIFFE foundation. In particular, it adds a new, unique token format that allows securely recording multi-hop identity information. Implementors will be able to use this token format to build complete, end-to-end, cryptographically auditable identity records.
Speakers
Daniel Feldman

Founder, Qusaic
Daniel Feldman has worked with many companies, large and small, to deploy SPIFFE and SPIRE zero-trust identity.
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
  🪧 Poster Sessions, Security

6:00pm MST

🪧 Poster Session (PS07): Unleashing the Power of Prediction to Proactively Scale Control Plane Components - Anubhav Aeron & Ryan Tay, Intuit
Wednesday November 13, 2024 6:00pm - 8:00pm MST
At Intuit, our control plane components such as IstioD are responsible for hundreds of applications per cluster: IstioD configures the data plane and injects the istio-proxy container. An increase in application traffic means an increase in application pods, which requires the control plane to scale up. For critical control planes such as IstioD, it is wise to scale proactively rather than as a reaction to increased load. With traditional approaches to scaling in advance, like tuning HPA thresholds, we might pre-scale even when not required due to outliers, which can be wasteful. At Intuit, a novel deep learning forecasting model called N-HiTS was employed to solve this issue. This session will discuss and demo how we train N-HiTS, our most important model features, and how we deploy our service on a per-cluster basis to provide contextualized predictions for cost-effective and on-time auto-scaling.
Speakers

Anubhav Aeron

Senior Staff SE, Coupang
Anubhav is a seasoned software engineer in the field of Cloud Native Technologies and has been working with Kubernetes and Service Mesh since 2016. He developed Redis Cluster as a Service and a Templating Engine while working at Yahoo! He is the lead maintainer of Admiral, which is an...

Ryan Tay

Software Engineer, Intuit Inc.
As a software engineer on the Service Mesh team at Intuit, Ryan works to support Intuit's extensive Istio deployment through contributions to projects like Admiral. He has previously worked to reduce costs of cloud development environments for the Intuit API Gateway team. His main...
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session (PS09): Kubernetes as a Geographically Distributed System - Chris Friesen, Wind River Systems
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Kubernetes was designed to be the best container orchestration platform on top of a cloud infrastructure in a single data center. What do you do when you want to grow your deployment across multiple geographical locations while still keeping it part of one system? You will have to face complexity and figure out infrastructure management at massive scale, and neither of these is easy to tackle. However, you don't have to go back to the drawing board, because a platform that delivers on these requirements and expectations already exists: StarlingX. The StarlingX project is a fully integrated, open source cloud platform running in production at large telecom operators, who rely on its distributed cloud architecture along with next-level container orchestration support provided by Kubernetes. This talk will introduce the StarlingX platform, share highlights from its latest release, and show how it takes Kubernetes to the next level!
Speakers

Chris Friesen

Member of Technical Staff, Wind River Systems
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session (PS10): Accepting Mortality: Strategies for Ultra-Long Running Stateful Workloads in K8s - Sebastian Beyvers & Maria Hansen, Giessen University
Wednesday November 13, 2024 6:00pm - 8:00pm MST
"Pods are mortal" is a well-known quote in the official Kubernetes documentation. For ultra-long running stateful workloads that take months to complete, this mortality comes with its own challenges. How do you react to hardware failures? What resource quotas are appropriate? What if the workload has no built-in persistence and does all its work in memory? For such workloads, failures can be fatal, potentially wiping out months of work. This session will show that despite all the obstacles, Kubernetes can still be a reasonable choice for running stateful workloads that take months to complete. Using real-world examples based on production workflows, we will show how we design, configure, run, and operate such workloads using K8s and Argo workflows. We will also show how intelligent checkpointing using CRIU can help us deal with failures and enables us to avoid some problems even before they occur.
Speakers

Sebastian Beyvers

Distributed Systems Researcher, Giessen University
Sebastian Beyvers is a distributed systems researcher in bioinformatics and a cloud-native Rust developer at Giessen University. Sebastian's current work focuses on cloud-native data storage and processing solutions that try to harmonize existing national and international data ecosystems...

Maria Hansen

Research Associate, Giessen University
Maria Hansen is a research assistant in the field of (bio)informatics at Justus Liebig University Giessen. She is currently working on a cloud-native data orchestration system that aims to unite existing national and international data ecosystems.
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
 
