KubeCon + CloudNativeCon North America 2024: Full Schedule

In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

arrow_back View All Dates

11:00am MST

Better Together! GPU, TPU and NIC Topological Alignment with DRA - John Belamaric, Google & Patrick Ohly, Intel

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 2 | 250 AD

AI/ML workloads on Kubernetes demand ultra-high performance. If your training or multi-GPU inference job spans nodes, your GPUs will use the network, talking through a NIC over local PCIe. But not all NICs are equal! To get the best performance, you need a NIC which is as "close" to the GPU as possible. Unfortunately, the Kubernetes extended resources API does not have enough information and does not give you control over which specific devices are assigned. Dynamic Resource Allocation, the successor API, gives you this power. Come to this session to learn about DRA, how it is improving overall device support in K8s, and how to use it to allocate multiple GPUs, NICs, and TPUs to get the maximum performance out of your infrastructure.

Speakers

Patrick Ohly

Principal Engineer, Intel

Patrick Ohly is a software engineer at Intel GmbH, Germany. In the past he has worked on performance analysis software for HPC clusters ("Intel Trace Analyzer and Collector") and cluster technology in general (PTP and hardware time stamping). Since January 2009 he has worked for Intel... Read More →

John Belamaric

Senior Staff Software Engineer, Google

John is a Sr Staff SWE, co-chair of K8s SIG Architecture and of K8s WG Device Management, helping lead efforts to improve how GPUs, TPUs, NICs and other devices are selected, shared, and configured in Kubernetes. He is also co-founder of Nephio, an LF project for K8s-based automation... Read More →

[PUBLIC] 2024 KubeCon NA Better Together! GPU, TPU and NIC Topological Alignment with DRA pdf

Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 250 AD

AI + ML

Content Experience Level Intermediate

11:00am MST

Securing Outgoing Traffic: Building a Powerful Internet Egress Gateway for Reliable Connectivity - Edie Yang & Akshita Agarwal, Airbnb

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 1 | 155 E

Concerned about secure and reliable outgoing traffic from your organization's mesh network? With the increasing demand to use external vendor apis for LLMs, along with vulnerabilities like Log4j, the need for preventing data exfiltration and maintaining strong safeguards is critical. But managing access to multiple external domains within the service mesh can be daunting. Discover the secrets behind building a powerful Internet Egress gateway using Istio and Envoy. This enlightening talk unveils a way to define fine-grained access policy to monitor and audit outgoing traffic from your mesh network. Besides, it demonstrates how to build a generic multi-tenant gateway that can be used across heterogeneous services and save years of repeated engineering work. By the end of the talk, attendees will gain an understanding of what an Internet Egress Gateway is, why it is necessary, and how they can configure it for their own services using the open-source Istio/Envoy based solution.

Speakers

Akshita

Senior Software Engineer, Airbnb

Akshita is a Senior Software Engineer at Airbnb working in the Service Mesh team which the handles interservice networking at scale. She currently is focused on designing a secure network edge solution at Airbnb. Previously she worked at Microsoft developing the Nginx Load Balancer... Read More →

Edie Yang

Senior Software Engineer, Airbnb

Edie is a Senior Software Engineer at Airbnb on the Cloud Infrastructure team which develops the Service Mesh system that powers the entire Airbnb stack. Edie has been working on developing service mesh API, service migration automation, Google IAP-based ingress gateway and internet... Read More →

Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 1 | 155 E

Connectivity

Content Experience Level Intermediate

11:00am MST

How We Scale a Distributed SQL Database to 1 PB - Jinpeng Zhang, PingCAP

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 1 | Grand Ballroom A

TiDB is a distributed SQL database that we built to solve the scalability problems of traditional SQL databases such as MySQL and PostgreSQL. Using TiDB, users do not need to shard their data across multiple MySQL or PostgreSQL database instances, nor do they need to sacrifice some key database features such as JOIN and transactions. Users only need to add storage nodes and computing nodes to the cluster as needed. However, we also encountered many scalability challenges when building TiKV - the stateful storage layer of TiDB. Challenges such as workload skew issues making it difficult to scale performance, management challenges of millions of dynamic data partitions, latency impact during scaling, interference between different workloads when consolidating multiple workloads into the same cluster, etc. In this talk, I will provide an in-depth look at these challenges and our solutions.

Speakers

Jinpeng Zhang

Director of Engineering, PingCAP

Director of Engineering at PingCAP, TiKV maintainer and committer, RocksDB contributor, the author of "MariaDB Principles and Implementation". Mainly engaged in the design and development of cloud-native large-scale distributed storage systems, data platforms, 10+ years of experience... Read More →

Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 1 | Grand Ballroom A

Data Processing + Storage

Content Experience Level Intermediate

11:00am MST

Upgrade Safely: Avoid the Pitfalls of Kubernetes Versioning - Rob Scott, Google

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 2 | 254 B

Have you ever upgraded a cluster or controller only to realize everything was broken due to some kind of versioning mismatch? Do you remember the pain of upgrading to a new Kubernetes API version like Ingress v1? Do you get a little twinge any time you see a feature or API deprecated in release notes? This is the talk for you. Kubernetes versioning is surprisingly complex and widely misunderstood. This talk will cover all the relevant versioning concepts, from storage versions to feature gates. It will show how they interact with each other, and how you can use this information to safely and confidently upgrade your clusters and controllers. This talk will provide real examples of how versioning mixups can lead to broken clusters and downtime. You’ll learn exactly how you can avoid each of these potential failure modes, and gain some insights into how API and Controller authors are trying to minimize the impact of these kinds of changes in the future.

Speakers

Rob Scott

Software Engineer, Google

Rob is an open source enthusiast currently working on Kubernetes Networking at Google. He's been a maintainer of Gateway API since the very early days of the project and led the development of other Kubernetes networking APIs like EndpointSlices.

KubeCon SLC Upgrade Safely Avoid the Pitfalls of Kubernetes Versioning pdf

Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 254 B

Operations + Performance

Content Experience Level Intermediate

11:00am MST

Share the Ride: Robust Multi-Tenancy in Kubernetes at Uber - Sashank Appireddy & Apoorva Jindal, Uber

Friday November 15, 2024 11:00am - 11:35am MST

Salt Palace | Level 2 | 251 AD

Multi-tenancy in Kubernetes involves the coexistence of multiple users or teams (tenants) on a single Kubernetes cluster while ensuring isolation, security, and performance. Our use cases at Uber span from scenarios with disruptive neighbors to those with large container sizes, specialized hardware, sticky placement preferences, and dynamic resource scaling demands, necessitating robust isolation measures. In this proposal, we present a comprehensive exploration of multi-tenancy in Kubernetes, covering strategies, the challenges we have faced and the effective solutions implemented to overcome them at Uber. Further, we will deep dive into the key aspects of building and managing multi-tenant Kubernetes clusters, by establishing strong tenant boundaries leveraging the ideas around node pools and tightly integrating with namespaces.

Speakers

Apoorva Jindal

Senior Staff Software Engineer, Uber Inc

Apoorva Jindal is working as Senior Staff Software Engineer at Uber. At Uber, he leads the Compute platform which powers all stateless and batch containerized workloads at Uber.

Sashank Reddy

Staff Software Engineer, Uber Technologies Inc

I am software engineer with over a decade of experience specializing in containerization and distributed systems. As a Staff Software Engineer in the container platform team at Uber Technologies Inc, I lead the design, development and deployment of scalable multi-tenant architecture... Read More →

KubeCon2024 MultiTenancy pdf

Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 251 AD

Platform Engineering

Content Experience Level Intermediate

11:55am MST

Improving Service Availability: Scaling Ahead with Machine Learning for HPA Optimization - Avni Sharma & Estela Ramirez, Intuit

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 2 | 255 E

In this talk, we will explore employing machine learning (ML) algorithms to enhance the Kubernetes autoscaling capabilities beyond the traditional, reactive horizontal pod autoscaler (HPA). Attendees will be introduced to how to leverage recommendation algorithms to predict future load and usage patterns, allowing for smarter, proactive scaling decisions. This approach not only ensures high availability and responsiveness of applications but also offers a pathway to substantial cost optimizations by preventing over-provisioning and minimizing resource wastage.

Speakers

Avni Sharma

Product Manager, Intuit

Avni is a Product Manager at Intuit, working on Intuit’s Modern SaaS Kubernetes platform. She also worked on ArgoCD as a PM. Avni is passionate about Developer tooling and strives to make developers' life easier by delivering them delightful experiences. She is also an Open Source... Read More →

Estela Ramirez

Software Engineer, Intuit Kubernetes Service, Intuit

Estela is a Software Engineer at Intuit focusing on Intuit Kubernetes Developer Platform. She works on abstracting the autoscaling for developers.

Improving Service Availability Scaling ahead pdf

Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 2 | 255 E

AI + ML

Content Experience Level Intermediate
Presentation Slides Attached Yes

11:55am MST

Seeing Double? Implementing Multicast with eBPF and Cilium - Louis DeLosSantos, Isovalent at Cisco

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | 155 E

Multicast is a popular networking technology used in finance, telecommunications, and media CDNs, among others to efficiently replicate and deliver data streams to multiple clients. However, this advantage can be overshadowed by the complexity involved in configuring the necessary infrastructure leaving the overworked platform team rather than the end users seeing double. To combat this complexity, Cilium explored using eBPF to implement pod-to-pod multicast delivery within a Kubernetes cluster. This talk will provide both a high and low level understanding of how eBPF can be used to implement multicast delivery. It will discuss how Cilium’s multicast works and the hurdles faced by the project along the way. By the end of this talk the audience will have a better understanding of how multicast functions, how eBPF can be used in-place of traditional multicast infrastructure, and how Cilium can be used as a multicast-enabled CNI, letting your audience - and not you- see double.

Speakers

Louis De Los Santos

Louis DeLosSantos, Isovalent at Cisco

Louis DeLosSantos is a multi-disciplined technologist who has worn network, systems, and software engineer hats at various times. Presently he works at Isovalent at Cisco where he focuses on Linux Kernel networking and implementing eBPF datapath networking solutions.

Seeing Double? Implementing Multicast with eBPF and Cilium pdf

Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | 155 E

Connectivity

Content Experience Level Intermediate

11:55am MST

Kubernetes on Multisites – A Story About Stateful App, Hybrid Clouds, and High Availability - Florian Coulombel, Dell Technologies & Jan Šafránek, Red Hat

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | Grand Ballroom A

The day has come! Kubernetes has won the hearts and minds of your leadership and entire organizations, and everyone wants to benefit. Projects are launched to migrate legacy apps, run proprietary systems, and even use virtual machines in your Kubernetes infrastructure! But wait a minute. VMs and good' ol RDBMS are not microservices developed with 12 factors in mind where data is either hosted on an external service or replicated by the application. How are we going to warranty the availability of these applications and systems? Do I need to do a backup of these things? What if my business is fragmented across edge, on-prem, and public clouds? Members from SIG Storage will guide you through the options to compose with, including the latest CSI features, Kubernetes architecture design, and even hardware solutions. We will evaluate the benefits to consider and the pitfalls to avoid when implementing stateful workloads in Kubernetes on multiple sites.

Speakers

Jan

Software Engineer, Red Hat

Jan is a Senior Principal Software Engineer at Red Hat working on storage aspects of Kubernetes. He started developing Kubernetes more than 8 years ago, and is one of the founding members of SIG-Storage. He’s the author of PersistentVolume controller, dynamic provisioning and StorageClass... Read More →

Florian Coulombel

Senior Software Engineer, Dell Technologies

Father of 2, living in France. Nerd since 1996 when Quake alpha version leaked, Linux user since 2001, Kubernetes enthusiast since 2016, member of Kubernetes SIG Storage since 2023.

KCNA24 k8s on multisites pdf

Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | Grand Ballroom A

Data Processing + Storage

Content Experience Level Intermediate

11:55am MST

Love thy (Noisy) Neighbor: Strategies for Mitigating Performance Interference in Cloud-Native Systems - Jonathan Perry, PerfPod

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | 155 B

In cloud-native environments, application performance often degrades due to contention over shared resources such as CPU caches and memory bandwidth. Current container technologies lack mechanisms to isolate these resources, which compels operators to maintain low utilization by scaling out their deployments. This session explores strategies used by hyperscalers like Google, Microsoft, Facebook, and Alibaba to mitigate such performance interference. We will review their published methodologies, extracting key principles that could guide the development of a Kubernetes-native performance isolator. Participants will gain insights into the design trade-offs and operational impacts of these tools. Additionally, we will discuss integration strategies for deploying such isolators in existing Kubernetes environments, aiming to optimize resource utilization while preserving application performance.

Speakers

Jonathan Perry

Founder & CEO, PerfPod

Jonathan Perry is a maintainer of the OpenTelemetry eBPF network collector. His PhD research at MIT CSAIL focused on performance isolation in datacenter and cloud networks, aiming to enhance network efficiency and reduce latency. Jonathan founded Flowmill, where he developed eBPF-based... Read More →

Slides Kubecon NA'24 Love thy (Noisy) Neighbor pdf

Transcript and Slides Love thy (Noisy) Neighbor pdf

Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | 155 B

Operations + Performance

Content Experience Level Intermediate

11:55am MST

What Containerd 2.0 Means for You - Samuel Karp, Google

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 2 | 254 B

containerd 2.0 is the first major new version of containerd since 1.0.0 was released in 2017. This new version of containerd introduces new features, new extension points, and new backends for image operations and CRI with the goal of increased flexibility and better efficiency for certain types of workloads. containerd 2.0 also removes some previously-deprecated features in favor of modern replacements. This talk will discuss how to prepare for containerd 2.0 in your production environments, including strategies for incorporating containerd 2.0's new functionality and detecting/remediating any impact of removed features prior to upgrading.

Speakers

Samuel Karp

Staff Software Engineer, Google

Samuel Karp is a containerd maintainer and a Staff Software Engineer at Google, focused on the container runtime for Google Kubernetes Engine. Sam has been involved in the container ecosystem since 2014 and serves as the Chair of the Open Container Initiative's Technical Oversight... Read More →

[Export] What containerd 2.0 means for you pdf

Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 2 | 254 B

Operations + Performance

Content Experience Level Intermediate

11:55am MST

Still Don't Do What Charlie Don't Does - Making CRD Changes Safer - Nick Young, Isovalent

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 2 | 251 AD

Many Kubernetes installations use controllers that include Custom Resource Definitions (CRDs) to extend their capabilities. However, because CRDs can only have one version installed in a cluster at any one time, version and change management can be very difficult. This talk will benefit both controller implementers and users. For implementers, I have tips on how to more safely make API changes to their CRDs, and for CRD users, some tips on what to look out for when installing CRD updates. All of this is based on using experience from projects like Contour, Gateway API, and Cilium among others. Learn things like: Different CRD version management strategies - what’s worked and what hasn’t How to make schema changes like pluralizing a field or changing field validation in a safe way How not to make the same mistakes I did Expect to come away from this talk having learned from my painful experiences handling CRD changes badly, but also having heard a bunch of Simpsons references.

Speakers

Nick Young

Senior Software Engineer, Isovalent at Cisco

Nick has been working to prevent the entropic downfall of systems for 25 years, across datacenters, clouds, networking, and others. He's a Staff Engineer at Isovalent, and a maintainer on the Kubernetes Gateway API project, where he works on improving the ingress and mesh experiences... Read More →

Still Don't Do What Charlie Don't Does.pptx pdf

Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 2 | 251 AD

Platform Engineering

Content Experience Level Intermediate

11:55am MST

Rogue No More: Securing Kubernetes with Node-Specific Restrictions - Anish Ramasekar, Microsoft & James Munnelly, Apple

Friday November 15, 2024 11:55am - 12:30pm MST

Salt Palace | Level 1 | 151 G

Did you know that a component running across multiple nodes, such as in a daemonset, intended to perform node-specific actions, can pose a significant security risk? If any node the component is running on goes rogue, it can lead to attacks on the cluster, or even worse, a complete takeover of it. What if we could restrict the component's ability to write resources only to those belonging to the node it is running on to prevent such escalation attacks? In this talk, Anish and James will introduce new Kubernetes security enhancements to bound service account tokens, which can be used with validating admission policies to enforce per-node restrictions on service accounts. This session will provide you with practical implementation guidelines and show you how these enhancements can mitigate risks and protect your infrastructure with robust node isolation.

Speakers

James Munnelly

Staff Field Engineer, Apple

James Munnelly is a Field Engineer at Apple, helping customers adopt and adapt Kubernetes, and driving adoption of OSS cloud native technologies. James is also the founder of the cert-manager project, a Kubernetes extension for managing x509 certificates. He's an active member of... Read More →

Anish Ramasekar

Principal Software Engineer, Microsoft

Anish Ramasekar is a software engineer at Microsoft. He is on the Azure Container Upstream team building features for Kubernetes upstream and various CNCF projects that are part of the Azure Kubernetes Service. Anish is a maintainer of the Secrets Store CSI Driver project.

Rogue No More Securing Kubernetes with Node Specific Restrictions pdf

Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | 151 G

Security

Content Experience Level Intermediate

2:00pm MST

Bloomberg’s Journey to Improve Resource Utilization in a Multi-Cluster Platform - Yao Weng & Leon Zhou, Bloomberg

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 2 | 250 AD

Bloomberg provides an on-premises Data Science Platform (DSP) using cloud-native software to support internal AI model training. It runs on Kubernetes clusters spanning multiple data centers and featuring a diverse range of GPU types. However, managing such a large-scale and heterogeneous GPU environment poses many challenges, such as improving resource utilization, reducing operational costs, and scheduling workloads across different GPU types. In collaboration with the Karmada community, Bloomberg's DSP team has aimed to tackle these challenges by addressing multi-cluster batch job management problems. This talk will delve into the approaches the team has adopted, including: - Intelligently scheduling GPU workloads across multiple clusters - Using Karmada's resource interpreter to support Kubernetes Custom Resource Definitions (CRDs) on top of a multi-cluster architecture - Building a highly available Karmada control plane - Establishing a consistent training job submission interface

Speakers

Leon Zhou

Software Engineer, Bloomberg

Leon Zhou is a software engineer on the Data Science Platform engineering team at Bloomberg. With prior NLP experience, he is now building ML platforms to facilitate machine learning development. He is interested in ML infrastructure to enable large-scale training and complex pipelines... Read More →

Yao Weng

Senior Software Engineer, Bloomberg

Yao Weng is a Senior Software Engineer on Bloomberg’s Data Science Platform engineering team. She has contributed extensively to optimizing the company’s Kubernetes environment for high performance compute, model inference, and workflow orchestration. Yao Weng obtained her Ph.D... Read More →

Kubecon NA 2024 Slides Bloomberg's Journey to Improve Resource Utilization in a Multi Cluster Platform pptx

Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 2 | 250 AD

AI + ML

Content Experience Level Intermediate

2:00pm MST

Testing Kubernetes Without Kubernetes: A Networking Deep Dive - John Howard, Solo.io

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 1 | 155 E

There are few things more tedious than waiting for a long end-to-end test to run. Waiting for a new cluster to spin up, images to build and push - not to mention things like debugging or running on slow internet connections. Unfortunately, these complex setups are hard to avoid, especially if we are testing things deeply integrated into Kubernetes networking, such as CNIs, kube-proxy, services meshes, and more. It doesn't have to be this way! In this talk, I will give a deep dive on how we built out our testing strategy for our Kubernetes networking proxy to not really depend on Kubernetes (or docker, or root). In doing so, I will not only offer a glimpse behind the scenes of Istio development, but also give viewers a deeper understand of how the fundamentals of Kubernetes (Linux primitives like namespaces) work, and how they can be effectively used to improve tests in the Istio ecosystem and beyond.

Speakers

John Howard

John Howard, Solo.io

John Howard is a Senior Architect at Solo.io and Istio Technical Oversight Committee member.

KubeCon NA 24 Testing Kubernetes without Kubernetes pdf

Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 1 | 155 E

Connectivity

Content Experience Level Intermediate

2:00pm MST

Object Storage Is All You Need - Justin Cormack, Docker

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 1 | Grand Ballroom A

When Jeff Bezos commissioned Amazon S3 he called it "malloc for the web"; since then many people have considered cloud object storage to be a weird kind of non Posix filesystem, but also a great backing store for websites or storing lots of data. Recently more and more applications are being built with object storage as the entire persistence layer. This started with analytics databases such as Snowflake and Databricks, and the open source Delta Lake and Apache Iceberg projects. More recently the use is spreading to even more applications, from observability to streaming data and more. In this talk we look at why it is becoming so popular, the benefits, downsides and performance characteristics, and how and when to use it effectively.

Speakers

Justin Cormack

CTO, Docker

Justin is the CTO of Docker, recently a member of the CNCF TOC, and has been working in the container ecosystem and in supply chain security for many years.

Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 1 | Grand Ballroom A

Data Processing + Storage

Content Experience Level Intermediate

2:00pm MST

Faster Containerized LLM Serving via Knowledge Sharing - Junchen Jiang, University of Chicago & Zhou Sun, Mooncake Labs

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 2 | 255 B

Imagine once an LLM learns something from a document, the knowledge can be instantly shared with other LLMs. Unfortunately, today, LLMs must read the same document multiple times, causing a significant slowdown. This session will introduce a new KNOWLEDGE-SHARING system that enables LLMs to share their digested knowledge, in the form of KV caches, so only one LLM needs to process each document. The key challenge is how to store the KV caches cheaply and serve them quickly. Instead of keeping the KV caches of all reusable chunks in GPU/CPU memory, we show a DEMO that with careful implementation on Kubernetes, storing them on cheaper devices is not only economically superior but also delivers significant reductions in LLM serving delay, especially the time to the first token.

Speakers

Junchen Jiang

Professor, University of Chicago

Junchen Jiang is an Assistant Professor of Computer Science at the University of Chicago. He works at the intersections between networked systems and machine learning. He received his Ph.D. from CMU in 2017 and his bachelor’s degree from Tsinghua in 2011. He has received a Google... Read More →

Zhou Sun

CEO, Mooncake Labs

Mooncake Labs is working on the next generation of stateless data architecture, bringing database performance and functionality to structured and unstructured data in datalakes and raw datasets. Previous I lead the query team at SingleStore (cloud-native distributed HTAP database... Read More →

KCNA24 Talk Slides pdf

Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 2 | 255 B

Emerging + Advanced

Content Experience Level Intermediate

2:00pm MST

Supercharge Your Kubernetes Autoscaling with Custom Metrics - Vamshi Krishna Samudrala & Sravan Akinapally, American Airlines

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 1 | 155 B

Out-of-the-box, Kubernetes provides native horizontal scaling capabilities driven by conventional resource consumption signals like CPU and memory utilization. However, in the real world, numerous applications demand dynamic scaling orchestrated by custom business telemetry such as queue depths, throughput volumes, or other domain-specific indicators. This session will unravel the secrets of extending Kubernetes' Horizontal Pod Autoscaler (HPA) to leverage custom metrics as scaling triggers, unlocking unprecedented scaling autonomy. Attendees will witness live demos showcasing: Deploying a custom metrics provider to expose application-centric metrics to the Kubernetes control plane Configuring the HPA to consume these custom metrics for intelligent scaling decisions A sample application dynamically scaling based on a custom metric like queue length or requests per second Best practices for crafting bespoke scaling policies tailored to custom metrics.

Speakers

Vamshi krishna Samudrala

Enterprise Cloud Architect, American Airlines

Enterprise Architect with a distinguished career spanning 14 years in the fields of DevOps and Cloud Architecture. Focused on automation, configuration management and innovation with cutting-edge technologies.Worked extensively with leading cloud service providers, including Amazon... Read More →

Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 1 | 155 B

Operations + Performance

Content Experience Level Intermediate

2:00pm MST

Micro-Segmentation and Multi-Tenancy: The Brown M&Ms of Platform Engineering - Jim Bugwadia, Nirmata & Rachael Wonnacott, Fidelity International

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 1 | Grand Ballroom H

A key requirement for internal developer platforms is that they serve multiple workloads. The reality of platform engineering is that while it seeks to lower the barrier to entry for teams to deliver applications, it must also balance cost and ensure appropriate levels of security. It’s therefore essential to consider how application components running on shared infrastructure are allowed to communicate with each other and weigh up the cost of each architecture. In industry, we have seen differing approaches to deploying Kubernetes to achieve these goals, from multiple single-tenant clusters through to shared clusters that deliver namespaces-as-a-service. Rachael and Jim will define the concepts of multi-tenancy and micro-segmentation for cloud native systems, explain why they are critical to success with platform engineering. They will also show real-world examples of how they can be implemented, and demonstrate full automation using best practices like GitOps and Policy as Code.

Speakers

Jim Bugwadia

Co-founder and CEO, Nirmata

Jim Bugwadia is a co-founder and the CEO of Nirmata, the Kubernetes policy and governance company. Jim is an active contributor in the cloud native community and currently serves as co-chair of the Kubernetes Policy and Multi-Tenancy Working Groups. Jim is also a co-creator and maintainer... Read More →

Rachael Wonnacott

Technical Product Owner, Kubernetes Platform, Fidelity International

Rachael has spent the last decade focused on platform engineering. She places a conscious emphasis on improving flow and is on the quest to smooth the application lifecycle for developers in the enterprise. With a background in astrophysics, Rachael brings her scientific approach... Read More →

KCNA24 Brown M&Ms of Platform Engineering Nov 7 2024 pdf

Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 1 | Grand Ballroom H

Platform Engineering

Content Experience Level Intermediate

2:00pm MST

The Missing Talk About API Versioning & Evolution in Your Developer Platform - Stefan Schimanski, Upbound & Sergiusz Urbaniak, Independent

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 2 | 251 AD

In the realm of developer platforms, individuals without extensive experience in the cloud-native ecosystem are now venturing into the creation of Kubernetes-based APIs. Tools like Crossplane are transforming every platform engineer into an API designer. Ten years in, the ecosystem still offers little guidance on Kubernetes versioning and API evolution in practice. A naive understanding is not helpful, and many have been burned by relying on intuition. This talk will provide deep, yet applicable knowledge, starting from the first principles of the invariants to maintain when changing APIs in Kubernetes. It will cover tools like schemas, conversion, validation, and admission, and present very concrete and directly applicable API Evolution Patterns. These patterns will help navigate the life cycle of CRD-based projects. This talk aims to educate on how to evolve APIs effectively and safely without inadvertently breaking users.

Speakers

Sergiusz Urbaniak

Team Lead - Kubernetes, https://mongodb.com

Sergiusz is a Kubernetes Team Lead at MongoDB. He is enthusiastic about modern infrastructure software while still enjoying minimalistic networking techniques like morse code. He worked on Mesos, container runtimes, Prometheus Operator, Thanos, upstream Kubernetes, Operators, and... Read More →

Stefan Schimanski

Senior Principal Software Engineer, Upbound

Stefan is a Senior Principal Engineer at Upbound working on control planes, Kubernetes, kcp, and as a tech-lead in Sig API Machinery. He contributed a major part of the CRD feature set. Stefan is a 2nd time GoogleSummer of Code mentor with CNCF, loves to teach and help people to learn... Read More →

Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 2 | 251 AD

Platform Engineering

Content Experience Level Intermediate

2:00pm MST

The Policy Engines Showdown - Gabriel L. Manor, Permit.io; Andres Aguiar, Okta; Omri Gazitt, Aserto; Pauline Jamin, Agicap; Tyler Schade, Geico; Joy Scharmen, StrongDM

Friday November 15, 2024 2:00pm - 2:35pm MST

Salt Palace | Level 2 | 254 B

OPA, Cedar, OpenFGA, Topaz, OPAL, OSO, should I continue? Policy engines, languages, and standards are everywhere, making the decision for a good decision engine increasingly difficult. In this panel, I'll host four talented engineers, each from a different policy engine's core team, for a friendly showdown. We will assist the audience in making the most important decision - choosing a suitable and fitting decision engine for their specific use case. We will also delve into the nuances of running multiple engines together and learn how to scale them properly.

Speakers

Pauline Jamin

Staff Software Engineer, Agicap

Staff software engineer with a love for Domain-Driven Design (DDD) and back-end development. Skilled in leading teams and embracing the Site Reliability Engineering (SRE) philosophy. When not crafting code, you'll find me exploring the great outdoors with my loyal dog. Catch me sharing... Read More →

Tyler Schade

Distinguished Engineer, GEICO

Living in Miami, Florida, I'm an engineering lead at GEICO working on service mesh and traffic management. Prior to joining GEICO, I was at Solo.io, working on multi-cluster service mesh and API gateways. I love learning more about networking and distributed systems and sharing what... Read More →

Joy Scharmen

Senior Director, Infrastructure Engineering, StrongDM

Passionate about infrastructure, and I love learning. Tell me about the great ideas you have for building scalable sustainable humane systems!

Gabriel Manor

Director of DevRel, Permit.io

Gabriel is a senior full-stack developer who blends his passion for technical leadership, security, authorization, and devtools into his current role as the Head of Growth and DevRel at Permit.io. Before joining Permit.io, Gabriel worked as a technical leader and principal engineer... Read More →

Omri Gazitt

Co-founder & CEO, Aserto

Omri is the co-founder/CEO of Aserto, an authorization startup, and his third entrepreneurial venture. He's spent the majority of his 30-year career working on developer and infrastructure technology, most recently as the CPO of Puppet. Previously he was the VP and GM of HP's Cloud... Read More →

Andres Aguiar

Product Manager, Okta

Andres has spent his 20+ year career building tools for developers, wearing different hats. He’s been working on the identity space for the last 6 years, and is currently the Product Manager for OpenFGA.

talk pdf

Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 2 | 254 B

Security

Content Experience Level Intermediate

2:00pm MST

Tutorial: Simplify and Optimize Your YAML with YAMLScript - Ingy döt Net, YAML LLC

Friday November 15, 2024 2:00pm - 3:30pm MST

Salt Palace | Level 1 | Grand Ballroom G

Nobody likes YAML (or anything for that matter) when its a giant and repetitive mess. Of course, there are already existing technologies like Helm and Kustomize that help provide make YAML nicer for Kubernetes. The new kid on the block is YAMLScript. Being a complete programming language (built over a vast and mature ecosystem) its capabilities are effectively limitless. That said, its primary focus is on refactoring and improving existing and new large YAML configurations. YAMLScript can help you make the most of YAML in any domain; even those that already make great use of Helm and Kustomize. Having been created by an original inventor and current lead maintainer of the YAML data language (Ingy döt Net) you can count on it meshing well with the YAML you already know. In this hands on interactive tutorial, Ingy will teach you how to make the most of YAML and YAMLScript.

Speakers

Ingy döt؜؜ Net

Ingy döt Net, YAML LLC

Ingy döt Net is one of the original inventors of the YAML data language, and its primary maintainer. He has continuously contributed to Open Source efforts since before it was called Open Source. His passion is creating software libraries that work in as many programming languages... Read More →

Friday November 15, 2024 2:00pm - 3:30pm MST
Salt Palace | Level 1 | Grand Ballroom G

Tutorials, SDLC (Software Development Lifecycle)

Content Experience Level Intermediate

2:55pm MST

Thousands of Gamers, One Kubernetes Network - Surya Seetharaman, Red Hat & Girish Moodalbail, NVIDIA Inc

Friday November 15, 2024 2:55pm - 3:30pm MST

Salt Palace | Level 1 | 155 E

Uninterrupted gameplay with minimal network latency, jitter, and maximum throughput is crucial for a great gamer experience. But how do we maintain consistent network quality in cloud gaming production environments at NVIDIA when 2K+ players (pods) share the same physical network for game storage and streaming? When a new player joins and a pod starts downloading large contextual game data, it is vital to shield other players on the same node from this 'noisy neighbor'. Kubernetes provides limited pod-level traffic shaping but we needed more than that. In this talk we will show how we achieved true Quality of Service and wire-speed networking on Kubernetes clusters using Differentiated Services Code Point (RFC7657) markings on pod traffic. Through a live demo that will involve a noisy pod and a victim pod, attendees will gain actionable insights and best practices around packet-parameter-tuned traffic shaping using simple Kubernetes Custom Resources to optimize network performance.

Speakers

Girish Moodalbail

Distinguished Engineer, NVIDIA Inc, NVIDIA Inc

Girish Moodalbail, a Distinguished Engineer at Nvidia Inc., builds Kubernetes-based GPU compute for gaming, AI training, and inferencing with low-latency, high-throughput, reliable, scalable, and secure networking using OSS (OVS, OVN, OVN-K8s CNI) and NVIDIA hardware. With over 22... Read More →

Surya Seetharaman

Principal Software Engineer, Red Hat Inc.

Surya is an Open Source advocate and contributor, active in the Kubernetes SIG-Network working group. She is working as a Principal Software Engineer at Red Hat in the OpenShift Networking team. Her areas of interest include Cloud Infrastructure and Networked Services and Systems... Read More →

Thousands of Gamers, One Kubernetes Network pdf

Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 1 | 155 E

Connectivity

Content Experience Level Intermediate

2:55pm MST

Object, Block, or File Storage? Choosing the Right Cloud Storage to Integrate Into Kubernetes - Mitch Becker & Tom McDonald, Amazon Web Services (AWS)

Friday November 15, 2024 2:55pm - 3:30pm MST

Salt Palace | Level 1 | Grand Ballroom A

This presentation helps simplify the container storage landscape to assist K8s users make educated cloud storage choices based on their workload requirements and data strategy. You already know K8s is a an open-source platform that orchestrates containerized applications. But what type of cloud storage should one deploy for stateless and stateful applications to ensure persistent data across various operational scenarios? Different storage types cater to specific use cases within K8s environments. Organizations often require persistent storage to run K8s for stateful use cases such as Large-Scale Application Deployment, High-Performance Computing (HPC), AI/ML, Microservices Management, CI/CD Pipelines, and Big Data Processing. Because Block, File, and Object Storage are used in varying ways for containerized workloads, this talk will explain use cases for each storage type and educate the attendees so their selection of storage supports their applications and overall data strategy.

Speakers

Tom McDonald

Sr. Storage Specialist SA, AWS

Tom McDonald is a Senior Workload Storage Specialist at AWS. Starting with an Atari 400 and re-programming tapes, Tom began a long interest in increasing performance on any storage service. With 20 years of experience in the Upstream Energy domain, file systems and High-Performance... Read More →

Mitch Becker

Sr. Storage Specialist, Amazon Web Services (AWS)

Accomplished cloud professional transforming and modernizing IT environments: Cloud Computing, Cloud Storage, HPC, AI, Containers, DevOps, & Cloud Adoption/Migration/Transformation. • CNCF Storage Technical Advisory Group Member • AWS --- Certified Cloud Practitioner, Industry... Read More →

Object, Block, or File Storage for K8s pdf

Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 1 | Grand Ballroom A

Data Processing + Storage

Content Experience Level Intermediate

2:55pm MST

This Platform Goes to 11: Boost Developer Productivity with Lessons from Salesforce - Joe Kutner, Salesforce

Friday November 15, 2024 2:55pm - 3:30pm MST

Salt Palace | Level 2 | 251 AD

Internal platforms play an essential role in boosting the productivity of developers who use cloud native technologies. That’s why Salesforce, a global leader in the cloud for more than two decades, evolved its existing collection of managed services and capabilities into a cohesive platform that delights developers. In this talk, you’ll learn how Salesforce's platform removes friction, unifies interfaces, and meets developers where they are with industry standard tooling. As you design and build your own platforms, you’ll be able to use the same principles that guided Salesforce to accelerate day-1 onboarding of new apps, increase the speed of the developer inner-loop and testing cycles, and reduce the time it takes to deliver new code to production. Our lessons learned will help you avoid missteps. Finally, you’ll learn how to measure developer satisfaction, performance, activity, collaboration, and efficiency to ensure that your platform delivers the most value for your developers.

Speakers

Joe Kutner

Software Architect, Salesforce

Joe is co-founder of the Cloud Native Buildpacks project, which aims to make containerization more secure and more developer friendly. He started the project in 2018 while working as DX Architect at Salesforce Heroku, and today is the DX Architect for Salesforce’s Hyperforce platform... Read More →

This Platform Goes to 11 Boost Developer Productivity with Lessons from Salesforce pptx

Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 2 | 251 AD

Platform Engineering

Content Experience Level Intermediate

4:00pm MST

Divide and Conquer: Master GPU Partitioning and Visualize Savings with OpenCost - Kaysie Yu & Ally Ford, Microsoft

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 2 | 255 E

Kubernetes is the ideal platform for running AI and ML workloads, such as LLMs. GPU nodes are often used for their parallel processing capabilities and higher performance benefits; however, they are known to be costly. Many factors impact the cost of running AI/ML workloads such as GPU utilization, GPU VM size, idle time, etc. These costs are often ignored and considered inherent in running GPU workloads. But if running workloads at scale and left unoptimized, costs will quickly spin out of control. In this talk, we leverage NVIDIA DCGM exporter with Prometheus for GPU metrics monitoring alongside OpenCost to measure the Kubernetes spend of our GPU workloads. We will provide an overview of OpenCost, highlighting its role in bridging the gap between the developer and platform teams through visibility and accountability of spend. We will demonstrate how to use the NVIDIA GPU Operator and how techniques such as partitioning can lead to significant cost savings.

Speakers

Ally Ford

Product Manager, Microsoft

Ally is a Product Manager on the Azure Kubernetes Service (AKS) team at Microsoft Azure. She spends her days collaborating with customers to design features that improve the end to end operator experience for both Linux and Windows users. Formerly she was a UX designer and project... Read More →

Kaysie

Product Manager, Microsoft

Kaysie Yu is a Product Manager on the Azure Kubernetes Service team at Microsoft. She works on cost management and optimization and is passionate about the convergence of FinOps and GreenOps, advocating for best practices that help organizations achieve cost efficiency while contributing... Read More →

Divide and Conquer Master GPU Partitioning and Visualize Savings with OpenCost pdf

Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 2 | 255 E

AI + ML

Content Experience Level Intermediate

4:00pm MST

Topology Aware Routing: Understanding the Tradeoffs - Rob Scott, Google

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 1 | 155 E

In Kubernetes 1.31, a new TrafficDistribution field on Services graduated to beta. This is effectively our third attempt at solving Topology Aware Routing in Kubernetes. This talk will tell the story of how we got here and what we learned along the way, outlining what exactly has made this problem so surprisingly complex. With that context, we’ll dive into exactly how Traffic Distribution works today, and when you should configure it. You’ll learn about how it’s implemented today, and how better implementations may be written in the future. We'll walk through some examples to show how it can work well, and when it may not. Finally, we’ll cover how this concept will interact with autoscaling, load balancers, Ingresses, Gateways, and Multi-Cluster Services. You should leave this talk with a clear understanding of how Topology Aware Routing works in Kubernetes, when to use it, and a broad awareness of the work that’s still in progress in this space.

Speakers

Rob Scott

Software Engineer, Google

KubeCon SLC Topology Aware Routing Understanding the Tradeoffs pdf

Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 1 | 155 E

Connectivity

Content Experience Level Intermediate

4:00pm MST

The Node Tetris Rabbit Hole: Why Your Binpacking Might Be Underperforming - Hannah Taub, Adobe Inc.

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 1 | 155 B

Have you ever looked at your Kubernetes cluster and thought “I have a perfectly good autoscaler! Why are all my nodes at less than 50% capacity?” When a team moves to the scale of hundreds of clusters with thousands of nodes, efficient binpacking changes from a side task to a financial necessity. From inefficient client apps to long-buried cluster configs, follow the Adobe Ethos team as they track down leads on what’s causing cluster underutilization and how to fix it. You will also learn some tips for designing your clusters to avoid these issues in the first place.

Speakers

Hannah Taub

Ms., Adobe Inc.

As a senior software engineer, Hannah has been working with Adobe’s Cloud Cost Efficiency team for the past several years. After graduating from the University of Edinburgh, she went from writing content APIs at Viacom (now Paramount) to building out Adobe’s Ethos Kubernetes CI/CD... Read More →

node tetris rabbit hole kubecon 2024 pptx

Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 1 | 155 B

Operations + Performance

Content Experience Level Intermediate

4:00pm MST

Migratory Patterns: Making Architectural Transitions with Confidence and Grace - Pete Hodgson, PartnerSlate

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 2 | 255 B

Big technical migrations - like switching databases - can feel like you're swapping out the engine of a bus while continuing to drive down the freeway (with all your users screaming in the back). However, there are ways to make these transitions safe, incremental, low-stress. In this talk we'll walk through a real-world case study of switching a production system from one database to another with no downtime, and no tears, using techniques like Expand/Contract, Dark Launch and Parallel Run. We'll also see hands-on examples of using CNCF open standards like Open Feature and Open Telemetry to manage this migration.

Speakers

Pete Hodgson

CTO, PartnerSlate

Pete Hodgson is an independent software delivery consultant. He helps engineering teams to level up and tackle their thorniest challenges, with a focus on agile engineering practices, architectural evolution, and lean process management. Prior to going independent he spent several... Read More →

Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 2 | 255 B

SDLC

Content Experience Level Intermediate

4:00pm MST

SPIFFE the Easy Way: Universal X509 and JWT Identities Using cert-manager - Tim Ramlot & Ashley Davis, Venafi

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 1 | Grand Ballroom B

SPIFFE is incredible. Each workload is assigned its own universal identity, simplifying the security and management of communications in distributed systems. While SPIRE (the reference SPIFFE implementation) is exceptionally powerful, it is also quite complex. Deploying SPIRE on Kubernetes requires StatefulSets, which can be challenging and frustrating. Many cloud vendors are starting to offer turnkey SPIFFE solutions, but that comes with risk of vendor lock-in. In this talk, we will demonstrate how to use the Cloud Native cert-manager solution to implement SPIFFE (x509 and JWT) with low operational overhead for all Kubernetes workloads. The session includes all you need to know to issue X.509 SVIDs, use them and validate them. Additionally, we will introduce an experimental solution to convert x509 SVIDs into JWT SVIDs. The demo will highlight how to authenticate to third-party APIs (such as AWS, GCP, Azure, and others) using these JWT SVIDs.

Speakers

Ashley Davis

Staff Software Engineer, Venafi

As a teenager, Ash taught himself to program after wondering how exactly video games were made. That led to adventures trawling through open source codebases, sparking an interest in computers spanning from bare-metal machine code right up to scalable distributed platforms like Kubernetes... Read More →

Tim Ramlot

Senior Software Engineer - cert-manager maintainer, Venafi

Tim started working at Venafi as a software engineer after his graduation as computer science engineer at Ghent University. He learned about cert-manager and Venafi through a Google Summer of Code internship. His mission at Venafi is to advance his problem solving skills, whilst contributing... Read More →

KubeCon NA 2024 spiffe pdf

Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 1 | Grand Ballroom B

Security

Content Experience Level Intermediate

4:00pm MST

Why Perfect Compliance Is the Enemy of Good Kubernetes Security - Michele Chubirka, Google

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 2 | 254 B

Technology organizations often struggle over who should manage the security of their Kubernetes environment. This task usually falls to platform or cloud engineering teams, but they often feel abandoned by their security counterparts, uncertain of which requirements will deliver real security value. While published benchmarks and security guides for Kubernetes are helpful, not all recommendations work for every use-case. They may require Kubernetes alpha or beta features which could cause issues with platform stability. Our desire to prioritize “perfect” security over having a functional platform that addresses relevant risks can leave us with nothing, frustrating everyone. Kubernetes is meant to increase application delivery velocity, but when overly strict compliance prevents a team from moving forward, they will subvert security requirements. Let’s stop obsessing over the red in our security and compliance dashboards and focus on what adds real value by reducing risk.

Speakers

Michele Chubirka

Cloud Security Advocate, Google

Michele Chubirka is a recovering Unix and network engineer currently working as a cloud security advocate for Google. She has been an architect, podcaster and freelance writer for various B2B publications such as Network Computing, Dark Reading and TechTarget. She likes long walks... Read More →

KCNA24 perfect compliance enemy good k8s security pdf

Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 2 | 254 B

Security

Content Experience Level Intermediate

4:55pm MST

Best of Both Worlds: Integrating Slurm with Kubernetes in a Kubernetes Native Way - Eduardo Arango Gutierrez, NVIDIA & Angel Beltre, Sandia National Laboratories

Friday November 15, 2024 4:55pm - 5:30pm MST

Salt Palace | Level 2 | 250 AD

It's not always clear which container orchestration system is best suited for a given use case. Slurm, for example, is often preferred over Kubernetes when running large-scale distributed workloads. As a result, organizations areoften faced a hard choice: do they deploy Slurm or Kubernetes to service the rising demands of their AI/ML workloads. In this talk, we introduce K-Foundry, an open-source custom controller for KCP that translates Kubernetes jobs to Slurm jobs and exposes Slurm nodes and cluster info as Kubernetes Custom Resource Definitions (CRDs). This integration combines Slurm’s robust job scheduling with Kubernetes' dynamic orchestration and API-driven ecosystem, easing the administration of both clusters through a common API. This session will end with a live demo, where attendees will see how this integration bridges the gap between cloud and HPC, facilitating resource management and optimizing performance for large-scale AI and LLM tasks.

Speakers

Eduardo Arango Gutierez DE

Senior systems software engineer, NVIDIA

Eduardo is a Senior Systems Software Engineer at NVIDIA, working on the Cloud Native Technologies team. Eduardo has focused on enabling users to build and deploy containers on distributed environments.

Angel Beltre

Senior Member of Technical Staff, Sandia National Laboratories

Angel Beltre serves as a senior member of the technical staff within the Scalable System Software department at Sandia National Laboratories. He is a contributor to the CSSE Computing-as-a-Service (CaaS) initiative, aimed at streamlining the deployment of modeling and simulation tools... Read More →

Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 2 | 250 AD

AI + ML

Content Experience Level Intermediate

4:55pm MST

Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh & Rupeng Liu, Google

Friday November 15, 2024 4:55pm - 5:30pm MST

Salt Palace | Level 2 | 255 E

Large Language Models have shown remarkable capabilities in various tasks, from text generation to code writing. However, the inference process for these models presents significant challenges. LLMs are computationally intensive, often requiring specialized hardware like TPUs or GPUs to achieve reasonable response times. In some cases their substantial size can strain the resources of a single machine. Specifically, models such as Gemini, Claude, and GPT4 are too large to fit on any single GPU or TPU device, let alone on any single multi-accelerator machine, necessitating what we refer to as multi-node server deployment where a single model server “backend” runs as a distributed process on multiple nodes to harness enough accelerator memory to fit and run the model. This talk presents LeaderWorkerSet, a new k8s API that enables multi-node model inference. We demonstrate its capabilities by orchestrating state of the art model servers such as vLLM and JetStream on both GPUs and TPUs.

Speakers

Abdullah Gharaibeh

Staff Software Engineer, Google

Abdullah is a staff software engineer at Google and sig-scheduling and working group batch co-chair. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.

Rupeng Liu

Software engineer, Google

Rupeng Liu, a software engineer from the Google's Kubernetes inference team

LeaderWorkerSet for distributed inference.pptx (1) pdf

Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 2 | 255 E

AI + ML

Content Experience Level Intermediate

4:55pm MST

Service Profiling Based Management and Scheduling in K8s - Jia Deng, Cong Xu & Mingmeng Luo, Bytedance

Friday November 15, 2024 4:55pm - 5:30pm MST

Salt Palace | Level 1 | 155 B

We present an open-source solution for the efficient management of resources and scheduling strategies in K8s. Our solution constructs workload-specific resource profiles based on their historical utilization patterns. This approach ensures that workloads receive adequate resources while optimizing overall resource utilization. To accomplish this objective, we employ a custom resource Service Profiling Description (SPD), facilitating a direct correlation between workloads and their resource usages, such as deployments and stateful sets etc. Resource utilization metrics, including CPU, disk I/O, and network I/O, are meticulously collected and aggregated. These usage indicators play a pivotal role in informing the scheduler's decisions regarding workloads allocation. This solution has been deployed within large-scale K8s clusters, addressing diverse workload demands, ranging from those requiring dedicated NUMA nodes to those capable of resource sharing among themselves.

Speakers

Mingmeng Luo

Software Engineer, Bytedance

Mingmeng Luo is a software engineer in the Infrastructure Department at ByteDance, where he specializes in the design and development of precision resource management technologies for large-scale Kubernetes clusters. His work focuses on optimizing resource allocation and efficiency... Read More →

Cong Xu

Senior Software Engineer, Bytedance

Cong Xu is a Tech Lead and Senior Software Engineer at ByteDance, where he focuses on building and optimizing the container-based cloud platform that hosts internal products such as Douyin and TikTok. From 2016 to 2022, he served as a Staff Research Member at IBM Research, contributing... Read More →

Jia Deng

Software Engineer, Bytedance

The speaker currently works for bytedance K8s orchestration team. Before that, the speaker worked for amazon EKSA and VMware Tanzu Mission Control.

KCNA24 2024 Service Profiling Based Resource Management and Scheduling pdf

Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 1 | 155 B

Operations + Performance

Content Experience Level Intermediate

4:55pm MST

Zero Downtime Upgrades at Scale: How Okta Manages Hundreds of Clusters Daily - Jérémy Albuixech & Kahou Lei, Okta

Friday November 15, 2024 4:55pm - 5:30pm MST

Salt Palace | Level 2 | 251 AD

How do you upgrade your K8s clusters? Perhaps a rolling update of nodes, with services moving around? Can you guarantee a zero-downtime upgrade? Will this method scale and support the velocity of production environments? Likely not. But fear not - you are not alone! At Okta, we maintain hundreds of clusters, each hosting >130 services, with node counts ranging from 20-400 and we are updating them daily. How do we do it? Without an out-of-the-box solutions we had to build our own and we want to share what we learned with all of you! In this talk Kahou and Jeremy will go over the challenges and successes, highlighting how their deployment method provides the foundational blocks to build extra features while reducing the blast radius when something goes wrong - thanks to quick rollbacks and a canary rollouts. In this session attendees will learn how we leverage open source technologies to tackle three main problems: how to scale, how to secure and how to upgrade clusters with no downtime.

Speakers

Jérémy Albuixech

Staff Software Engineer, Okta

Jeremy is a Staff Software Engineer at Okta. Starting as a full stack programmer with a good foundation in Javascript, he then gravitated towards a DevOps role and later became a member of the SRE team at Cisco, picking up an IaC, observability and Kubernetes skillset. With the Okta... Read More →

Kahou Lei

Principal Software Engineer, Okta

Kahou Lei is a Principal Software Engineer with a strong background in Cloud infrastructure and Kubernetes. With 20 years of industry experience, he has held significant positions at renowned companies such as Okta and Cisco. Kahou leads critical software engineering initiatives... Read More →

Zero Downtime Upgrades at Scale How Okta Manages Hundres of Clusters Daily pdf

Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 2 | 251 AD

Platform Engineering

Content Experience Level Intermediate