Loading…
Attending this event?
In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Intermediate clear filter
Tuesday, November 12
 

5:55pm MST

⚡ Lightning Talk: Is Everyone O-KEDA? “Exciting” Lessons Learned in Our Journey to Use KEDA Pod Autoscaling - Brian Davis, Red Canary
Tuesday November 12, 2024 5:55pm - 6:00pm MST
We thought that changing our Kubernetes pod autoscaler seemed like a really straightforward thing to do. With relative ease, we yanked out our old custom pod autoscaler and replaced it with KEDA. We were impressed with the flexibility and control we now had in our cluster, but then discovered a set of really hard lessons that no one had anticipated. In this lightning talk, I’ll hit the highlights of secondary issues we encountered due to such a seemingly simple change, such as Docker Hub rate limits, Kubernetes metrics server failures and their exciting impact on our cluster, AWS rate limits, and late night fights with Argo CD for control of pod maximums. Lastly, I’ll share my personal favorite topic: the “Night Club Theory” of autoscaling tuning. If you or someone you love is thinking of changing your autoscaler, I recommend spending 5 minutes with me to learn the things you should be aware of before you make the switch!
Speakers
avatar for Brian Davis

Brian Davis

Principal Software Engineer, Red Canary
Brian Davis is a Principal Engineer at Red Canary and has built complex systems for the past two decades. His career started in signal processing algorithm research but has morphed through the years into software engineering, QA, system integration, system design, and architectur... Read More →
Tuesday November 12, 2024 5:55pm - 6:00pm MST
Hyatt Regency | Level 4 | Regency Ballroom BCD

6:05pm MST

⚡ Lightning Talk: Minimizing Data Loss Within the OpenTelemetry (OTel) Collector - Alex Kats, Capital One
Tuesday November 12, 2024 6:05pm - 6:10pm MST
The OTel collector is meant to serve as a reliable and highly performant data pipeline. However, as a single component in a wider observability architecture, it is only as reliable as the downstream platforms/services it exports data to. The OTel collector has several built in mechanisms that aim to minimize the impact of unhealthy downstream exporters, including an out of the box sending queue with an additional configuration parameter for persistent queueing. There is a new component in the OTel contrib distribution, the Failover Connector. The Failover Connector allows for dynamic routing or “failover” of telemetry data based on downstream exporter health. This provides significant improvement to the data resiliency of the collector, as telemetry data can be continuously exported to a set of stable secondary locations, while the issues with the primary are resolved.
Speakers
avatar for Alex Kats

Alex Kats

Software Engineer, Capital One
Alex is a software engineer at Capital One. Alex has significant experience within the Observability space, with an emphasis on OpenTelemetry (OTel). Alex is a member of the OpenTelemetry community and has been contributing to various components within the OTel toolset.
Tuesday November 12, 2024 6:05pm - 6:10pm MST
Hyatt Regency | Level 4 | Regency Ballroom BCD

6:15pm MST

⚡ Lightning Talk: Safer Cluster Upgrades with Mixed Version Proxy - Richa Banker, Google
Tuesday November 12, 2024 6:15pm - 6:20pm MST
Upgrading Kubernetes clusters often presents numerous challenges, including potential downtime, compatibility issues, and the complexity of managing multiple versions. The Mixed Version Proxy feature introduced in Kubernetes 1.28 aims to mitigate these challenges. This talk will delve into the technical intricacies of the Mixed Version Proxy, exploring its design and implementation. We will then highlight the substantial benefits it offers for cluster upgrades, such as minimizing downtime and enhancing overall reliability. Attendees will gain practical knowledge through (possibly a demonstration) on enabling and utilizing the Mixed Version Proxy. Finally, we will provide insights into the future roadmap for this feature, including upcoming beta releases and enhancements.
Speakers
avatar for Richa Banker

Richa Banker

Richa Banker, Google
Currently a software engineer at Google. Exploring and contributing to OSS Kubernetes on the side.
Tuesday November 12, 2024 6:15pm - 6:20pm MST
Hyatt Regency | Level 4 | Regency Ballroom BCD
 
Wednesday, November 13
 

11:15am MST

Architecting Tomorrow: The Heterogeneous Compute Resources for New Types of Workloads - Alexander Kanevskiy, Intel Finland
Wednesday November 13, 2024 11:15am - 11:50am MST
Imagine managing a set of diverse workloads on a Kubernetes node, operating across dozens of CPU cores and several memory zones. But do you truly comprehend the difference between one CPU core versus another? Are you aware of the impact that different memory zone might have on your workload's efficiency? Will optimisations for one type of workloads be helpful for another? Do you think that your ML workload will behave same way as e.g. Redis? This presentation delves deep into CPU internals, memory types (DRAM, HBM, CXL), and diverse cache/core types and layouts. Explore recent hardware advancements and their impact on workloads. We'll examine native compute resource allocation strategies from a hardware point of view, crucial for enhancing workload performance and optimising energy usage and cost efficiency. Join and learn details of the modern hardware architecture that gives you a framework to make more informed choices on hardware resource optimisation for your infrastructure.
Speakers
avatar for Alexander Kanevskiy

Alexander Kanevskiy

Principal Engineer, Cloud Orchestration Software, Intel Finland
Alexander is currently employed by Intel as Principal Engineer, Cloud Software, focusing on various aspects in Kubernetes: Resource Management, Device plugins for hardware accelerators, Cluster Lifecycle and Cluster APIs. Alexander has over 25+ years of experience in areas of Linux... Read More →
Wednesday November 13, 2024 11:15am - 11:50am MST
Salt Palace | Level 2 | 255 BC
  Emerging + Advanced

12:10pm MST

Operationalizing High-Performance GPU Clusters in Kubernetes: A Case Study of Databricks' DBRX - Will Gleich & Wai Wu, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Training large language models (LLMs) on GPUs within Kubernetes environments involves significant configuration and complexity, often leading to unique failure scenarios. This presentation will cover the lessons learned from training DBRX, a state-of-the-art LLM, that we developed on a 400-node cluster with a primary workload utilizing 3072 GPUs and the tooling needed to measure and maintain a healthy fleet of nodes and underlying interconnect fabric. This will include: * How we implemented GPU health detection leveraging Prometheus and DCGM Exporter * How we monitor GPU Direct Remote Direct Memory Access (GDRDMA) and the challenges of monitoring components that bypass CPU * Discussion of failure scenarios during training, and how they were addressed Databricks Mosaic AI Training leverages GPU clusters across many cloud providers to maximize availability; we will also discuss the variations we see and how we had to engineer around them.
Speakers
WW

Wai Wu

Databricks
avatar for Will Gleich

Will Gleich

Sr. DevOps Engineer, Databricks
Will Gleich is a Sr. DevOps engineer at Databricks specializing in MLOps and Site Reliability Engineering.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

12:10pm MST

Can Your Kubernetes Network Handle the Heat? Building Resilience with AI Chaos - Lior Lieberman, Google & Surya Seetharaman, Red Hat
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Kubernetes networking is complex with many APIs, numerous configurations and potential failure points. In the rapidly evolving world of cloud-native applications, ensuring your Kubernetes network can withstand unexpected failures is not just an advantage—it is a necessity. In this talk Surya and Lior, holding distinct leadership roles in Gateway API and NetworkPolicy API, will demonstrate how you can leverage AI-powered Chaos Engineering to stress test Gateways, NetworkPolicies, and Services on a live cluster! They will share their experiences and lessons learned from using Litmus and enhancing K8sGPT to design and execute AI Chaos experiments, as well as focusing on how you can proactively find gaps and bottlenecks in the network infrastructure. This is a great opportunity to learn from real-world disruption scenarios and participate in a collaborative discussion on how we can leverage AI to build robust Kubernetes Networks.
Speakers
avatar for Surya Seetharaman

Surya Seetharaman

Principal Software Engineer, Red Hat Inc.
Surya is an Open Source advocate and contributor, active in the Kubernetes SIG-Network working group. She is working as a Principal Software Engineer at Red Hat in the OpenShift Networking team. Her areas of interest include Cloud Infrastructure and Networked Services and Systems... Read More →
avatar for Lior Lieberman

Lior Lieberman

Site Reliability Engineer, Google
Lior is site reliability engineer at Google working on Google Compute Engine. He is a leading maintainer of ingress2gateway, and an active contributor to Kubernetes SIG network focused on Gateway API.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

12:10pm MST

Automated Multi-Cloud, Multi-Flavor Kubernetes Cluster Upgrades Using Operators - Ziyuan Chen, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Databricks manages over a thousand k8s clusters across three major cloud providers which run critical workloads in cloud regions around the world. This talk describes the system we built to upgrade nodes’ operating system, k8s version, and other configs monthly, supporting EKS, AKS, GKE, and self-managed k8s. Our system is built on k8s operators and performs zero-downtime blue-green rolling updates, respects contracts with services with features like PDBs, maintenance windows, deferred node draining, and custom workload handling plugins. It enables easy rollbacks, has good observability, and incurs minimal human operational cost. This has allowed us to patch vulnerabilities and release infrastructure changes quickly and reliably across the fleet. We will also share our lessons learned on building several operators that work together using the controller-runtime framework, designing the declarative interfaces between them, and achieving consistent behavior across three clouds.
Speakers
avatar for Ziyuan Chen

Ziyuan Chen

Software Engineer, Databricks
Ziyuan Chen is a software engineer at Databricks. He has worked on Databricks' cloud platform and OS infrastructure.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

12:10pm MST

Automated Multi-Cloud Blue-Green Cluster Rotations: Zero Downtime Upgrades at Scale - Sourav Khandelwal, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
I will present the system developed for cluster rotations across Databricks’ fleet of over a thousand cloud-managed k8s clusters on AWS, Azure, and GCP. Blue-green cluster rotations, or cluster swaps (upgrading by creating a new k8s cluster with a new version/configuration & shifting workloads from the old cluster), allow us to implement major infrastructure changes and upgrade k8s versions with low risk through staged rollouts, seamless rollbacks, zero downtime, and minimal operator intervention. Our system includes a k8s-style continuous reconciliation mechanism to manage cluster swap lifecycles, a fast and reliable cluster state change discovery system, and a k8s workload migration system. We will share methodologies and experiences in constructing this loosely coupled system that orchestrates product workloads and cloud provider APIs for automated cluster swaps. This session will explore the challenges faced, and the benefits of automating large-scale, multi-cloud k8s upgrades.
Speakers
avatar for Sourav Khandelwal

Sourav Khandelwal

Sr. Software Engineer, Databricks
I am a seasoned software engineer with over 10 years of experience in designing and managing large-scale platforms in cloud-native environments. At Databricks, my significant contributions have been pivotal in launching our next-generation cloud infrastructure that helped to transition... Read More →
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | Grand Ballroom BDF
  Platform Engineering

12:10pm MST

The Hard Truth About GitOps and Database Rollbacks - Rotem Tamir, Ariga
Wednesday November 13, 2024 12:10pm - 12:45pm MST
For two decades now, the common practice for handling rollbacks of database schema migrations has been pre-planned "down migration scripts". A closer examination of this widely accepted truth reveals critical gaps that result in teams relying on risky, manual operations to roll back schema migrations in times of crisis. In this talk, we show why our existing tools and practices cannot deliver on the GitOps promise of "declarative" and "continuously reconciled" workflows and how we can use the Operator Pattern to build a new solution for robust and safe schema rollbacks.
Speakers
avatar for Rotem Tamir

Rotem Tamir

CTO, Ariga
Rotem Tamir (38), father of two. Co-founder and CTO of Ariga, co-maintainer of Atlas and Ent. Ex-data platform architect at Nexar, infrastructure team lead at ironSource.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 2 | 250
  SDLC

2:30pm MST

Optimizing LLM Performance in Kubernetes with OpenTelemetry - Ashok Chandrasekar, Google & Liudmila Molkova, Microsoft
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Large Language Models are increasing in popularity and their deployments on Kubernetes have steadily increased. LLM applications bring new usage patterns that the industry does not have the expertise in. At the same time, there is a lack of observability in these deployments which makes it difficult to debug performance issues. We will present an end to end walkthrough of how you can leverage client and server LLM observability using Open Telemetry based on the recent efforts in the Kubernetes and Open Telemetry communities to standardize these across LLM clients and model servers. We will also demonstrate how to troubleshoot a real-world performance issue in your LLM deployment and how to optimize your LLM server setup for better performance on Kubernetes. We'll show how to use Kubernetes autoscaling based on custom model server metrics and demonstrate how they offer a superior alternative to using GPU utilization metrics for such deployments.
Speakers
avatar for Liudmila Molkova

Liudmila Molkova

Principal Software Engineer, Microsoft
Liudmila Molkova is a Principal Software Engineer at Microsoft working on observability and Azure client libraries. She is a co-author of distributed tracing implementations across the .NET ecosystem including HTTP client instrumentation and Azure Functions. Liudmila is an active... Read More →
avatar for Ashok Chandrasekar

Ashok Chandrasekar

Senior Software Engineer, Google
Ashok Chandrasekar is a Senior Software Engineer at Google working on AI/ML experience for Google Kubernetes Engine. Previously he was a Staff Engineer at VMware where he led the cluster lifecycle management area for Tanzu Mission Control. He has 7 years of Kubernetes experience working... Read More →
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

2:30pm MST

AIStore as a Fast Tier Storage Solution: Enhancing Petascale Deep Learning Across Cloud Backends - Abhishek Gaikwad & Aaron Wilson, NVIDIA
Wednesday November 13, 2024 2:30pm - 3:05pm MST
As deep learning continues to evolve, the demand for handling petascale datasets efficiently becomes paramount. Current cloud storage solutions often struggle with the speed (throughput) and cost-effectiveness required for these massive datasets, particularly due to the random access needs of machine learning workloads. This talk introduces AIStore (AIS) as a fast-tier storage solution designed to overcome these challenges by offering a fast, scalable, cost-effective tier for deep learning data. AIS features linear scalability with each added storage node - in fact, with each added drive. In this presentation, we will explore the architecture and benefits of AIStore, focusing on its linear scalability and high performance. This session will feature detailed benchmarks and use cases comparing the performance of accessing cloud datasets with and without AIStore, highlighting AIS's ability to deliver high per-GPU throughput and stable latencies.
Speakers
avatar for Abhishek Gaikwad

Abhishek Gaikwad

Software Engineer, NVIDIA
Abhishek Gaikwad is a Software Engineer at NVIDIA with a Master of Science degree in Computer Science from San Jose State University. As a key developer and maintainer of AIStore, Abhishek has played a crucial role in its design, development, and management. His contributions include... Read More →
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom GI
  Data Processing + Storage

2:30pm MST

Unifying Observability: Correlating Metrics, Traces, and Logs with Exemplars and OpenTelemetry - Anusha Reddy Narapureddy & Charlie Le, Apple
Wednesday November 13, 2024 2:30pm - 3:05pm MST
In modern distributed systems, observability is key to understanding application performance and behavior. While metrics, traces, and logs each provide valuable insights, their true power is realized when they are correlated. This talk will dive into the practical benefits and implementation of correlating these signals with exemplars using the OpenTelemetry SDK and Collector, and showcase the results in Grafana. Attendees will learn how to leverage OpenTelemetry to create exemplars which will allow them to navigate from either logs or metrics to their traces.
Speakers
avatar for Anusha Reddy Narapureddy

Anusha Reddy Narapureddy

Senior Software Engineer, Apple
Anusha is an enthusiastic software engineer who is passionate about observability, distributed systems, and cloud-native technologies. She has extensive experience in designing and building highly available, scalable, and fault-tolerant systems in the cloud.
avatar for Charlie Le

Charlie Le

Senior Software Engineer, Apple
Charlie is a software engineer at Apple, specializing in building and scaling cloud native observability solutions and infrastructure. Deeply inspired by the collaborative spirit of open source, he actively contributes to projects like Cortex and OpenTelemetry, shaping the future... Read More →
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom HJ
  Observability

2:30pm MST

Does My K8s Application Need CPR? Performance Evaluation of a Multi-Cluster Workload Management App - Braulio Dumba & Ezra Silvera, IBM
Wednesday November 13, 2024 2:30pm - 3:05pm MST
KubeStellar (KS) is an open-source Kubernetes multi-cluster workload configuration management system that can be used to manage AI workloads in multi-cluster environments. Hence, understanding KS performance is crucial especially when managing resource intensive AI workloads. In this talk, we will present our experience in analyzing the performance metrics of KS across several dimensions of scalability (e.g., number of bindingPolicies, workload description spaces and number of managed remote clusters) and challenges that arise when conducting performance experiments in a multi-cluster environment. Our insights will demonstrate the utility of benchmarking the performance of a multi-cluster Kubernetes workload management application. Additionally, in this talk, we will demonstrate the usefulness of using several opensource tools such as clusterloader2, kube-burner & kwok to evaluate the performance of multi-cluster Kubernetes management applications.
Speakers
avatar for Ezra Silvera

Ezra Silvera

Senior Technical Staff Member, IBM
Ezra Silvera is a Senior Technical Staff Member at IBM Research. His interests include distributed systems, cloud management, and cloud infrastructure. Ezra is passionate about open-source technologies and has been involved in several notable open source projects such as Docker, KubeVirt... Read More →
avatar for Braulio Dumba

Braulio Dumba

Staff Research Scientist, IBM
Dr. Braulio Dumba is a Staff Research Scientist at IBM Research. In 2018, he joined IBM under the Hybrid Cloud organization. His current research is focus on edge computing and hybrid cloud computing. Dr. Dumba earned a Ph.D. in Computer Science from University of Minnesota, Twin... Read More →
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

2:30pm MST

Better Pod Availability: A Survey of the Many Ways to Manage Workload Disruptions - Zach Loafman, Google
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Kubernetes Pods are ephemeral, but some are more ephemeral than others. Kubernetes provides a dizzying array of options to manage and handle Pod disruption. From PodDisruptionBudgets, to "safe-to-evict" annotations, GracefulTermination timeouts and more, it can be incredibly hard to determine the optimal solution for handling Pod disruption and how to manage gracefully terminating your application. Thankfully, due to the extensible nature of Kubernetes we can build CRDs and controllers that can simplify these complex topics for end users. In this talk, we'll present an in-depth analysis of the built-in options and how they work (or don't). While this problem is not unique to game-serving, we'll deep-dive and explain how Agones (an open-source session orchestration system layered on Kubernetes) solves this problem with a simple abstraction to hide the complexity!
Speakers
avatar for Zach Loafman

Zach Loafman

Staff Software Engineer, Google
Zach leads Google’s GKE Games team. He was previously lead of the Kubernetes Control Plane team for GKE, lead of the GKE Cluster Lifecycle team, worked on Kubernetes prior to GA, and was one of the founding members of the Google Kubernetes Engine team.
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom BDF
  Platform Engineering

2:30pm MST

Tutorial: Confidential Containers 101: A Hands-on Workshop - Archana Choudhary & Suraj Deshmukh, Microsoft
Wednesday November 13, 2024 2:30pm - 4:00pm MST
As traditional enterprises with stringent data protection requirements become cloud-native and migrate to Kubernetes on public clouds, they are wondering: “Is my data secure on this shared hardware? Can someone with a host access snoop on my data?” And especially, with the upcoming Digital Operational Resilience Act (DORA) in Europe mandating data protection in use, it’s crucial for users to familiarize themselves with solutions like Confidential Containers (CoCo), a CNCF sandbox project. In this, first of its kind, hands-on workshop we’ll dive deep into using CoCo with k8s. We’ll explore real-world challenges, such as ensuring data confidentiality from platform owners (cloud providers), and show you how to overcome them. Through practical exercises, you’ll learn to set up CoCo and secure your containerized workloads, turning theory into practice. Attendees will discover streamlined practices, find robust protection mechanisms, and gain strategic insights into adopting CoCo.
Speakers
avatar for Suraj Deshmukh

Suraj Deshmukh

Senior Software Engineer, Microsoft
Suraj is working on Confidential Containers open-source project for Microsoft. He has been working with Kubernetes since version 1.2. He is currently focused on integrating Kubernetes and Confidential Containers on Azure.
avatar for Archana Choudhary

Archana Choudhary

Ms, Microsoft
A software engineer who has been exploring cloud-native technologies, particularly focusing on confidential containers over the past several months.
Wednesday November 13, 2024 2:30pm - 4:00pm MST
Salt Palace | Level 1 | Grand Ballroom ACE
  Tutorials, Security

3:25pm MST

A Tale of 2 Drivers: GPU Configuration on the Fly Using DRA - Alay Patel & Varun Ramachandra Sekar US, Nvidia
Wednesday November 13, 2024 3:25pm - 4:00pm MST
NVIDIA’s GeForceNow is a cloud gaming service that allows users to stream video games from NVIDIA's servers to a wide range of devices, including PCs, Macs, Android devices, iOS devices, and smart TVs. Under the hood, it is powered by Kubernetes running Kubevirt VMs. For a seamless user experience, GeForceNow dynamically switches GPU drivers to accommodate either passing through an entire GPU or slicing it into multiple virtual GPUs, all while keeping utilization close to 100% across the datacenter. This poses significant challenges when using the traditional device plugin API provided by Kubernetes. In this talk, we explore GeForce Now’s journey to transition away from the traditional device plugin API in favor of Dynamic Resource Allocation (DRA). We'll share valuable insights for anyone looking to perform a similar migration of their own. Join us to learn about the challenges, solutions, and best practices to help optimize your GPU-accelerated workloads in the cloud.
Speakers
avatar for Alay Patel

Alay Patel

Senior Software Engineer, Nvidia
Alay is a Senior Software Engineer at Nvidia where he works on cloud gaming service, exposing infrastructure for GPU workloads. He is passionate about open source with a focus on Kubernetes and platform engineering.
avatar for Varun Ramachandra Sekar US

Varun Ramachandra Sekar US

Senior Software Engineer, Nvidia
Developer by day, Dog whisperer by night.
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 2 | 255 EF
  AI + ML

3:25pm MST

Using OpenTelemetry for Deep Observability Within Messaging Queues - Shivanshu Raj Shrivastava, SigNoz & Ekansh Gupta, Zeta
Wednesday November 13, 2024 3:25pm - 4:00pm MST
The recent changes in OpenTelemetry have made new semantic conventions and changes in agents to better monitor messaging queues such as Kafka, RabbitMQ, and Amazon SQS, etc. In this session, we'll discuss how those semantic conventions are standardizing the telemetry collected from producers, consumers, and the messaging queues, and how in-depth observability can be achieved by correlating producer-to-consumer spans with the metrics collected from Kafka. Additionally, We will demonstrate how the Kafka Java client side instrumentation enabled and JMX metrics collected from Kafka how OpenTelemetry instrumentation can help for metrics to trace and trace to metrics correlation and spot reasons for anomalies like increased consumer lag, partition failures, time taken by messaging queues. This will also help in giving the corresponding traces in time that can help end users to better delve into their infrastructures and optimize their asynchronous applications.
Speakers
avatar for Ekansh Gupta

Ekansh Gupta

SDE, Zeta
Ekansh is a Software Development Engineer with Zeta Suite, with active involvement in various open-source and cloud native communities for upwards two years now. He was previously an SDE Intern at SteamLabs. He is also a speaker for a couple of talks at PyCon, KubeCon and MozFests... Read More →
avatar for Shivanshu Raj Shrivastava

Shivanshu Raj Shrivastava

Founding Engineer, SigNoz
Shivanshu is a Founding Engineer at SigNoz, working on building an OTeL native observability product. He has a keen interest in deep tech and OSS. He is a CNCF ambassador and a member of CNCF projects like OTeL, k8s, and Istio. He has got the opportunity to mentor contributors in... Read More →
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | Grand Ballroom HJ
  Observability

3:25pm MST

Global Payments: Setting New Standards for Reliability in Cloud Native Multi-Region Applications - Trey Caliva, Global Payments
Wednesday November 13, 2024 3:25pm - 4:00pm MST
As a multinational FinTech provider, processing over 32 billion card transactions for 816 million accounts, Global Payments requires globally available architectures with quick disaster recovery while maintaining subsecond latencies. In addition, these workloads require strict adherence to compliance standards. This session will explore the high-level architectural decisions implemented in a cloud-native redesign and cloud migration of a mission critical legacy .NET application. Key cloud native tools leveraged include Kubernetes on GCP, and the use of CockroachDB as a cloud native database solution. We will explore how leveraging these cloud native technologies achieved extreme fault tolerance in a multi-region deployment, setting new standards for performance and reliability.
Speakers
avatar for Trey Caliva

Trey Caliva

Principal Cloud Architect, Global Payments
Trey Caliva is an Architect and engineer with 10+ years of hands-on experience planning, developing, managing, and securing deployments in Google Cloud and AWS. He is currently Principal Cloud Architect at Global Payments, a Fortune 500 company and a member of the S&P 500 focused... Read More →
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

3:25pm MST

Scale Job Triggering with a Distributed Scheduler - Cassie Coyle & Artur Souza, Diagrid
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Imagine scheduling thousands or millions of jobs that are persisted and triggered timely and resilient to downtime. Some jobs might be triggered every second while others need to reliably be triggered on the first day of the month. Achieving high throughput and reliability is critical for the performance and operational efficiency of modern distributed systems. How can traditional cron job scheduling be extended? How can distributed systems handle job scheduling with minimal downtime? What challenges arise when scaling job scheduling to thousands or millions of jobs? In this session, Artur and Cassie will delve into the design of Dapr’s distributed Scheduler and how users can start using it today. You will gain a comprehensive understanding of how Dapr’s Scheduler unblocks scalability of actors and workflows while also enabling new capabilities, like delayed pubsub and schedule job API.
Speakers
avatar for Artur Souza

Artur Souza

Head of Engineering, Diagrid
I am a maintainer of Dapr since 2019, helped the project reach the 1.0 stable version and keeping frequent releases since then. Currently Head of Engineering at Diagrid, leading the engineering teams building Conductor and the next generation of managed cloud native APIs via Dapr... Read More →
avatar for Cassie Coyle

Cassie Coyle

Software Engineer, Diagrid
Cassie, a devoted software engineer at Diagrid actively contributes to Dapr, focusing on Go backend development to simplify the creation of resilient, event-driven, and microservices-based apps. She is a member of the Dapr Day and AppDeveloperCon 2024 program committees. Her work... Read More →
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 2 | 250
  SDLC

3:25pm MST

CEL-Ebrating Simplicity: Mastering Kubernetes Policy Enforcement - Kevin Conner, Getup Cloud & Anish Ramasekar, Microsoft
Wednesday November 13, 2024 3:25pm - 4:00pm MST
As Kubernetes deployments grow increasingly complex, robust policy enforcement is crucial. The Common Expression Language (CEL) provides a powerful solution, enabling the creation of sophisticated, human-readable expressions for Kubernetes policies. This session explores CEL's integration with Kubernetes, simplifying policy definition and enforcement. Key takeaways: - Fundamentals of CEL and its Kubernetes integration. - Practical use cases for CEL in admission control, resource management, and security. - Enhancing policy expressiveness and flexibility with CEL. - Introduction to CEL Playground for testing and validating CEL expressions. Through live demos, learn to leverage CEL and CEL Playground for streamlined policy management in Kubernetes. Ideal for administrators, developers, and DevOps professionals, this session equips you to enhance your Kubernetes policies using CEL. Join us to discover how CEL and CEL Playground can transform your Kubernetes policy management.
Speakers
avatar for Anish Ramasekar

Anish Ramasekar

Principal Software Engineer, Microsoft
Anish Ramasekar is a software engineer at Microsoft. He is on the Azure Container Upstream team building features for Kubernetes upstream and various CNCF projects that are part of the Azure Kubernetes Service. Anish is a maintainer of the Secrets Store CSI Driver project.
avatar for Kevin Conner

Kevin Conner

Chief Engineer, Getup Cloud
Kevin Conner is the Chief Engineer at GetUp Cloud, a startup focused on Kubernetes and DevSecOps. He has worked at startups like Integrated Micro Products, Arjuna Technologies, JBoss, and Aviatrix, as well as Sun Microsystems and Red Hat where he led teams for Cloud Enablement, Service... Read More →
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | 151
  Security

4:30pm MST

Making Kubernetes Simpler for Accelerated Workloads - Susan Wu, Google; Lucy Sweet, Uber; Mitch McKenzie, Weave; Aditya Shanker, Crusoe
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Kubernetes and the open-source ecosystem for AI frameworks have been great for LLM innovation, empowering developers to build applications that use natural language as the interface to data. Yet, many developers and cluster operators struggle to put these frameworks into production use. In this session, hear from several platform engineers responsible for designing core infrastructure supporting accelerated workloads, services, large language model training and inference pipelines. You can expect to come away with guidance, hear of pitfalls to watch out for and learn how they successfully abstracted the infrastructure complexity to improve their research users' experience and velocity. Panelists include: Lucy Sweet, Senior Software Engineer (Infrastructure), Uber, Mitch McKenzie, Site Reliability Engineer - Machine Learning Operations, Weave, Susan Wu, Outbound Product Manager, Google
Speakers
avatar for Susan Wu

Susan Wu

Outbound Product Manager, Google
Susan is an Outbound Product Manager for Google Cloud, focusing on GKE Networking and Network Security. She previously led product and technical marketing roles at VMware, Sun/Oracle, Canonical, Docker, Citrix and Midokura (part of Sony Group). She is a frequent speaker at conferences... Read More →
avatar for Lucy Sweet

Lucy Sweet

Senior Software Engineer at Uber, Uber
Lucy is a Senior Software Engineer at Uber Denmark who works on software infrastructure
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 2 | 255 EF
  AI + ML

4:30pm MST

Platform Performance Optimization for AI - a Resource Management Perspective - Antti Kervinen, Intel & Dixita Narang, Google
Wednesday November 13, 2024 4:30pm - 5:05pm MST
How much node resource management can affect AI workload performance? What options are there? What is the trade-off between total throughput and low latencies? In this talk we take a systematic approach to Platform Performance Optimization. We walk through the whole path from goal setting, gathering data, analysis, visualizations and conclusions. At each stop along the path we share our practical experiences in a case of LLM inference optimization. You will find many considerations, findings and practical tricks to take away. For instance, how to instrument PyTorch without touching the source or a container image, how to enable changing what we are measuring without new expensive benchmark reruns, and how much more we can learn from visualizations compared to numeric averages and percentiles. Finally we share real results from our case: how resource management increased total token throughput per worker node by more than 3.5x from the baseline.
Speakers
avatar for Antti Kervinen

Antti Kervinen

Cloud Orchestration Software Engineer, Intel
Antti Kervinen is a Cloud Orchestration Software Engineer working at Intel, whose interest in Linux and distributed systems has led him from academic research of concurrency to the world of Kubernetes. When unplugged, Antti spends his time outdoors discovering wonders of nature.
avatar for Dixita Narang

Dixita Narang

Software Engineer, Google
Dixita Narang is a Software Engineer at Google on the Kubernetes Node team. With a primary focus on resource management within Kubernetes, Dixita is deeply involved in the development and advancement of the Memory QoS feature, which is currently in the alpha stage. She is a new contributor... Read More →
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

4:30pm MST

From Observability to Performance - Nadia Pinaeva, Red Hat & Antonio Ojea, Google
Wednesday November 13, 2024 4:30pm - 5:05pm MST
No matter how fast the Services on your Kubernetes cluster are, users would love them to be faster. But how do you get from a huge pile of metrics across a distributed system to real user experience improvements? There is a way, and with the right tools and the right approach, you can better understand and evaluate Service performance. In this talk, you'll learn how to identify the performance parameters that directly translate to user experience. We will explore how to collect performance metrics from running Kubernetes clusters without disrupting normal operations using tools like Prometheus, Grafana, kube-burner, and custom instrumentation. We will discuss how to translate the collected metrics and analysis into concrete actions and how to identify bottlenecks and implement optimizations to enhance Service performance. This talk is ideal for k8s networking developers, administrators, SREs, DevOps engineers, and anyone responsible for managing or optimizing Kubernetes networking.
Speakers
avatar for Antonio Ojea

Antonio Ojea

Software Engineer, Google
Antonio Ojea is a Software Engineer at Google, where he works on Kubernetes. He is one of the top contributors of the Kubernetes project, with a stronger presence on the areas of networking and reliability. He has a vast experience in Open Source, networking and distributed systems... Read More →
avatar for Nadia Pinaeva

Nadia Pinaeva

Senior Software Engineer, Red Hat
Nadia Pinaeva is a Senior Software Engineer at Red Hat working on Openshift Networking. She collaborates with the SIG-network-policy to improve network security for Kubernetes clusters, and works on ovn-kubernetes network plugin.
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

4:30pm MST

Experience in Designing & Implementing a Cloud Native Framework for Farm Data Analytics - Braulio Dumba, IBM & Gloire Rubambiza, Cornell University
Wednesday November 13, 2024 4:30pm - 5:05pm MST
This work is based on 17 months experience managing a digital agriculture platform that has aggregated and processed tens of gigabytes of data on 1500 cows on a commercial dairy farm. Significant challenges surfaced tied to multi-cluster management, fault-tolerance, and privacy as the number of applications and farm management models grew. To bridge this gap, we designed and implemented a cloud native networked system for multi-cluster configuration and management of farm data analytics that leverages KubeStellar and Software-Defined Farm paradigm. Our experience from designing, implementing and deploying this framework showcase how Kubernetes can enable farmers and agribusinesses to leverage the power of containerization and cloud-native computing to optimize workflows and streamline agricultural operations. This work presents progress towards cloud-native, scalable, and fault-tolerant data analytics in digital farming with potential environmental, financial, and societal impacts.
Speakers
avatar for Braulio Dumba

Braulio Dumba

Staff Research Scientist, IBM
Dr. Braulio Dumba is a Staff Research Scientist at IBM Research. In 2018, he joined IBM under the Hybrid Cloud organization. His current research is focus on edge computing and hybrid cloud computing. Dr. Dumba earned a Ph.D. in Computer Science from University of Minnesota, Twin... Read More →
avatar for Gloire Rubambiza

Gloire Rubambiza

Ph.D. Candidate, Cornell University
Gloire Rubambiza is a Ph.D. candidate in CS at Cornell University, where he conducts research in hybrid cloud computing for digital agriculture with an emphasis on societal impact. At Cornell, he was a University Fellow, a fellow of NSF National Research Traineeship in Digital Plant... Read More →
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 2 | 255 BC
  Emerging + Advanced

4:30pm MST

Perform Laser Focused Deployments by Deciding in Advance the Blast Radius - Kostis Kapelonis, Octopus deploy
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Progressive Delivery is an advanced deployment method that allows for zero-downtime application releases. Argo Rollouts is a Kubernetes controller that allows you to adopt progressive delivery in the form of blue/green and canary deployments. We see a lot of teams that choose an arbitrary number of clients that access the new version of a canary. Yes, it is very easy to send only 10% of the traffic to the new version of a Kubernetes deployment. But sometimes you want to choose WHICH 10% sees the new traffic. In this talk we will see several approaches on pinning down specific clients to the old or new version and advanced scenarios for sending canary traffic only to a specific subset of users such as internal employees or customers who have expressed their interest on seeing brand new releases as soon as possible.
Speakers
avatar for Kostis Kapelonis

Kostis Kapelonis

Developer Advocate, Codefresh by Octopus Deploy
Kostis is a software engineer/technical-writer dual class character. He lives and breathes automation, good testing practices and stress-free deployments with GitOps.
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 2 | 250
  SDLC

4:30pm MST

Expanding the Capabilities of Kubernetes Access Control - Jimmy Zelinskie, authzed & Lucas Käldström, Upbound
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Kubernetes RBAC is an effective way of managing ACLs in one cluster. However, there are many other effective paradigms out there, such as Attribute- & Relation-based Access Control. In this talk, we’ll demystify how these differ, and when to use respective paradigms, giving context and guidance. We’ll highlight how Kubernetes access control has recently evolved towards supporting lots of different use-cases. We take this opportunity to cover multiple perspectives: security within a single cluster (zooming in) and security within real-life production environments with external services and multiple clusters (zooming out). As containers became ubiquitous first with excellent tools like Docker, we believe the same can and will happen for access control, yielding uniform, interoperable and understandable authorization. Finally, we'll propose future work that could be done to supercharge Kubernetes and ensure it keeps up with the ever increasing security requirements in our industry.
Speakers
avatar for Lucas Käldström

Lucas Käldström

Senior Software Engineer, Upbound
Lucas is a Kubernetes and cloud native expert who has been serving the CNCF community in lead positions for 6 years. He’s awarded Top CNCF Ambassador 2017 with Sarah Novotny. Lucas was a co-lead for SIG Cluster Lifecycle, co-created kubeadm, Weave Ignite, and ported Kubernetes to... Read More →
avatar for Jimmy Zelinskie

Jimmy Zelinskie

Co-founder, authzed
Jimmy Zelinskie is a software engineer and product leader with a goal of democratizing software via open source development. He's currently CPO of authzed where he's focused on bringing hyperscaler best-practices in authorization to the industry at large. At CoreOS, he helped pioneer... Read More →
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 1 | 151
  Security

4:30pm MST

Tutorial: Get the Most Out of Your GPUs on Kubernetes with the GPU Operator - Eduardo Arango Gutierrez, Tariq Ibrahim, Amanda Moran & Christopher Desiniotis, NVIDIA; David Porter, Google
Wednesday November 13, 2024 4:30pm - 6:00pm MST
NVIDIA’s GPU operator has become the de-facto standard for managing GPUs in Kubernetes at scale. This tutorial provides in-depth, hands-on training on the various GPU sharing techniques that are possible with the GPU operator. Participants will learn to deploy jobs utilizing these sharing techniques, as well as get hands-on experience on the installation and configuration of the NVIDIA GPU Operator itself. This includes an in-depth exploration of its two primary CRDs: ClusterPolicy and NVIDIADriver. These CRDs are essential for configuring GPU-accelerated nodes, enabling GPU sharing mechanisms, and performing GPU driver upgrades. The session will culminate with practical use cases, such as training an AI/ML model and giving participants firsthand experience in managing a GPU-accelerated Kubernetes cluster.
Speakers
avatar for Christopher Desiniotis

Christopher Desiniotis

Senior Systems Software Engineer, NVIDIA
Christopher Desiniotis is a Senior Systems Software Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator, a widely used tool for managing GPUs in Kubernetes, and is focused on increasing... Read More →
avatar for David Porter

David Porter

Senior Software Engineer Google, Google
David Porter is a Senior Software Engineer at Google on the Kubernetes node team. David’s focus is on the kubelet node agent and the resource management area. He is primary maintainer of cAdvisor, a resource monitoring library widely used in kubernetes, reviewer of a SIG Node, and... Read More →
avatar for Eduardo Arango Gutierez DE

Eduardo Arango Gutierez DE

Senior systems software engineer, NVIDIA
Eduardo is a Senior Systems Software Engineer at NVIDIA, working on the Cloud Native Technologies team. Eduardo has focused on enabling users to build and deploy containers on distributed environments.
avatar for Tariq Ibrahim

Tariq Ibrahim

Senior Software Engineer, NVIDIA
Tariq Ibrahim is a Senior Cloud Platform Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator. He has also contributed to several cloud native OSS projects like kube-state-metrics, Istio... Read More →
avatar for Amanda Moran

Amanda Moran

https://www.nvidia.com/en-us/, NVIDIA
Amanda has been working in technology since graduating from SCU in 2012 with a Master’s in Science in CS. Prior to this she had graduated with an BS in Biology from UW. Amanda has worked the last 12 years as a Software Engineer, a Solutions Architect, and an Engineering Manager... Read More →
Wednesday November 13, 2024 4:30pm - 6:00pm MST
Salt Palace | Level 1 | Grand Ballroom ACE
  Tutorials, AI + ML

5:25pm MST

Detecting and Overcoming GPU Failures During ML Training - Sarah Belghiti, Wayve & Ganeshkumar Ashokavardhanan, Microsoft
Wednesday November 13, 2024 5:25pm - 6:00pm MST
Scaling ML training demands powerful GPU infrastructure, and as model sizes and training scale increases, GPU failures become an expensive risk. From outright hardware faults to subtle performance degradation, undetected GPU problems can sabotage training jobs, inflating costs and slowing development. This talk dives into GPU failure challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues, and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate GPU failure fallout. Drawing on cloud provider and autonomous vehicle company experience, we will share best practices for efficient identification, remediation, and prevention of GPU failures. We will also explore cutting-edge ideas like CRIU and task pre-emption for GPU workloads.
Speakers
avatar for Ganeshkumar Ashokavardhanan

Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine... Read More →
avatar for Sarah Belghiti

Sarah Belghiti

ML Platform Engineer, Wayve
Sarah Belghiti is an ML Platform Engineer at Wayve, a leading developer of embodied intelligence for autonomous vehicles. She works on the infrastructure, scheduling and monitoring of ML workloads. With GPUs becoming an increasingly scarce resource, her focus has been on building... Read More →
Wednesday November 13, 2024 5:25pm - 6:00pm MST
Salt Palace | Level 1 | 155 EF
  AI + ML

5:25pm MST

Production AI at Scale: Cloudera’s Journey in Building a Robust Inference Platform - Zoram Thanga & Peter Ableda, Cloudera
Wednesday November 13, 2024 5:25pm - 6:00pm MST
In this session, we talk about Cloudera AI Inference Service, a secure, large scale platform for generative AI and predictive inference workloads, built using state of the art Kubernetes, CNCF and Apache open source projects. We take the audience through our journey in building this platform and share the experiences we gained along the way. The platform is built using openness, security, scalability, performance and standards compliance as guiding principles. We demonstrate that it is possible to be open and secure at the same time, and that organizations can incorporate production grade AI inferencing into their Big Data environments. This session will cover the architecture of the platform, and explain how we handle performance, scaling, authentication, fine grained authorization and audit logging, all of which are critical considerations for production inferencing.
Speakers
avatar for Peter Ableda

Peter Ableda

Director, Product Management, Cloudera
Peter Ableda is the Director of Product Management for Cloudera’s AI product suite, bringing over a decade of experience in data management and advanced analytics. Holding a Master of Science degree in Computer Science from the Budapest University of Technology, Peter has dedicated... Read More →
avatar for Zoram Thanga

Zoram Thanga

Principal Engineer, Cloudera
Zoram is a Principal Engineer, Enterprise AI Platform in Cloudera. He has been working in the software industry for over 23 years, and has been involved in building clustering software, containers, file systems, analytical query engines, and ML/AI platforms. He is a committer in the... Read More →
Wednesday November 13, 2024 5:25pm - 6:00pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

5:25pm MST

Creating Paved Paths for Platform Engineers - Ritesh Patel, Nirmata; Abby Bangser, Syntasso; Viktor Farcic, Upbound; Nicholas Morey, Akuity; Praseeda Sathaye, Amazon
Wednesday November 13, 2024 5:25pm - 6:00pm MST
The platform engineering team's role has evolved into a pivotal one as the custodian of the internal developer platform. However, these teams often find themselves in a quagmire of identifying the right components to include in their platforms, particularly in the ever-expanding CNCF landscape. This panel session discusses these challenges by exploring the concept of 'Paved Paths' as a strategic approach to guide platform teams in their journey of building an internal developer platform (IDP). 'Paved Paths' offers a solution by providing platform engineering teams with proven reference architectures (e.g. CNOE and the BACK Stack). This approach prevents them from starting from scratch and getting lost in the vast CNCF landscape. By offering proven and opinionated reference architectures, platform teams can focus on enhancing developer experiences and optimizing higher-level workflows rather than grappling with the complexities of identifying foundational components for their IDP.
Speakers
avatar for Viktor Farcic

Viktor Farcic

Developer Advocate, Upbound
Viktor Farcic is a lead rapscallion at Upbound, a member of the CNCF Ambassadors, Google Developer Experts, CDF Ambassadors, and GitHub Stars groups, and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox.
avatar for Ritesh Patel

Ritesh Patel

Co-Founder & VP Product, Nirmata
Ritesh Patel is Co-founder and leads Products at Nirmata, the creators of Kyverno. At Nirmata, he is responsible for commercial products for Kubernetes security, governance, and automation. He also leads key technology partnerships. Ritesh has 20+ years of experience delivering enterprise... Read More →
avatar for Praseeda Sathaye

Praseeda Sathaye

Principal Specialist Solution Architect, Amazon (AWS)
Praseeda Sathaye is a Principal Specialist SA for App Modernization and Containers at Amazon Web Services based in Bay Area California. She has been focused on helping customers speed their cloud-native adoption journey by modernizing their platform infrastructure, internal architecture... Read More →
avatar for Nicholas Morey

Nicholas Morey

Senior Developer Advocate, Akuity
Nicholas Morey is a Platform Engineer with a passion for DevOps practices. He is on the team at Akuity as a Developer Advocate, working with the community on anything Argo and Kargo-related. He is an experienced Argo CD operator and a Certified Kubernetes Administrator.
avatar for Abby Bangser

Abby Bangser

Principal Engineer, Syntasso
Abby is a Principal Engineer at Syntasso delivering Kratix, an open-source cloud-native framework for building internal platforms on Kubernetes. Her keen interest in supporting internal development comes from over a decade of experience in consulting and product delivery roles across... Read More →
Wednesday November 13, 2024 5:25pm - 6:00pm MST
Salt Palace | Level 1 | Grand Ballroom BDF
  Platform Engineering

5:25pm MST

Taming Your Application’s Environments - Marcos Lilljedahl, Dagger & Mauricio "Salaboy" Salatino, Diagrid
Wednesday November 13, 2024 5:25pm - 6:00pm MST
How coupled are your applications code and pipelines to its target cloud or on-prem environment? Kubernetes helps us to abstract how we run our workloads. However, there are other aspects, like infrastructure dependencies, service configuration, build process, deployment descriptors, etc., which need to be considered to make an application portable across multiple environments. Focusing on these aspects make a big difference when migrating apps to reduce costs, meeting compliance requirements or leveraging a specific tech only available somewhere else. Join us to cover three techniques you can implement to level up your SDLC: - Modularizing and enhancing our delivery pipelines to simplify complex environments (Crossplane and Dagger) - Building consistent experiences around well-known interfaces (CloudEvents, Dapr, and OpenFeature) to minimize runtime drift. - Design with separation of concerns to enable fast feedback loops between development and operation teams (Argo CD, Knative)
Speakers
avatar for Marcos Lilljedahl

Marcos Lilljedahl

Software Engineer, Dagger
Dad, Docker Captain, OSS lover, helmsman and wine drinker. Father of a joyful kid and wannabe surfer. I like listening to jazz music and tinker with some fun projects when possible. Avid open source contributor.
avatar for Mauricio Salatino

Mauricio Salatino

OSS Software Engineer, Diagrid
Mauricio works as an Open Source Software Engineer at @Diagrid, contributing to and driving initiatives for the Dapr OSS project. Mauricio also serves as a Steering Committee member for the Knative Project and Co-Leading the Knative Functions initiative. He published a book titled... Read More →
Wednesday November 13, 2024 5:25pm - 6:00pm MST
Salt Palace | Level 2 | 250
  SDLC

5:25pm MST

From Observability to Enforcement: Lessons Learned Implementing eBPF Runtime Security - Anna Kapuścińska & Kornilios Kourtis, Isovalent
Wednesday November 13, 2024 5:25pm - 6:00pm MST
eBPF is getting widely adopted in cloud native runtime security tools like Falco, KubeArmor, and Tetragon. Using eBPF we can collect relevant security events right in the kernel and pass them to Security Engineers for retroactive attack detection and response. Having reliable and complete visibility is great, but wouldn't it be even better to proactively prevent attacks in progress? This talk covers the Tetragon team’s experience moving from security observability to enforcement and lessons learned along the way: from defining security models to hardening interactions between the local kernel and distributed Kubernetes systems. It will deep dive into how eBPF-based enforcement works, why it differs from observability, and the challenges of implementing it. The audience will walk away understanding the inner workings and common pitfalls of eBPF-based runtime security.
Speakers
avatar for Kornilios Kourtis

Kornilios Kourtis

Dr, Isovalent
I am a software engineer at Isovalent, working on cloud-native networking, security, and observability using eBPF. Before that, I worked in industrial (IBM) and academic research (ETH Zurich, NTU Athens) in systems, including operating systems, storage and network stacks, and high-performance... Read More →
avatar for Anna Kapuścińska

Anna Kapuścińska

Software Engineer, Isovalent, now part of Cisco
Anna is a software engineer at Isovalent, focusing on eBPF-based observability and security. Her previous roles span the industry: she wore both developer and SRE hats, and worked in AdTech, FinTech, public healthcare, end-user SaaS company and a hosting provider. On good weather... Read More →
Wednesday November 13, 2024 5:25pm - 6:00pm MST
Salt Palace | Level 1 | 151
  Security

6:00pm MST

🪧 Poster Session: 0.0.0.0 Day: Exploiting Localhost APIs from the Browser - Avi Lumelsky, Oligo
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Browser-based attacks are not new in the malicious landscape of attack patterns. Browsers remain a popular infiltration method for attackers.  While seemingly local, services running on localhost are accessible to the browser using a flaw we found, exposing the ports on the localhost network interface, and leaving the floodgates ajar to remote network attacks. In this live demo and attack simulation we’ll unveil a zero-day vulnerability (still under responsible disclosure) in Chrome and other browsers, and how we use the 0-day to attack developers behind firewalls. We will demonstrate remote code execution on a wildly popular open-source platform serving millions in the data engineering ecosystem, that seems to run on localhost. In our talk, we will present novel attack techniques, targeting developers and employees within an organization, that are behind firewalls. This will be a first-ever deep dive into this newly discovered zero-day vulnerability.
Speakers
avatar for Avi Lumelsky

Avi Lumelsky

AI Security Researcher, Oligo
Avi has a relentless curiosity about business, AI, security—and the places where all three connect. An experienced software engineer and architect, Avi’s cybersecurity skills were first honed in elite Israeli intelligence units. His work focuses on privacy in the age of AI and... Read More →
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
  🪧 Poster Sessions, Security

6:00pm MST

🪧 Poster Session: Accepting Mortality: Strategies for Ultra-Long Running Stateful Workloads in K8s - Sebastian Beyvers & Maria Hansen, Giessen University
Wednesday November 13, 2024 6:00pm - 8:00pm MST
"Pods are mortal" is a well-known quote in the official Kubernetes documentation. For ultra-long running stateful workloads that take months to complete, this mortality comes with its own challenges. How do you react to hardware failures? What resource quotas are appropriate? What if the workload has no built-in persistence and does all its work in memory? For such workloads, failures can be fatal, potentially wiping out months of work. This session will show that despite all the obstacles, Kubernetes can still be a reasonable choice for running stateful workloads that take months to complete. Using real-world examples based on production workflows, we will show how we design, configure, run, and operate such workloads using K8s and Argo workflows. We will also show how intelligent checkpointing using CRIU can help us deal with failures and enables us to avoid some problems even before they occur.
Speakers
avatar for Sebastian Beyvers

Sebastian Beyvers

Distributed Systems Researcher, Giessen University
Sebastian Beyvers is a distributed systems researcher in bioinformatics and a cloud-native Rust developer at Giessen University. Sebastian's current work focuses on cloud-native data storage and processing solutions that try to harmonize existing national and international data ecosystems... Read More →
avatar for Maria Hansen

Maria Hansen

Research Associate, Giessen University
Maria Hansen is a research assistant in the field of (bio)informatics at Justus Liebig University Giessen. She is currently working on a cloud-native data orchestration system that aims to unite existing national and international data ecosystems.
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session: Climatik: Cloud Native Sustainable LLM via Power Capping - Chen Wang, IBM & Vincent Hou, Bloomberg L.P.
Wednesday November 13, 2024 6:00pm - 8:00pm MST
As GenAI workloads grow, the need for advanced accelerators with higher power consumption is surging. NVIDIA GPU peak power has risen from 300W for V100 to 1000W for B100. However, current power infrastructure and cooling systems are not designed to handle rapid power increases, leading to challenges like limited accelerator deployment in some regions or overheating risks that could cause fire hazards. We propose Climatik, a dynamic power capping system that enables data center and cluster admins and developers to set power caps dynamically at the cluster, service namespace, and rack levels. Climatik leverages Kepler for observability and offers APIs for integration with Kubernetes control knobs, including autoscalers, schedulers, and queuing systems, to ensure power caps are maintained across all levels. We will demo how to use Climatik to configure power capping for a large language model (LLM) inference service on KServe and show how power capping influences KEDA on autoscaling.
Speakers
avatar for Chen Wang

Chen Wang

Senior Research Scientist, IBM
Chen Wang is a Staff Research Scientist at the IBM T.J. Watson Research Center. Her interests lie in Kubernetes, Container Cloud Resource Management, Cloud Native AI systems, and applying AI in Cloud system management. She is an open-source advocate, a Kubernetes contributor, and... Read More →
avatar for Vincent Hou

Vincent Hou

Senior Software Engineer, Bloomberg L.P.
Vincent Hou is a Chinese software engineer, who used to study in Belgium and is currently working in US. He has been an active open source contributor, since 2010. He used to be an active contributor to Cinder project, OpenStack block storage service, and a core committer of OpenWhisk... Read More →
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
  🪧 Poster Sessions, AI + ML

6:00pm MST

🪧 Poster Session: Kubernetes as a Geographically Distributed System - Ildiko Vancsa, Open Infrastructure Foundation
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Kubernetes was designed to be the best container orchestration platform on top of a cloud infrastructure in one data center. What do you do when you want to take your deployment and grow it in various geographical locations, but sill keep it as part of one system? You will have to face with complexity and figure out infrastructure management on a massive scale, and neither of these is easy to tackle. However, you don't have to go back to the drawing board, because the platform that delivers on requirements and expectations, already exists and it is called StarlingX. The StarlingX project is a fully integrated, open source cloud platform that is running in production at large telecom operators, who rely on its distributed cloud architecture along with next-level container orchestration support, which is provided by Kubernetes. This talk will introduce the StarlingX platform, share highlights from its latest release and show how it takes Kubernetes to the next level!
Speakers
avatar for Ildiko Vancsa

Ildiko Vancsa

Director of Community, Open Infrastructure Foundation
Ildikó is working for the Open Infrastructure Foundation as Director of Community. As part of her role, she is the Community Manager for StarlingX and Kata Containers, and a co-leader of the OpenInfra Edge Computing Group. Ildikó has been contributing to projects like OpenStack... Read More →
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session: Optimizing Pod Affinity in Kubernetes: A Mathematical Approach to Workload Placement - Jack Xue, Microsoft
Wednesday November 13, 2024 6:00pm - 8:00pm MST
A standout feature of Kubernetes is its sophisticated mechanism for pulling container images from repositories, aligning containers with the appropriate pods, and strategically deploying pods to nodes that meet their resource requirements—such as CPU, GPU, RAM, network, and storage. This process adheres to the defined affinity and anti-affinity specifications between pods and nodes. Despite these capabilities, the challenge of optimally arranging a multitude of workloads, each comprising several pods within a cluster, remains an ongoing endeavor. In our research, we illustrate that a set of YAML files, which detail a workload deployment request, can be systematically transformed into a Binary Integer Linear Programming (BILP) model. Depending on the specific optimization goals, the objective functions of the model can be tailored accordingly. With the imposition of broad conditions, it is feasible to derive an optimal solution that adheres to polynomial time complexity constraints.
Speakers
avatar for Jack Xue

Jack Xue

Principal Cloud Solution Architect, Microsoft
PhD & MBA. Principal Cloud Solution Architect, Microsoft
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session: Unleashing the Power of Init and Sidecar Containers in Kubernetes - Carlos Sanchez & Natalia Angulo, Adobe
Wednesday November 13, 2024 6:00pm - 8:00pm MST
This session dives deep into the power of init and sidecar containers, the issues they solve and why they are very useful when managing Kubernetes workloads. We will explore real-world use cases that show how these tools can: * Simplify complex deployments: Break down intricate deployments into manageable steps. * Enhance security: Isolate security critical tasks within your pods and ongoing security measures. * Facilitate rapid and isolated changes: when everyone is interested in updating the same service, separation of concerns is critical for rapid development. * Boost application functionality: Utilize sidecar containers to inject essential functionalities like logging, monitoring, and networking capabilities without modifying your main application code. Our goal is to share our experience and challenges managing thousands of environments in Kubernetes, how we manage init and sidecar containers and what problems they solve for us.
Speakers
avatar for Natalia Angulo

Natalia Angulo

Software Developer Engineer, Adobe
Natalia Angulo is a Software Development Engineer at Adobe Experience Manager, contributing to Site Reliability tasks and the development of new features inside AEM, and specially helping with their infrastructure management. She is passionate about maths, coding puzzles and teaching... Read More →
avatar for Carlos Sanchez

Carlos Sanchez

Principal Scientist, Adobe
Carlos Sanchez is a Principal Scientist at Adobe Experience Manager, specializing in software automation, from build tools to Continuous Delivery and Progressive Delivery. Involved in Open Source for over 20 years, he is the author of the Jenkins Kubernetes plugin and a member of... Read More →
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session: Unleashing the Power of Prediction to Proactively Scale Control Plane Components - Anubhav Aeron & Ryan Tay, Intuit
Wednesday November 13, 2024 6:00pm - 8:00pm MST
At Intuit, our control plane components such as IstioD are responsible for hundreds of applications per cluster. It is responsible for configuring data plane, as well as injecting the istio-proxy container. With an increase in application traffic, there is an increase in application pods, which results in the control plane to scale up. For critical control planes such as IstioD, it is wise to scale proactively, rather than as a reaction to increase in load. With traditional approaches, like tuning HPA thresholds, to scale in advance, we might pre scale even when not required due to outliers, which could be wasteful. At Intuit a novel deep learning forecasting model called N-HiTS was employed to solve this issue. This session will discuss and demo how we train N-HiTS, our most important model features, and how we deploy our service on a per-cluster basis to provide contextualized predictions for cost effective and on time auto-scaling.
Speakers
avatar for Anubhav Aeron

Anubhav Aeron

Staff SE, Intuit
Anubhav is a seasoned software engineer in the field of Cloud Native Technologies, and has been doing Kubernetes and Service Mesh since 2016. He developed Redis Cluster as a Service, and a Templating Engine while working at Yahoo! He is the lead maintainer of Admiral, which is an... Read More →
RT

Ryan Tay

Software Engineer, Intuit Inc.
As a software engineer on the Service Mesh team at Intuit, Ryan works to support Intuit's extensive Istio deployment through contributions to projects like Admiral. He has previously worked to reduce costs of cloud development environments for the Intuit API Gateway team. His main... Read More →
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session: What's Happening with SPIFFE and WIMSE? - Daniel Feldman, Qusaic
Wednesday November 13, 2024 6:00pm - 8:00pm MST
This session will be a very brief overview of what's going on with the SPIFFE and WIMSE identity standards projects. SPIFFE is a CNCF effort to standardize workload identity implementations. That is, a SPIFFE implementation can grant services unique identities and credentials. WIMSE is an IETF effort to build on the SPIFFE foundation. In particular, it adds a new, unique token format that allows securely recording multi-hop identity information. Implementors will be able to use this token format to build complete, end-to-end, cryptographically auditable identity records.
Speakers
avatar for Daniel Feldman

Daniel Feldman

Founder, Qusaic
Daniel Feldman has worked with many companies, large and small, to deploy SPIFFE and SPIRE zero-trust identity.
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
  🪧 Poster Sessions, Security
 
Thursday, November 14
 

11:00am MST

Harnessing the Power of Envoy Proxy for Building an LLM Gateway - Idit Levine, Solo.io
Thursday November 14, 2024 11:00am - 11:35am MST
As the demand for LLMs continues to soar, the need for secure, cost-conscious, and content-aware control over its usage is paramount. In this talk, we explore why Envoy Proxy is the optimal choice for building an LLM gateway, leveraging its unique architecture and capabilities. Unlike traditional proxies (e.g. NGINX), which rely on scripting languages for customization, Envoy Proxy stands out due to its extensibility features: filter architecture, callout architecture (ext-proc, ext-auth), and ability to dynamically load libraries. Combined with its high-performant, async core ( C++), Envoy can run as an ingress, egress and mesh gateway. We'll look at using Envoy proxy for LLM credential management, prompt guarding/decorting, analyzing content safety, usage controls, context-aware failover, and observability. Ideal for developers, architects, and tech enthusiasts looking to solve challenges around LLM usage and picking the right technologies for their platform infrastructure.
Speakers
avatar for Idit Levine

Idit Levine

Founder & CEO, Solo.io
Idit Levine is the founder and CEO of Solo.io, a company that creates open-source tools to assist enterprises in adopting and extending innovative cloud-native technologies while modernizing their existing IT investments. Solo.io is a top contributor to CNCF projects such as Envoy... Read More →
Thursday November 14, 2024 11:00am - 11:35am MST
Salt Palace | Level 1 | 155 EF
  Connectivity

11:00am MST

Cooperative Scheduling for Stateful Systems - Michael Youssef & Laxman Prabhu, LinkedIn
Thursday November 14, 2024 11:00am - 11:35am MST
At LinkedIn, we develop many stateful systems and run them on tens of thousands of machines in our datacenters. As we move LinkedIn’s infrastructure to Kubernetes, we quickly realized that StatefulSet was not going to be enough to support running critical stateful systems and satisfy the safety and durability goals of the teams developing stateful systems. We've built first-class support for running stateful workloads on bare metal where the stateful systems can coordinate with Kubernetes to stay available and ensure durability. With our design, we support planned/unplanned maintenance, swapping out hardware, and allow stateful systems to customize their rollout policies natively on Kubernetes. This talk covers: - Our LiStatefulSet API. - How we allow apps to customize safety checks and deployment policies via an ApplicationClusterManager, our pluggable policy engine. - The ApplicationClusterManager protocol that allows coordination of the lifecycle of workloads with Kubernetes.
Speakers
LP

Laxman Prabhu

Staff Software Engineer, Systems Infrastructure, LinkedIn
avatar for Michael Youssef

Michael Youssef

Staff Software Engineer, LinkedIn
Michael is a Staff Software Engineer at LinkedIn, currently making management and deployment of sharded systems a touch less painful on Kubernetes. In his free time he enjoys spending time with his cat, inhaling chocolate, and playing tennis.
Thursday November 14, 2024 11:00am - 11:35am MST
Salt Palace | Level 1 | Grand Ballroom GI
  Data Processing + Storage

11:00am MST

Kubernetes Workspaces: Enhancing Multi-Tenancy with Intelligent Apiserver Proxying - James Munnelly & Andrea Tosatto, Apple
Thursday November 14, 2024 11:00am - 11:35am MST
Multi-tenancy in Kubernetes means sacrificing essential features like cluster-scoped list/watches and multi-namespace/cluster-scoped RBAC. This often leads to additional complexity when configuring operators and forces discrepancies and friction with cluster-as-a-service type offerings. In this talk we will go through a demonstration of an intelligent Kubernetes apiserver proxy that introduces the concept of a ‘workspace’. Borrowing the name from the KCP project, a Workspace is a virtual apiserver endpoint that provides a ‘cluster-scoped’ view over a group of namespaces in a remote cluster. We’ll then go on to discuss optimisations and changes that we’d like to make within Kubernetes to better support apiserver proxying for multi-tiered caching, routing and scoping purposes.
Speakers
avatar for James Munnelly

James Munnelly

Staff Field Engineer, Apple
James Munnelly is a Field Engineer at Apple, helping customers adopt and adapt Kubernetes, and driving adoption of OSS cloud native technologies. James is also the founder of the cert-manager project, a Kubernetes extension for managing x509 certificates. He's an active member of... Read More →
avatar for Andrea Tosatto

Andrea Tosatto

Site Reliability Engineer, Apple
Andrea works at Apple as a Site Reliability Engineer. His day to day job consists in managing the lifecycle and ensuring the reliability of a multi-tenant compute platform built on top of Kubernetes. He is deeply passionate about multi-tenancy and any related topic, ranging from runtime... Read More →
Thursday November 14, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 255 EF
  Emerging + Advanced

11:00am MST

Navigating the Cgroup Transition: Bridging the Gap Between Kubernetes and User Expectations - Sohan Kunkerkar, Red Hat Inc
Thursday November 14, 2024 11:00am - 11:35am MST
As Kubernetes and container technologies evolve, shifting from cgroup v1 to cgroup v2 has become a pivotal development. With cgroup v2 available in Kubernetes since v1.25, we're at a crossroads where many users and organizations must decide when and how to transition fully to this new system. Despite the benefits of cgroup v2, including better resource management and enhanced capabilities, users frequently encounter unexpected challenges signaling a gap in readiness and understanding. This talk will address the practical implications of moving to cgroup v2, discuss the coordinated efforts to deprecate cgroup v1, and propose actionable strategies to bridge the gap between the Kubernetes community, system administrators, and developers. By focusing on real-world experiences and providing clear guidance, this session aims to equip you with the knowledge and tools to navigate this significant change confidently.
Speakers
avatar for Sohan Kunkerkar

Sohan Kunkerkar

Senior Software Engineer, Red Hat Inc
Sohan Kunkerkar is a Senior Software Engineer at Red Hat, bringing expertise in distributed systems, backend engineering, and containers. His active contributions extend to CRI-O, a container runtime engine, and various sub-projects within the Kubernetes Sig-Node community. Sohan... Read More →
Thursday November 14, 2024 11:00am - 11:35am MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

11:00am MST

How We Made OpenTelemetry Be Our Fitness Tracker for Your CI/CD Pipelines! - Nicolas Woerner, Clario & Andreas Grabner, Dynatrace
Thursday November 14, 2024 11:00am - 11:35am MST
CI/CD pipelines are the heartbeat of modern cloud-native software delivery. Healthy pipelines ensure rapid and continuous deployments every time code gets committed to the Git repositories! Every new repository and commit puts more load on the CI/CD tool making it more challenging to keep this crucial heartbeat healthy! In this session, engineers from Clario will demonstrate how they leverage OpenTelemetry to observe, validate, report and optimize their CI/CD pipelines, keeping their deployments healthy despite increased scale and unlocking the full potential of modern software delivery on Kubernetes with GitLab.
Speakers
avatar for Andi Grabner

Andi Grabner

CNCF Ambassador and DevRel, Dynatrace
Andreas Grabner (@grabnerandi) has 20+ years of experience as a software developer, tester and architect and is an advocate for high-performing cloud scale applications. He is a CNCF ambassador, contributor to the CNCF project keptn and a DevRel for Dynatrace. Andreas is also a regular... Read More →
avatar for Nicolas Woerner

Nicolas Woerner

Associate DevOps Engineer, Clario
Nicolas Wörner works in the Platform Engineering Team at Clario. With a background in software and DevOps engineering he focuses on continuously enhancing the software delivery workflow at Clario. Nicolas is passionate about leveraging CNCF software to drive efficiency and reliability... Read More →
Thursday November 14, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 250
  SDLC

11:55am MST

How to Move from Ingress to Gateway API with Minimal Hassle - Keith Mattix, Microsoft
Thursday November 14, 2024 11:55am - 12:30pm MST
For many, the Ingress resource was one of the first Kubernetes APIs they used, adding HTTP routing rules and SSL certs for cluster-external traffic. These APIs are used for production in clusters across the world today, configuring ingress gateways serving hundreds of thousands of connections per second. As of October 2023, the Ingress API has been superseded by the Gateway API, a new set of Kubernetes resources with over 20 implementations that enforces security best practices by design. However, migrating networking APIs is an intimidating task, and doing so safely is every company’s primary concern. Join this session to learn how to make this migration safe by identifying the best migration path, implementing Gateway API best practices, and utilizing community-supported migration tools such as ingress2gateway.
Speakers
avatar for Keith Mattix

Keith Mattix

Senior Software Engineering Lead, Microsoft
Keith Mattix is an Engineering Lead at Microsoft focused on Istio, Gateway API, and other networking projects.
Thursday November 14, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

11:55am MST

Database DevOps: CD for Stateful Applications - Stephen Atwell, Harness.io & Christopher Crow, Pure Storage
Thursday November 14, 2024 11:55am - 12:30pm MST
Running stateful applications on Kubernetes can provide many of the same advantages as stateless applications. In this talk, Stephen and Chris will share some thoughts on managing stateful applications as part of a CD Pipeline so that applications - and the application's data - can be versioned and deployed safely and repeatedly. This talk will discuss managing persistent data within kubernetes, as well as managing structural changes to a database as part of a CD process. With Kubernetes and liquibase, we can provide something better than before: A more testable, repeatable, and open way to deploy stateful applications. This talk features a practical demo of how CD tooling can empower users to automate data migrations within Kubernetes.
Speakers
avatar for Christopher Crow

Christopher Crow

Technical Marketing Engineer, Pure Storage
Chris Crow works as a cloud architect at Portworx. He has worked previously as an education, systems administrator. He is a lifelong open-source enthusiast.
avatar for Stephen Atwell

Stephen Atwell

Principal Product Manager, Harness.io
With over 26 years of technology experience, Stephen focuses on solving problems encountered in his previous roles. He has worn hats ranging from network administrator, to database administrator, to software engineer, to product manager. Outside of work, Stephen develops open source... Read More →
Thursday November 14, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | Grand Ballroom GI
  Data Processing + Storage

11:55am MST

Multi-Zone Clusters Inside and Out - Tom Dean, Buoyant
Thursday November 14, 2024 11:55am - 12:30pm MST
Multi-zone clusters are a great tool for improving application reliability — and also a great way to spend a ton of cash. Why? What really happens when you set these things up? How do you use them effectively without bankrupting your whole organization? In this session, we'll dig into the nuts and bolts of what goes on under the hood of a multi-zone cluster, including what a zone is, what Kubernetes understands about zones, how zones affect routing, and why multi-zone clusters can drive costs up. We'll spend some time on Kubernetes' Topology Aware Routing, covering its advantages as well as its very real limitations. Finally, we'll dive into how you can influence Kubernetes' choices to take advantage of multi-zone clusters' reliability while containing costs. Join us for learning and live demos!
Speakers
avatar for Tom Dean

Tom Dean

Field Engineer, Buoyant
Tom Dean started programming BASIC on Apple IIs over 40 years ago, and has been hooked on tech since then. A long-time user of Linux and Open Source, he has been expanding his Cloud, Cloud Native and adjacent subject matter knowledge to become a more well-rounded technologist, and... Read More →
Thursday November 14, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

11:55am MST

From Chaos to Calm: Building a Unified and Scalable CI/CD Pipeline at Akamai - Tomer Patel, Akamai Technologies Inc.
Thursday November 14, 2024 11:55am - 12:30pm MST
Are you struggling with a chaotic development process? Join Akamai's talk and discover how we built a unified and scalable CI/CD pipeline, saving 40% of our QA, Performance, Dev, and Ops daily work, and how you can do that in your organization! This session dives into the architecture, key features, and its impact on development efficiency. You will learn how to: - Conquer cloud-native deployments by adding the right tools - such as Argo Rollouts, and Backstage - Integrate CI/CD tools (ArgoCD, Jenkins, DevSpace, Grafana, Prometheus, Thanos) for a smoother workflow. - Leverage best-in-breed, cost-efficient open-source solutions
Speakers
avatar for Tomer Patel

Tomer Patel

Senior Engineering Manager, Akamai Technologies Inc.
Tomer currently works as Senior Engineering Manager at Akamai Technologies, where he leads a group of Data engineers, Software developers and DevOps at scale. Previously Tomer worked as Team Lead at Clarizen (Now Planview).
Thursday November 14, 2024 11:55am - 12:30pm MST
Salt Palace | Level 2 | 250
  SDLC

11:55am MST

What Agent to Trust with Your K8s: Falco, Tetragon or KubeAmor? - Henrik Rexed, Dynatrace
Thursday November 14, 2024 11:55am - 12:30pm MST
In the CNCF landscape we have plenty of ebpf based security solutions that help us protect our k8s cluster from runtime vulnerabilities. On paper though Falco, Tetragon and KubeArmor look very similar. Eventually you have to make a choice on which one best fits your needs. To give you additional insights to make your decision join this session. We have run extensive benchmarks against those three solutions and will answer the following questions that came out of our testing: - What are the different featuresets? - What about the performance impact of each agent? - Which privileges does each solution need? - What are the pros and cons across the three options?
Speakers
avatar for Henrik Rexed

Henrik Rexed

Cloud Native Advocate, Dynatrace
Henrik is a Cloud Native Advocate at Dynatrace, the leading Observability platform. Prior to Dynatrace, Henrik has worked more than 15 years, as Performance Engineer. Henrik Rexed Is Also one of the Organizer of the conferences named WOPR, KCD Austria and the owner of the Youtube... Read More →
Thursday November 14, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | 151
  Security

2:30pm MST

Unlocking Potential of Large Models in Production - Yuan Tang, Red Hat & Adam Tetelman, NVIDIA
Thursday November 14, 2024 2:30pm - 3:05pm MST
The recent paradigm shift from traditional ML to GenAI and LLMs has brought with it a new set of non-trivial LLMOps challenges around deployment, scaling, and operations that make building an inference platform to meet all business requirements an unsolved problem. This talk highlights these new challenges along with best-practices and solutions for building out large, scalable, and reliable inference platforms on top of cloud native technologies such as Kubernetes, Kubeflow, Kserve, and Knative. Which tools help effectively benchmark and assess the quality of an LLM? What type of storage and caching solutions enable quick auto-scaling and model downloads? How can you ensure your model is optimized for the specialized accelerators running in your cluster? How can A/B testing or rolling upgrades be accomplished with limited compute? What exactly do you monitor in an LLM? In this session we will use KServe as a case study to answer these questions and more.
Speakers
avatar for Yuan Tang

Yuan Tang

Principal Software Engineer, Red Hat
Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he has led AI infrastructure and platform teams at various companies. He holds leadership positions in open source projects, including Argo, Kubeflow, and Kubernetes. He's also a maintainer and... Read More →
avatar for Adam Tetelman

Adam Tetelman

Principal Product Architect, NVIDIA
Adam Tetelman is a principal architect at NVIDIA leading cloud native initiatives and CNCF engagements across the company; building inference platforms for NVIDIA AI Enterprise and DGX Cloud. He has degrees in computational robotics, computer & systems engineering, and cognitive science... Read More →
Thursday November 14, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

2:30pm MST

How the Tables Have Turned: Kubernetes Says Goodbye to Iptables - Casey Davenport, Tigera & Dan Winship, Red Hat
Thursday November 14, 2024 2:30pm - 3:05pm MST
For decades, iptables has been the preferred packet filtering system in the Linux kernel. Used extensively across the Kubernetes networking ecosystem, iptables is now on the way out and is expected to be removed from the next generation of Linux distributions. With iptables past its prime, where does that leave Kubernetes? The successor to iptables -- nftables -- is ready to carry the torch instead, with a newly released beta kube-proxy implementation in v1.31 and network policy using Calico’s nftables backend. In this talk, Dan and Casey will share what they have learned building Kubernetes Service and NetworkPolicy implementations using nftables. They will cover the history and current status of iptables usage in Kubernetes, the capabilities and performance characteristics of Kubernetes networks running on nftables, and why eBPF may not be the right tool for the job.
Speakers
avatar for Casey Davenport

Casey Davenport

Casey Davenport, Tigera
Casey is a core developer on Calico and has been building Kubernetes networking systems since 2016.
avatar for Dan Winship

Dan Winship

Senior Principal Software Engineer, Red Hat
Dan is a Tech Lead for Kubernetes SIG Network and has been working on Kubernetes and OpenShift networking for 7 years at Red Hat.
Thursday November 14, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

2:30pm MST

Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster - Yuichiro Ueno & Toru Komatsu, Preferred Networks, Inc.
Thursday November 14, 2024 2:30pm - 3:05pm MST
Today, storage technologies play a fundamental role in the realm of AI/ML. Read performance is essential for swiftly moving datasets from storage to AI accelerators. However, the rapid enhancement of AI accelerators' performance often outpaces I/O, bottlenecks the training. Due to the scheduling of pods in Kubernetes across multiple nodes, utilizing node-local storage effectively presents a challenge. To address this, we introduce a distributed cache system built atop node-local storages, designed for AI/ML workloads. This cache system has been successfully deployed on our on-premise 1024+ GPUs Kubernetes cluster within a multi-tenancy environment. Throughout our two-year experience operating this cache system, we have overcome numerous hurdles across several components, including the I/O library, load balancers, and the storage backend. We will share the challenges and the solutions we implemented, leading to a system delivering 50+ GB/s throughput and less than 2ms latency.
Speakers
avatar for Toru Komatsu

Toru Komatsu

Engineer, Preferred Networks, Inc.
Toru is a machine learning platform engineer at Preferred Networks in Japan. He is the creator and lead developer of youki, an OCI Runtime in Rust, and a maintainer of the OCI Runtime Specification. Additionally, he serves as a reviewer for runwasi and is involved in developing a world that utilizes containers and Wasm. Additionally, he is a member of the Kubernetes org and is especially interested in... Read More →
avatar for Yuichiro Ueno

Yuichiro Ueno

Engineer, Preferred Networks, Inc.
He is currently a machine learning platform engineer at Preferred Networks in Japan. His research and engineering interests include a range of high-performance computing (distributed deep learning, networking/RDMA, and storage technologies), performance engineering, and Kubernete... Read More →
Thursday November 14, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom GI
  Data Processing + Storage

2:30pm MST

Low-Overhead, Zero-Instrumentation, Continuous Profiling for OpenTelemetry - Christos Kalkanis, Elastic
Thursday November 14, 2024 2:30pm - 3:05pm MST
Elastic has recently donated its whole-system continuous profiling agent to OpenTelemetry. After a thorough community review process, the donation was enthusiastically accepted. Leveraging eBPF, the profiling agent provides unprecedented visibility into the runtime behavior of all applications: it builds stacktraces that go from the kernel to userspace native code, all the way into code running into higher level runtimes, enabling users to identify performance regressions, reduce wasteful computations, and debug complex issues faster. This session will explore: - Benefits of eBPF-based continuous profiling compared to conventional approaches that rely on application instrumentation - How the agent builds profiles that seamlessly span kernel, native code and most widely used application runtimes - Integration with the rest of OpenTelemetry: OTLP and Collector
Speakers
avatar for Christos Kalkanis

Christos Kalkanis

Principal Software Engineer, Elastic
Christos is the technical lead for the edge collection group at Elastic, a maintainer for the OpenTelemetry Profiling SIG and a co-author of the donated OpenTelemetry profiling agent previously known as the Elastic Universal Profiling agent. After more than a decade of focusing on... Read More →
Thursday November 14, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom HJ
  Observability

2:30pm MST

One Inventory to Rule Them All: Standardizing Multicluster Management - Corentin Debains, Google & Ryan Zhang, Microsoft
Thursday November 14, 2024 2:30pm - 3:05pm MST
Most Kubernetes users run more than one cluster, and some run hundreds or more. Crossing cluster boundaries has always been a challenge, because most Kubernetes APIs, tools, and operators are cluster-centric. In fact, there’s a remarkable lack of standard tools and patterns for multi-cluster. Over time users have found ways to stitch clusters together but the community has been asking for standardization.To share multi-cluster tools, Kubernetes sig-multicluster has introduced the “ClusterProfile” API, a critical building block for multi-cluster capabilities. This API provides a canonical way for multicluster controllers and users to iterate over clusters, and to install or manage multi-cluster features. In this talk, we will look at some of the problems inherent to multi-clustering, explain the concepts introduced by this new API and look at implementations and consumers of it.We dive into real life examples of patterns and usage, with products such as Kueue, ArgoCD, and Argo workflow.
Speakers
avatar for Ryan Zhang

Ryan Zhang

Principal Software Engineering Manager, Microsoft
Dr. Ryan Zhang is a Principal Software Engineering Manager at Microsoft, working on Azure Kubernetes Service Team. Ryan has been working on Cloud Native open source projects for the past few years including CloudEvents, Open Application Model (OAM) and multi-cluster related initi... Read More →
avatar for Corentin Debains

Corentin Debains

Software Engineer, Google
Corentin Debains is a software engineer at Google working on the GKE Fleet (multicluster platform). He is an active member of Kubernetes’ special interest group sig-multicluster.
Thursday November 14, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

2:30pm MST

Mastering Cell-Based Architecture: Practical Solutions and Best Practices - Shweta Vohra, Booking.com & Asanka Abeysinghe, WSO2
Thursday November 14, 2024 2:30pm - 3:05pm MST
Are you struggling to validate your cell boundaries or facing challenges with greenfield versus brownfield cell-based architectures (CBA)? Do you find it difficult to define enterprise-wide cell boundaries or wish there were best practices to guide you? If these pain points sound familiar, this session is tailored for you. In this talk, we will first guide you through the process of defining an enterprise-wide cell-based architecture for your organization or context. Then we will explore best practices for greenfield, brownfield, and hybrid cell implementations using CBA. By translating common user challenges into actionable implementation references, we aim to elevate your understanding of CBA with real-world use cases and best practices. This session will also cover best practices for the data, security, application, and infrastructure layers, ensuring a comprehensive approach to CBA implementation. Join us to take your knowledge of CBA to the next level!
Speakers
avatar for Asanka Abeysinghe

Asanka Abeysinghe

CTO, WSO2
Asanka, WSO2's CTO, is a technology visionary with over 20 years of experience designing and implementing scalable distributed systems, microservices, and business integration solutions. He advances WSO2's corporate reference architecture, collaborates with customers and industry... Read More →
avatar for Shweta Vohra

Shweta Vohra

Enterprise Architect, Booking.com
Shweta is an Enterprise Architect and a Cloud Navigator! 🚀 As a seasoned Architect with a vast toolkit in Cloud, Platforms, Data, and ML technologies. She has spent over two decades crafting solutions across various domains and complexity levels. She is a frequent conference speaker... Read More →
Thursday November 14, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 2 | 250
  SDLC

2:30pm MST

From Standards to Practice: The Journey to Container Maturity - Carmen Chow & Thomas Robinson, Yelp
Thursday November 14, 2024 2:30pm - 3:05pm MST
Yelp runs tens of thousands of Docker containers in Kubernetes. How do we track their vulnerabilities, baseline their security needs, and prioritize our most critical findings? Security standards change constantly, so we need a robust model of container maturity to guide our adoption of these standards in a way that addresses Yelp’s specific needs and risk tolerance. Finally, to maximize our model’s value, over 1,000 engineers must understand its practical guidance well enough to apply it to their daily work. This talk covers designing and incorporating a container maturity model into Yelp’s development lifecycle, along with our strategy for proactively improving our security posture. We believe our experiences will assist others in creating similar models that work for their organizations, help evaluate and assess risks to their own containers, and drive next steps towards future risk evaluation platforms.
Speakers
avatar for Carmen Chow

Carmen Chow

Software Engineer, Yelp
Carmen Chow is a Software Engineer on Yelp’s Infrastructure Security team, where she has worked on cost modeling, data lifecycle tools, and Kubernetes observability. Previously, she was an infrastructure developer responsible for containerizing services and migrating them to Kubernetes... Read More →
avatar for Thomas Robinson

Thomas Robinson

Software Engineer, Yelp
Tom is a software engineer living near Seattle, Washington. Having previously worked in security research and antivirus software, he's spent the last decade helping keep Yelp secure.
Thursday November 14, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | 151
  Security

3:25pm MST

Kubernetes Multi-Cluster Networking 101 - Niranjan Shankar, Microsoft & Ram Vennam, Solo.io
Thursday November 14, 2024 3:25pm - 4:00pm MST
You’ve (somewhat) grasped the networking model of a single Kubernetes cluster. But how do you enable Pods to communicate across clusters? How do service discovery and DNS work for a multi-cluster setup? How do you secure inter-cluster traffic and manage certificates? Not sure? Don’t worry - this session will have the answers. We’ll start by outlining the core requirements for workloads to communicate across clusters. You’ll then learn some common multi-cluster networking topologies, like flat and multi-network setups, and how inter-cluster connectivity and IP address management differ for each of them. Finally, we’ll cover some popular tools for managing and securing traffic between clusters, like service mesh, CNIs, and gateways, and discuss their use-cases. You’ll leave this session with a solid understanding of fundamental terms and concepts - like virtual networking peering, external DNS, trust domains, etc - needed for navigating the multi-cluster networking landscape.
Speakers
avatar for Ram Vennam

Ram Vennam

Solutions Engineer, Solo.io
Ram Vennam is the Director of Solutions Engineering at Solo.io where he helps companies design and build highly scalable, resilient, distributed systems with the latest cloud-native technology. Previously, he was at IBM where he was a Technical Product Manager and Developer Advocate... Read More →
avatar for Niranjan Shankar

Niranjan Shankar

Software Engineer, Microsoft
Niranjan Shankar is a software engineer at Microsoft working on the Istio-based service mesh add-on for Azure Kubernetes Service (AKS). He has experience with multi-cluster operations, edge traffic management and security, GitOps-based patterns, and policy enforcement with Kubernetes... Read More →
Thursday November 14, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

3:25pm MST

Elastic Data Streaming: Autoscaling Apache Kafka - Jakub Scholz, Red Hat
Thursday November 14, 2024 3:25pm - 4:00pm MST
Autoscaling is an important part of modern cloud-native architecture. It allows applications to handle a big load at peak times while helping to optimize costs and make deployments more green and sustainable at the same time. Apache Kafka is well known for its scalability. It can grow with your project from a small cluster up to hundreds of brokers. But it was not very elastic for a long time and using dynamic autoscaling with it was very hard. This talk will guide the attendees through the main challenges of auto-scaling Apache Kafka on Kubernetes. It will show how these challenges can be solved with the help of new features added recently in Strimzi and Apache Kafka projects such as auto-rebalancing, node pools, or tiered storage. And it will help the users get started with the auto-scaling of Apache Kafka.
Speakers
avatar for Jakub Scholz

Jakub Scholz

Senior Principal Software Engineer, Red Hat
Jakub works at Red Hat as Senior Principal Software Engineer. He has long-term experience with messaging and currently focuses mainly on Apache Kafka and its integration with Kubernetes. He is one of the maintainers of the Strimzi project which provides tooling for running Apache... Read More →
Thursday November 14, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | Grand Ballroom GI
  Data Processing + Storage

3:25pm MST

Load-Aware GPU Fractioning for LLM Inference on Kubernetes - Olivier Tardieu & Yue Zhu, IBM
Thursday November 14, 2024 3:25pm - 4:00pm MST
As the popularity of Large Language Models (LLMs) grows, LLM serving systems face challenges in efficiently utilizing GPUs on Kubernetes. In many cases, dedicating an entire GPU to a small or unpopular model is a waste, however understanding the relationship between request load and resource requirements has been difficult. This talk will study GPU compute and memory requirements for LLM inference servers, like vLLM, revealing an analytical relationship between key configuration parameters and performance metrics such as throughput and latency. This novel understanding makes it possible to decide at deployment time an optimal GPU fraction based on the model's characteristics and estimated load. We will demo an open-source controller capable of intercepting inference runtime deployments on Kubernetes to automatically replace requests for whole GPUs with fractional requests using MIG (Multi-Instance GPU) slices, increasing density hence LLM sustainability without sacrificing SLOs.
Speakers
avatar for Olivier Tardieu

Olivier Tardieu

Principal Research Scientist, Manager, IBM
Dr. Olivier Tardieu is a Principal Research Scientist and Manager at IBM T.J. Watson, NY, USA. He joined IBM Research in 2007. His current research focuses on cloud-related technologies, including Serverless Computing and Kubernetes, as well as their application to Machine Learning... Read More →
avatar for Yue Zhu

Yue Zhu

Research Scientist, IBM Research
Dr. Yue Zhu is a Research Scientist at IBM Research specializing in foundation model systems and distributed storage systems. Yue obtained a Ph.D. in Computer Science from Florida State University in 2021 and has consistently contribute to sustainability for foundation models and... Read More →
Thursday November 14, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 2 | 255 EF
  Emerging + Advanced

3:25pm MST

Measuring All the Costs with OpenCost Plugins - Alex Meijer, Stackwatch
Thursday November 14, 2024 3:25pm - 4:00pm MST
The CNCF OpenCost project is approaching 5,000 stars on GitHub and has become one of the most popular cost monitoring systems in use. Originally focused on cloud provider and Kubernetes cost monitoring, OpenCost expanded its scope in May 2024 by launching OpenCost Plugins with Datadog as the first reference implementation. These plugins allow users to measure and visualize virtually any cost in OpenCost, without writing a single line of OpenCost code. Alex Meijer, OpenCost and OpenCost Plugins maintainer, will speak on how the OpenCost Plugins ecosystem works and will dive into the use of the open-source FOCUS spec in OpenCost, which is the key to being able to measure nearly any cost. A plugin-enabled OpenCost deployment will be demoed, with an external cost (Datadog) visualized alongside the traditional Kubernetes and cloud provider costs. Alex will also share how to get started with plugins so that users can start analyzing the costs of whatever matters to their unique use case!
Speakers
avatar for Alex Meijer

Alex Meijer

Staff Software Engineer, Stackwatch
Alex Meijer has been working with Kubernetes for his entire career, being at various times a user, operator, and currently as someone working to help others use Kubernetes better. He has served in startups ranging in size from 5-90 people. Alex contributes to the Opencost project... Read More →
Thursday November 14, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | Grand Ballroom HJ
  Observability

3:25pm MST

From Chaos to Harmony, Transforming ML Engineering: A Kubernetes Adoption Journey
Thursday November 14, 2024 3:25pm - 4:00pm MST
How Ekstra Bladet’s Data Science team went from a small team of ML engineers, who needed to deliver quickly without deep technical infrastructure knowledge, to a rigid and proprietary ML pipeline built from AWS components and triggered by a large and chaotic Infrastructure as Code project. This made it difficult to achieve freedom and required a lot of work to implement and debug. One of the key reasons for adopting Kubernetes for our ML team emerged when we realized that we should serve all stakeholders across the JP/Politikens Hus organization, not just Ekstra Bladet. We then chose Kubernetes as our container infrastructure, which transformed the ML team into a dynamic ML ecosystem with great freedom under responsibility.

Initially, we focused on building robust frameworks for training and deploying ML models as API services and model training. Today, our ML team operates at the forefront of innovation, where we embrace GitOps principles to streamline our machine learning platform. Through careful adoption of advanced techniques such as autoscaling, scheduling, event triggers, and dynamic service deployment, we ensure seamless integration of new ML models into our infrastructure. This evolution has allowed us to effectively meet our diverse needs, while maintaining agility and scalability in our ML operations.
Speakers
avatar for Paris Nakita Kejser

Paris Nakita Kejser

Cloud Engineer, JP Politikens Hus
As a certified Cloud Engineer specializing in AWS and Kubernetes, I'm integral to Ekstra Bladet’s Data Science team. My focus lies in optimizing cloud infrastructure, integrating AWS and Kubernetes setups, and driving technological advancements. I contribute to Ekstra Bladet's digital... Read More →
Thursday November 14, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | Grand Ballroom BDF
  Platform Engineering

4:30pm MST

Microsegment Your Network Like Mastercard with AdminNetworkPolicy - John Zaiss, Mastercard & Surya Seetharaman, Red Hat
Thursday November 14, 2024 4:30pm - 5:05pm MST
Do you manage Kubernetes clusters and need to enforce airtight workload security on a cluster-wide level? This is vital in the Financial Services industry to comply with the PCI Data Security Standard. Mastercard was looking for a built-in Kubernetes solution enabling admins to govern network access between workloads at scale. While exploring different options, they found namespace-scoped NetworkPolicies but wanted to avoid duplicating policies for each namespace. When Kubernetes SIG-Network added AdminNetworkPolicies in v1.25, Mastercard found what they needed! In this session, we will introduce AdminNetworkPolicy and demonstrate applying granular, non-overridable network controls on a live cluster for multi-tenant isolation. Join us to learn how Mastercard is securing microservices in production based on the principle of least privilege and zero trust. We will also share our operational challenges and lessons learnt. Attendees will gain actionable strategies to secure clusters.
Speakers
avatar for John Zaiss

John Zaiss

Principal Software Engineer, Mastercard
As a Principal Engineer, John brings extensive expertise in Kubernetes, automation, cloud identity architecture, server architecture, VMware ESX, mobile device management, and IT strategy. He is a seasoned information technology professional with a BS in Cybersecurity and a MS in... Read More →
avatar for Surya Seetharaman

Surya Seetharaman

Principal Software Engineer, Red Hat Inc.
Surya is an Open Source advocate and contributor, active in the Kubernetes SIG-Network working group. She is working as a Principal Software Engineer at Red Hat in the OpenShift Networking team. Her areas of interest include Cloud Infrastructure and Networked Services and Systems... Read More →
Thursday November 14, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

4:30pm MST

Per-Node Api-Server Proxy: Expand the Cluster's Scale and Stability - Weizhou Lan & Iceber Gu, DaoCloud
Thursday November 14, 2024 4:30pm - 5:05pm MST
For lots of CNCF projects, kinds of daemonsets simultaneously synchronize datas from the Api-server from each node. Especially in large-scale clusters, it creates significant pressure on the Api-server, burdens the network, even affects the stability of the cluster. Some projects have implemented optimization to address this. For instance, Cilium aggregates endpoint information into the CRD CiliumEndpointSlice before distributing it to its daemonset. However, many projects have not yet adopted such data aggregation optimizations and Currently, there is still no project to help improve the communication between all components and the Api-server. ClusterPedia supports to launch per-node Api-server proxies to serve all local pods, and utilize eBPF to resolve the API server's clusterIP to the local proxy, which transparently implements API server access redirection on demand. In large-scale clusters, this can significantly improve the stability of all cluster's services.
Speakers
avatar for Iceber Gu

Iceber Gu

Software Engineer, DaoCloud
Senior open source enthusiast, focused on cloud runtime, multi-cloud and WASM. I am a CNCF Ambassador and founded Clusterpedia and promoted it as a CNCF Sandbox project. I also created KasmCloud to promote the integration of WASM with Kubernetes and contribute it to the WasmCloud... Read More →
avatar for Weizhou Lan

Weizhou Lan

Senior Tech Lead, Daocloud
Weizhou Lan, 13+ years of engineering experience, engaged in kubernetes since 2018. a senior tech lead at Daocloud focusing on private cloud, a speaker at KubeCon NA/EU and KCD China, a Program Committee Member for KubeCon, the initiator and maintainer of the CNCF sandbox project... Read More →
Thursday November 14, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

4:30pm MST

Mish-Mesh: Abusing the Service Mesh to Compromise Kubernetes Environments - Hillai Ben-Sasson & Nir Ohfeld, Wiz
Thursday November 14, 2024 4:30pm - 5:05pm MST
Service mesh solutions are common components in almost every large Kubernetes environment. Many engineers and security teams have adopted solutions like Linkerd and Istio to better segment and isolate their Kubernetes networks. In this talk, we will demonstrate how we were able to exploit common misconfigurations and insecure features in popular service mesh solutions, to escalate low-severity vulnerabilities to critical service takeovers. Our real-life examples include several major cloud service providers, where these vulnerabilities allowed us to gain unauthorized access to internal systems and sensitive secrets. This talk will help engineers understand whether their service mesh deployment acts as a proper security barrier, and how to make sure that it does. Security teams – both attackers and defenders – will learn new techniques for hacking Kubernetes environments, and how to properly defend against them.
Speakers
avatar for Nir Ohfeld

Nir Ohfeld

Security Researcher, Wiz
Nir Ohfeld is a 25-years-old senior security researcher at Wiz. Ohfeld focuses on cloud-related security research and specializes in research and exploitation of cloud service providers, web applications, application security, and in finding vulnerabilities in complex high-level systems... Read More →
Thursday November 14, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 1 | 151
  Security
 
Friday, November 15
 

11:00am MST

Better Together! GPU, TPU and NIC Topological Alignment with DRA - John Belamaric, Google & Patrick Ohly, Intel
Friday November 15, 2024 11:00am - 11:35am MST
AI/ML workloads on Kubernetes demand ultra-high performance. If your training or multi-GPU inference job spans nodes, your GPUs will use the network, talking through a NIC over local PCIe. But not all NICs are equal! To get the best performance, you need a NIC which is as "close" to the GPU as possible. Unfortunately, the Kubernetes extended resources API does not have enough information and does not give you control over which specific devices are assigned. Dynamic Resource Allocation, the successor API, gives you this power. Come to this session to learn about DRA, how it is improving overall device support in K8s, and how to use it to allocate multiple GPUs, NICs, and TPUs to get the maximum performance out of your infrastructure.
Speakers
avatar for Patrick Ohly

Patrick Ohly

Principal Engineer, Intel
Patrick Ohly is a software engineer at Intel GmbH, Germany. In the past he has worked on performance analysis software for HPC clusters ("Intel Trace Analyzer and Collector") and cluster technology in general (PTP and hardware time stamping). Since January 2009 he has worked for Intel... Read More →
avatar for John Belamaric

John Belamaric

Senior Staff Software Engineer, Google
John is a Sr Staff SWE, co-chair of K8s SIG Architecture and of K8s WG Device Management, helping lead efforts to improve how GPUs, TPUs, NICs and other devices are selected, shared, and configured in Kubernetes. He is also co-founder of Nephio, an LF project for K8s-based automation... Read More →
Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 250
  AI + ML

11:00am MST

Securing Outgoing Traffic: Building a Powerful Internet Egress Gateway for Reliable Connectivity - Edie Yang & Akshita Agarwal, Airbnb
Friday November 15, 2024 11:00am - 11:35am MST
Concerned about secure and reliable outgoing traffic from your organization's mesh network? With the increasing demand to use external vendor apis for LLMs, along with vulnerabilities like Log4j, the need for preventing data exfiltration and maintaining strong safeguards is critical. But managing access to multiple external domains within the service mesh can be daunting. Discover the secrets behind building a powerful Internet Egress gateway using Istio and Envoy. This enlightening talk unveils a way to define fine-grained access policy to monitor and audit outgoing traffic from your mesh network. Besides, it demonstrates how to build a generic multi-tenant gateway that can be used across heterogeneous services and save years of repeated engineering work. By the end of the talk, attendees will gain an understanding of what an Internet Egress Gateway is, why it is necessary, and how they can configure it for their own services using the open-source Istio/Envoy based solution.
Speakers
avatar for Akshita

Akshita

Senior Software Engineer, Airbnb
Akshita is a Senior Software Engineer at Airbnb working in the Service Mesh team which the handles interservice networking at scale. She currently is focused on designing a secure network edge solution at Airbnb. Previously she worked at Microsoft developing the Nginx Load Balancer... Read More →
avatar for Edie Yang

Edie Yang

Senior Software Engineer, Airbnb
Edie is a Senior Software Engineer at Airbnb on the Cloud Infrastructure team which develops the Service Mesh system that powers the entire Airbnb stack. Edie has been working on developing service mesh API, service migration automation, Google IAP-based ingress gateway and internet... Read More →
Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 1 | 155 EF
  Connectivity

11:00am MST

How We Scale a Distributed SQL Database to 1 PB - Sam Dillard, PingCAP
Friday November 15, 2024 11:00am - 11:35am MST
TiDB is a distributed SQL database that we built to solve the scalability problems of traditional SQL databases such as MySQL and PostgreSQL. Using TiDB, users do not need to shard their data across multiple MySQL or PostgreSQL database instances, nor do they need to sacrifice some key database features such as JOIN and transactions. Users only need to add storage nodes and computing nodes to the cluster as needed. However, we also encountered many scalability challenges when building TiKV - the stateful storage layer of TiDB. Challenges such as workload skew issues making it difficult to scale performance, management challenges of millions of dynamic data partitions, latency impact during scaling, interference between different workloads when consolidating multiple workloads into the same cluster, etc. In this talk, I will provide an in-depth look at these challenges and our solutions.
Speakers
avatar for Sam Dillard

Sam Dillard

Principle Engineer, PingCAP
Principal Engineer at PingCAP, TiKV maintainer and committer, RocksDB contributor, the author of "MariaDB Principles and Implementation". Mainly engaged in the design and development of cloud-native large-scale distributed storage systems, data platforms, 10+ years of experience in... Read More →
Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 1 | Grand Ballroom GI
  Data Processing + Storage

11:00am MST

Upgrade Safely: Avoid the Pitfalls of Kubernetes Versioning - Rob Scott, Google
Friday November 15, 2024 11:00am - 11:35am MST
Have you ever upgraded a cluster or controller only to realize everything was broken due to some kind of versioning mismatch? Do you remember the pain of upgrading to a new Kubernetes API version like Ingress v1? Do you get a little twinge any time you see a feature or API deprecated in release notes? This is the talk for you. Kubernetes versioning is surprisingly complex and widely misunderstood. This talk will cover all the relevant versioning concepts, from storage versions to feature gates. It will show how they interact with each other, and how you can use this information to safely and confidently upgrade your clusters and controllers. This talk will provide real examples of how versioning mixups can lead to broken clusters and downtime. You’ll learn exactly how you can avoid each of these potential failure modes, and gain some insights into how API and Controller authors are trying to minimize the impact of these kinds of changes in the future.
Speakers
avatar for Rob Scott

Rob Scott

Software Engineer, Google
Rob is an open source enthusiast currently working on Kubernetes Networking at Google. He's been a maintainer of Gateway API since the very early days of the project and led the development of other Kubernetes networking APIs like EndpointSlices.
Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 255 BC
  Operations + Performance

11:00am MST

Still Don't Do What Charlie Don't Does - Making CRD Changes Safer - Nick Young, Isovalent
Friday November 15, 2024 11:00am - 11:35am MST
Many Kubernetes installations use controllers that include Custom Resource Definitions (CRDs) to extend their capabilities. However, because CRDs can only have one version installed in a cluster at any one time, version and change management can be very difficult. This talk will benefit both controller implementers and users. For implementers, I have tips on how to more safely make API changes to their CRDs, and for CRD users, some tips on what to look out for when installing CRD updates. All of this is based on using experience from projects like Contour, Gateway API, and Cilium among others. Learn things like: Different CRD version management strategies - what’s worked and what hasn’t How to make schema changes like pluralizing a field or changing field validation in a safe way How not to make the same mistakes I did Expect to come away from this talk having learned from my painful experiences handling CRD changes badly, but also having heard a bunch of Simpsons references.
Speakers
avatar for Nick Young

Nick Young

Senior Software Engineer, Isovalent at Cisco
Nick has been working to prevent the entropic downfall of systems for 25 years, across datacenters, clouds, networking, and others. He's a Staff Engineer at Isovalent, and a maintainer on the Kubernetes Gateway API project, where he works on improving the ingress and mesh experiences... Read More →
Friday November 15, 2024 11:00am - 11:35am MST
Salt Palace | Level 2 | 251
  Platform Engineering

11:55am MST

Improving Service Availability: Scaling Ahead with Machine Learning for HPA Optimization - Avni Sharma & Estela Ramirez, Intuit
Friday November 15, 2024 11:55am - 12:30pm MST
In this talk, we will explore employing machine learning (ML) algorithms to enhance the Kubernetes autoscaling capabilities beyond the traditional, reactive horizontal pod autoscaler (HPA). Attendees will be introduced to how to leverage recommendation algorithms to predict future load and usage patterns, allowing for smarter, proactive scaling decisions. This approach not only ensures high availability and responsiveness of applications but also offers a pathway to substantial cost optimizations by preventing over-provisioning and minimizing resource wastage.
Speakers
avatar for Avni Sharma

Avni Sharma

Product Manager, Intuit
Avni is a Product Manager at Intuit, working on Intuit’s Modern SaaS Kubernetes platform. She also worked on ArgoCD as a PM. Avni is passionate about Developer tooling and strives to make developers' life easier by delivering them delightful experiences. She is also an Open Source... Read More →
avatar for Estela Ramirez

Estela Ramirez

Software Engineer, Intuit Kubernetes Service, Intuit
Estela is a Software Engineer at Intuit focusing on Intuit Kubernetes Developer Platform. She works on abstracting the autoscaling for developers.
Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

11:55am MST

Seeing Double? Implementing Multicast with eBPF and Cilium - Louis DeLosSantos, Isovalent at Cisco
Friday November 15, 2024 11:55am - 12:30pm MST
Multicast is a popular networking technology used in finance, telecommunications, and media CDNs, among others to efficiently replicate and deliver data streams to multiple clients. However, this advantage can be overshadowed by the complexity involved in configuring the necessary infrastructure leaving the overworked platform team rather than the end users seeing double. To combat this complexity, Cilium explored using eBPF to implement pod-to-pod multicast delivery within a Kubernetes cluster. This talk will provide both a high and low level understanding of how eBPF can be used to implement multicast delivery. It will discuss how Cilium’s multicast works and the hurdles faced by the project along the way. By the end of this talk the audience will have a better understanding of how multicast functions, how eBPF can be used in-place of traditional multicast infrastructure, and how Cilium can be used as a multicast-enabled CNI, letting your audience - and not you- see double.
Speakers
avatar for Louis De Los Santos

Louis De Los Santos

Louis DeLosSantos, Isovalent at Cisco
Louis DeLosSantos is a multi-disciplined technologist who has worn network, systems, and software engineer hats at various times. Presently he works at Isovalent at Cisco where he focuses on Linux Kernel networking and implementing eBPF datapath networking solutions.
Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

11:55am MST

Kubernetes on Multisites – A Story About Stateful App, Hybrid Clouds, and High Availability - Florian Coulombel, Dell Technologies & Jan Šafránek, Red Hat
Friday November 15, 2024 11:55am - 12:30pm MST
The day has come! Kubernetes has won the hearts and minds of your leadership and entire organizations, and everyone wants to benefit. Projects are launched to migrate legacy apps, run proprietary systems, and even use virtual machines in your Kubernetes infrastructure! But wait a minute. VMs and good' ol RDBMS are not microservices developed with 12 factors in mind where data is either hosted on an external service or replicated by the application. How are we going to warranty the availability of these applications and systems? Do I need to do a backup of these things? What if my business is fragmented across edge, on-prem, and public clouds? Members from SIG Storage will guide you through the options to compose with, including the latest CSI features, Kubernetes architecture design, and even hardware solutions. We will evaluate the benefits to consider and the pitfalls to avoid when implementing stateful workloads in Kubernetes on multiple sites.
Speakers
avatar for Jan

Jan

Software Engineer, Red Hat
Jan is a Senior Principal Software Engineer at Red Hat working on storage aspects of Kubernetes. He started developing Kubernetes more than 8 years ago, and is one of the founding members of SIG-Storage. He’s the author of PersistentVolume controller, dynamic provisioning and StorageClass... Read More →
avatar for Florian Coulombel

Florian Coulombel

Senior Software Engineer, Dell Technologies
Father of 2, living in France. Nerd since 1996 when Quake alpha version leaked, Linux user since 2001, Kubernetes enthusiast since 2016, member of Kubernetes SIG Storage since 2023.
Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | Grand Ballroom GI
  Data Processing + Storage

11:55am MST

Strategies for Mitigating Performance Interference in Cloud-Native Systems - Jonathan Perry, Startup
Friday November 15, 2024 11:55am - 12:30pm MST
In cloud-native environments, application performance often degrades due to contention over shared resources such as CPU caches and memory bandwidth. Current container technologies lack mechanisms to isolate these resources, which compels operators to maintain low utilization by scaling out their deployments. This session explores strategies used by hyperscalers like Google, Microsoft, Facebook, and Alibaba to mitigate such performance interference. We will review their published methodologies, extracting key principles that could guide the development of a Kubernetes-native performance isolator. Participants will gain insights into the design trade-offs and operational impacts of these tools. Additionally, we will discuss integration strategies for deploying such isolators in existing Kubernetes environments, aiming to optimize resource utilization while preserving application performance.
Speakers
avatar for Jonathan Perry

Jonathan Perry

Founder, State-fu
Jonathan Perry is a maintainer of the OpenTelemetry eBPF network collector. His PhD research at MIT CSAIL focused on performance isolation in datacenter and cloud networks, aiming to enhance network efficiency and reduce latency. Jonathan founded Flowmill, where he developed eBPF-based... Read More →
Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

11:55am MST

What Containerd 2.0 Means for You - Samuel Karp, Google
Friday November 15, 2024 11:55am - 12:30pm MST
containerd 2.0 is the first major new version of containerd since 1.0.0 was released in 2017. This new version of containerd introduces new features, new extension points, and new backends for image operations and CRI with the goal of increased flexibility and better efficiency for certain types of workloads. containerd 2.0 also removes some previously-deprecated features in favor of modern replacements. This talk will discuss how to prepare for containerd 2.0 in your production environments, including strategies for incorporating containerd 2.0's new functionality and detecting/remediating any impact of removed features prior to upgrading.
Speakers
avatar for Samuel Karp

Samuel Karp

Staff Software Engineer, Google
Samuel Karp is a containerd maintainer and a Staff Software Engineer at Google, focused on the container runtime for Google Kubernetes Engine. Sam has been involved in the container ecosystem since 2014 and serves as the Chair of the Open Container Initiative's Technical Oversight... Read More →
Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 2 | 255 BC
  Operations + Performance

11:55am MST

Share the Ride: Robust Multi-Tenancy in Kubernetes at Uber - Sashank Appireddy & Apoorva Jindal, Uber
Friday November 15, 2024 11:55am - 12:30pm MST
Multi-tenancy in Kubernetes involves the coexistence of multiple users or teams (tenants) on a single Kubernetes cluster while ensuring isolation, security, and performance. Our use cases at Uber span from scenarios with disruptive neighbors to those with large container sizes, specialized hardware, sticky placement preferences, and dynamic resource scaling demands, necessitating robust isolation measures. In this proposal, we present a comprehensive exploration of multi-tenancy in Kubernetes, covering strategies, the challenges we have faced and the effective solutions implemented to overcome them at Uber. Further, we will deep dive into the key aspects of building and managing multi-tenant Kubernetes clusters, by establishing strong tenant boundaries leveraging the ideas around node pools and tightly integrating with namespaces.
Speakers
avatar for Apoorva Jindal

Apoorva Jindal

Senior Staff Software Engineer, Uber Inc
Apoorva Jindal is working as Senior Staff Software Engineer at Uber. At Uber, he leads the Compute platform which powers all stateless and batch containerized workloads at Uber.
avatar for Sashank Reddy

Sashank Reddy

Staff Software Engineer, Uber Technologies Inc
I am software engineer with over a decade of experience specializing in containerization and distributed systems. As a Staff Software Engineer in the container platform team at Uber Technologies Inc, I lead the design, development and deployment of scalable multi-tenant architecture... Read More →
Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 2 | 251
  Platform Engineering

11:55am MST

Rogue No More: Securing Kubernetes with Node-Specific Restrictions - Anish Ramasekar, Microsoft & James Munnelly, Apple
Friday November 15, 2024 11:55am - 12:30pm MST
Did you know that a component running across multiple nodes, such as in a daemonset, intended to perform node-specific actions, can pose a significant security risk? If any node the component is running on goes rogue, it can lead to attacks on the cluster, or even worse, a complete takeover of it. What if we could restrict the component's ability to write resources only to those belonging to the node it is running on to prevent such escalation attacks? In this talk, Anish and James will introduce new Kubernetes security enhancements to bound service account tokens, which can be used with validating admission policies to enforce per-node restrictions on service accounts. This session will provide you with practical implementation guidelines and show you how these enhancements can mitigate risks and protect your infrastructure with robust node isolation.
Speakers
avatar for James Munnelly

James Munnelly

Staff Field Engineer, Apple
James Munnelly is a Field Engineer at Apple, helping customers adopt and adapt Kubernetes, and driving adoption of OSS cloud native technologies. James is also the founder of the cert-manager project, a Kubernetes extension for managing x509 certificates. He's an active member of... Read More →
avatar for Anish Ramasekar

Anish Ramasekar

Principal Software Engineer, Microsoft
Anish Ramasekar is a software engineer at Microsoft. He is on the Azure Container Upstream team building features for Kubernetes upstream and various CNCF projects that are part of the Azure Kubernetes Service. Anish is a maintainer of the Secrets Store CSI Driver project.
Friday November 15, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | 151
  Security

2:00pm MST

Bloomberg’s Journey to Improve Resource Utilization in a Multi-Cluster Platform - Yao Weng & Leon Zhou, Bloomberg
Friday November 15, 2024 2:00pm - 2:35pm MST
Bloomberg provides an on-premises Data Science Platform (DSP) using cloud-native software to support internal AI model training. It runs on Kubernetes clusters spanning multiple data centers and featuring a diverse range of GPU types. However, managing such a large-scale and heterogeneous GPU environment poses many challenges, such as improving resource utilization, reducing operational costs, and scheduling workloads across different GPU types. In collaboration with the Karmada community, Bloomberg's DSP team has aimed to tackle these challenges by addressing multi-cluster batch job management problems. This talk will delve into the approaches the team has adopted, including: - Intelligently scheduling GPU workloads across multiple clusters - Using Karmada's resource interpreter to support Kubernetes Custom Resource Definitions (CRDs) on top of a multi-cluster architecture - Building a highly available Karmada control plane - Establishing a consistent training job submission interface
Speakers
avatar for Leon Zhou

Leon Zhou

Software Engineer, Bloomberg
Leon Zhou is a software engineer on the Data Science Platform engineering team at Bloomberg. With prior NLP experience, he is now building ML platforms to facilitate machine learning development. He is interested in ML infrastructure to enable large-scale training and complex pipelines... Read More →
avatar for Yao Weng

Yao Weng

Senior Software Engineer, Bloomberg
Yao Weng is a Senior Software Engineer on Bloomberg’s Data Science Platform engineering team. She has contributed extensively to optimizing the company’s Kubernetes environment for high performance compute, model inference, and workflow orchestration. Yao Weng obtained her Ph.D... Read More →
Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 2 | 250
  AI + ML

2:00pm MST

Testing Kubernetes Without Kubernetes: A Networking Deep Dive - John Howard, Solo.io
Friday November 15, 2024 2:00pm - 2:35pm MST
There are few things more tedious than waiting for a long end-to-end test to run. Waiting for a new cluster to spin up, images to build and push - not to mention things like debugging or running on slow internet connections. Unfortunately, these complex setups are hard to avoid, especially if we are testing things deeply integrated into Kubernetes networking, such as CNIs, kube-proxy, services meshes, and more. It doesn't have to be this way! In this talk, I will give a deep dive on how we built out our testing strategy for our Kubernetes networking proxy to not really depend on Kubernetes (or docker, or root). In doing so, I will not only offer a glimpse behind the scenes of Istio development, but also give viewers a deeper understand of how the fundamentals of Kubernetes (Linux primitives like namespaces) work, and how they can be effectively used to improve tests in the Istio ecosystem and beyond.
Speakers
avatar for John Howard

John Howard

John Howard, Solo.io
John Howard is a Senior Architect at Solo.io and Istio Technical Oversight Committee member.
Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

2:00pm MST

Object Storage Is All You Need - Justin Cormack, Docker
Friday November 15, 2024 2:00pm - 2:35pm MST
When Jeff Bezos commissioned Amazon S3 he called it "malloc for the web"; since then many people have considered cloud object storage to be a weird kind of non Posix filesystem, but also a great backing store for websites or storing lots of data. Recently more and more applications are being built with object storage as the entire persistence layer. This started with analytics databases such as Snowflake and Databricks, and the open source Delta Lake and Apache Iceberg projects. More recently the use is spreading to even more applications, from observability to streaming data and more. In this talk we look at why it is becoming so popular, the benefits, downsides and performance characteristics, and how and when to use it effectively.
Speakers
avatar for Justin Cormack

Justin Cormack

CTO, Docker
Justin is the CTO of Docker, recently a member of the CNCF TOC, and has been working in the container ecosystem and in supply chain security for many years.
Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 1 | Grand Ballroom GI
  Data Processing + Storage

2:00pm MST

Faster Containerized LLM Serving via Knowledge Sharing - Junchen Jiang, University of Chicago & Zhou Sun, Mooncake Labs
Friday November 15, 2024 2:00pm - 2:35pm MST
Imagine once an LLM learns something from a document, the knowledge can be instantly shared with other LLMs. Unfortunately, today, LLMs must read the same document multiple times, causing a significant slowdown. This session will introduce a new KNOWLEDGE-SHARING system that enables LLMs to share their digested knowledge, in the form of KV caches, so only one LLM needs to process each document. The key challenge is how to store the KV caches cheaply and serve them quickly. Instead of keeping the KV caches of all reusable chunks in GPU/CPU memory, we show a DEMO that with careful implementation on Kubernetes, storing them on cheaper devices is not only economically superior but also delivers significant reductions in LLM serving delay, especially the time to the first token.
Speakers
avatar for Junchen Jiang

Junchen Jiang

Professor, University of Chicago
Junchen Jiang is an Assistant Professor of Computer Science at the University of Chicago. He works at the intersections between networked systems and machine learning. He received his Ph.D. from CMU in 2017 and his bachelor’s degree from Tsinghua in 2011. He has received a Google... Read More →
avatar for Zhou Sun

Zhou Sun

CEO, Mooncake Labs
Mooncake Labs is working on the next generation of stateless data architecture, bringing database performance and functionality to structured and unstructured data in datalakes and raw datasets. Previous I lead the query team at SingleStore (cloud-native distributed HTAP database... Read More →
Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 2 | 255 EF
  Emerging + Advanced

2:00pm MST

Supercharge Your Kubernetes Autoscaling with Custom Metrics - Vamshi Krishna Samudrala & Sravan Akinapally, American Airlines
Friday November 15, 2024 2:00pm - 2:35pm MST
Out-of-the-box, Kubernetes provides native horizontal scaling capabilities driven by conventional resource consumption signals like CPU and memory utilization. However, in the real world, numerous applications demand dynamic scaling orchestrated by custom business telemetry such as queue depths, throughput volumes, or other domain-specific indicators. This session will unravel the secrets of extending Kubernetes' Horizontal Pod Autoscaler (HPA) to leverage custom metrics as scaling triggers, unlocking unprecedented scaling autonomy. Attendees will witness live demos showcasing: Deploying a custom metrics provider to expose application-centric metrics to the Kubernetes control plane Configuring the HPA to consume these custom metrics for intelligent scaling decisions A sample application dynamically scaling based on a custom metric like queue length or requests per second Best practices for crafting bespoke scaling policies tailored to custom metrics.
Speakers
avatar for Vamshi krishna Samudrala

Vamshi krishna Samudrala

Enterprise Cloud Architect, American Airlines
Enterprise Architect with a distinguished career spanning 14 years in the fields of DevOps and Cloud Architecture. Focused on automation, configuration management and innovation with cutting-edge technologies.Worked extensively with leading cloud service providers, including Amazon... Read More →
avatar for Sravan Akinapally

Sravan Akinapally

Product Tech Lead, American Airlines
Product Tech Lead
Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

2:00pm MST

Micro-Segmentation and Multi-Tenancy: The Brown M&Ms of Platform Engineering - Jim Bugwadia, Nirmata & Rachael Wonnacott, Fidelity International
Friday November 15, 2024 2:00pm - 2:35pm MST
A key requirement for internal developer platforms is that they serve multiple workloads. The reality of platform engineering is that while it seeks to lower the barrier to entry for teams to deliver applications, it must also balance cost and ensure appropriate levels of security. It’s therefore essential to consider how application components running on shared infrastructure are allowed to communicate with each other and weigh up the cost of each architecture. In industry, we have seen differing approaches to deploying Kubernetes to achieve these goals, from multiple single-tenant clusters through to shared clusters that deliver namespaces-as-a-service. Rachael and Jim will define the concepts of multi-tenancy and micro-segmentation for cloud native systems, explain why they are critical to success with platform engineering. They will also show real-world examples of how they can be implemented, and demonstrate full automation using best practices like GitOps and Policy as Code.
Speakers
avatar for Jim Bugwadia

Jim Bugwadia

Co-founder and CEO, Nirmata
Jim Bugwadia is a co-founder and the CEO of Nirmata, the Kubernetes policy and governance company. Jim is an active contributor in the cloud native community and currently serves as co-chair of the Kubernetes Policy and Multi-Tenancy Working Groups. Jim is also a co-creator and maintainer... Read More →
avatar for Rachael Wonnacott

Rachael Wonnacott

Technical Product Owner, Kubernetes Platform, Fidelity International
Rachael has spent the last decade focused on platform engineering. She places a conscious emphasis on improving flow and is on the quest to smooth the application lifecycle for developers in the enterprise. With a background in astrophysics, Rachael brings her scientific approach... Read More →
Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 1 | Grand Ballroom BDF
  Platform Engineering

2:00pm MST

The Missing Talk About API Versioning & Evolution in Your Developer Platform - Stefan Schimanski, Upbound & Sergiusz Urbaniak, Independent
Friday November 15, 2024 2:00pm - 2:35pm MST
In the realm of developer platforms, individuals without extensive experience in the cloud-native ecosystem are now venturing into the creation of Kubernetes-based APIs. Tools like Crossplane are transforming every platform engineer into an API designer. Ten years in, the ecosystem still offers little guidance on Kubernetes versioning and API evolution in practice. A naive understanding is not helpful, and many have been burned by relying on intuition. This talk will provide deep, yet applicable knowledge, starting from the first principles of the invariants to maintain when changing APIs in Kubernetes. It will cover tools like schemas, conversion, validation, and admission, and present very concrete and directly applicable API Evolution Patterns. These patterns will help navigate the life cycle of CRD-based projects. This talk aims to educate on how to evolve APIs effectively and safely without inadvertently breaking users.
Speakers
avatar for Sergiusz Urbaniak

Sergiusz Urbaniak

Team Lead - Kubernetes, https://mongodb.com
Sergiusz is a Kubernetes Team Lead at MongoDB. He is enthusiastic about modern infrastructure software while still enjoying minimalistic networking techniques like morse code. He worked on Mesos, container runtimes, Prometheus Operator, Thanos, upstream Kubernetes, Operators, and... Read More →
avatar for Stefan Schimanski

Stefan Schimanski

Senior Principal Software Engineer, Upbound
Stefan is a Senior Principal Engineer at Upbound working on control planes, Kubernetes, kcp, and as a tech-lead in Sig API Machinery. He contributed a major part of the CRD feature set. Stefan is a 2nd time GoogleSummer of Code mentor with CNCF, loves to teach and help people to learn... Read More →
Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 2 | 251
  Platform Engineering

2:00pm MST

The Policy Engines Showdown - Gabriel L. Manor, Permit.io; Andres Aguiar, Okta; Omri Gazitt, Aserto; Anders Eknert, Styra; Sarah Cecchetti, AWS
Friday November 15, 2024 2:00pm - 2:35pm MST
OPA, Cedar, OpenFGA, Topaz, OPAL, OSO, should I continue? Policy engines, languages, and standards are everywhere, making the decision for a good decision engine increasingly difficult. In this panel, I'll host four talented engineers, each from a different policy engine's core team, for a friendly showdown. We will assist the audience in making the most important decision - choosing a suitable and fitting decision engine for their specific use case. We will also delve into the nuances of running multiple engines together and learn how to scale them properly.
Speakers
avatar for Sarah Cecchetti

Sarah Cecchetti

Head of Product, Cedar, AWS
Sarah is the Head of Product for Cedar Policy Language, an open-source project designed to express permissions in an easy-to-read and fast-to-execute format. She co-founded a professional organization for identity practitioners called IDPro. She is a contributor to NIST 800-63-C Digital... Read More →
avatar for Anders Eknert

Anders Eknert

Develeper Relations Lead, Styra
Developer advocate at Styra with a long background in software development, security and identity systems in primarily distributed environments. When not in front of his computer he enjoys watching football, cooking and Belgian beers.
avatar for Gabriel Manor

Gabriel Manor

Director of DevRel, Permit.io
Gabriel is a senior full-stack developer who blends his passion for technical leadership, security, authorization, and devtools into his current role as the Head of Growth and DevRel at Permit.io. Before joining Permit.io, Gabriel worked as a technical leader and principal engineer... Read More →
avatar for Omri Gazitt

Omri Gazitt

Co-founder & CEO, Aserto
Omri is the co-founder/CEO of Aserto, an authorization startup, and his third entrepreneurial venture. He's spent the majority of his 30-year career working on developer and infrastructure technology, most recently as the CPO of Puppet. Previously he was the VP and GM of HP's Cloud... Read More →
avatar for Andres Aguiar

Andres Aguiar

Product Manager, Okta
Andres has spent his 20+ year career building tools for developers, wearing different hats. He’s been working on the identity space for the last 6 years, and is currently the Product Manager for OpenFGA.
Friday November 15, 2024 2:00pm - 2:35pm MST
Salt Palace | Level 2 | 255 BC
  Security

2:00pm MST

Tutorial: Simplify and Optimize Your YAML with YAMLScriptb - Ingy döt Net, YAML LLC
Friday November 15, 2024 2:00pm - 3:30pm MST
Nobody likes YAML (or anything for that matter) when its a giant and repetitive mess. Of course, there are already existing technologies like Helm and Kustomize that help provide make YAML nicer for Kubernetes. The new kid on the block is YAMLScript. Being a complete programming language (built over a vast and mature ecosystem) its capabilities are effectively limitless. That said, its primary focus is on refactoring and improving existing and new large YAML configurations. YAMLScript can help you make the most of YAML in any domain; even those that already make great use of Helm and Kustomize. Having been created by an original inventor and current lead maintainer of the YAML data language (Ingy döt Net) you can count on it meshing well with the YAML you already know. In this hands on interactive tutorial, Ingy will teach you how to make the most of YAML and YAMLScript.
Speakers
avatar for Ingy döt؜؜ Net­

Ingy döt؜؜ Net­

Ingy döt Net, YAML LLC
Ingy döt Net is one of the original inventors of the YAML data language, and its primary maintainer. He has continuously contributed to Open Source efforts since before it was called Open Source. His passion is creating software libraries that work in as many programming languages... Read More →
Friday November 15, 2024 2:00pm - 3:30pm MST
Salt Palace | Level 1 | Grand Ballroom ACE

2:55pm MST

How GoTo Financial Automates Upgrading 60+ Istio Service Mesh Seamlessly! - Didi Yudha Perwira & Zufar Dhiyaulhaq, GoTo Financial
Friday November 15, 2024 2:55pm - 3:30pm MST
Istio, one of the most popular service meshes, is widely used by many companies. Service meshes simplify observability, traffic management, security, and policy on Kubernetes. While they offer significant benefits, day-to-day operations like upgrades can be challenging. These upgrades require active monitoring during the process. GoTo Financial, for instance, took more than 45 days to upgrade 60+ clusters. This talk will share their journey of building an open-source, opinionated automation solution to simplify the Istio service mesh upgrade process. This solution has shortened upgrade time to 14 days, reduced active monitoring, and frees up valuable engineering resources and minimized downtime risks.
Speakers
avatar for Zufar Dhiyaulhaq

Zufar Dhiyaulhaq

Engineering Manager, GoTo Financial
Zufar recently joins Gojek as Cloud Platform Engineer, He has been in the IT industry for 3 years, mostly working with Linux, Cloud, and Kubernetes. He also loves to contribute to open source projects like Istio and help to organize CNCF meetups in Indonesia.
avatar for Didi Yudha Perwira

Didi Yudha Perwira

Sr. Software Engineer, GoTo Financial
Didi has been working in GoTo Financial for 3 years and he has been working for Kubernetes and Istio since the day 1 he's working in GoTo Financial. Didi also have experience and passionate in software engineering field, usually he codes Golang, Javascript, Typescript and Python... Read More →
Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 2 | 255 EF
  Connectivity

2:55pm MST

Thousands of Gamers, One Kubernetes Network - Surya Seetharaman, Red Hat & Girish Moodalbail, NVIDIA Inc
Friday November 15, 2024 2:55pm - 3:30pm MST
Uninterrupted gameplay with minimal network latency, jitter, and maximum throughput is crucial for a great gamer experience. But how do we maintain consistent network quality in cloud gaming production environments at NVIDIA when 2K+ players (pods) share the same physical network for game storage and streaming? When a new player joins and a pod starts downloading large contextual game data, it is vital to shield other players on the same node from this 'noisy neighbor'. Kubernetes provides limited pod-level traffic shaping but we needed more than that. In this talk we will show how we achieved true Quality of Service and wire-speed networking on Kubernetes clusters using Differentiated Services Code Point (RFC7657) markings on pod traffic. Through a live demo that will involve a noisy pod and a victim pod, attendees will gain actionable insights and best practices around packet-parameter-tuned traffic shaping using simple Kubernetes Custom Resources to optimize network performance.
Speakers
avatar for Girish Moodalbail

Girish Moodalbail

Distinguished Engineer, NVIDIA Inc, NVIDIA Inc
Girish Moodalbail, a Distinguished Engineer at Nvidia Inc., builds Kubernetes-based GPU compute for gaming, AI training, and inferencing with low-latency, high-throughput, reliable, scalable, and secure networking using OSS (OVS, OVN, OVN-K8s CNI) and NVIDIA hardware. With over 22... Read More →
avatar for Surya Seetharaman

Surya Seetharaman

Principal Software Engineer, Red Hat Inc.
Surya is an Open Source advocate and contributor, active in the Kubernetes SIG-Network working group. She is working as a Principal Software Engineer at Red Hat in the OpenShift Networking team. Her areas of interest include Cloud Infrastructure and Networked Services and Systems... Read More →
Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

2:55pm MST

Object, Block, or File Storage? Choosing the Right Cloud Storage to Integrate Into Kubernetes - Mitch Becker & Tom McDonald, Amazon Web Services (AWS)
Friday November 15, 2024 2:55pm - 3:30pm MST
This presentation helps simplify the container storage landscape to assist K8s users make educated cloud storage choices based on their workload requirements and data strategy. You already know K8s is a an open-source platform that orchestrates containerized applications. But what type of cloud storage should one deploy for stateless and stateful applications to ensure persistent data across various operational scenarios? Different storage types cater to specific use cases within K8s environments. Organizations often require persistent storage to run K8s for stateful use cases such as Large-Scale Application Deployment, High-Performance Computing (HPC), AI/ML, Microservices Management, CI/CD Pipelines, and Big Data Processing. Because Block, File, and Object Storage are used in varying ways for containerized workloads, this talk will explain use cases for each storage type and educate the attendees so their selection of storage supports their applications and overall data strategy.
Speakers
avatar for Tom McDonald

Tom McDonald

Sr. Storage Specialist SA, AWS
Tom McDonald is a Senior Workload Storage Specialist at AWS. Starting with an Atari 400 and re-programming tapes, Tom began a long interest in increasing performance on any storage service. With 20 years of experience in the Upstream Energy domain, file systems and High-Performance... Read More →
avatar for Mitch Becker

Mitch Becker

Sr. Storage Specialist, Amazon Web Services (AWS)
Accomplished cloud professional transforming and modernizing IT environments: Cloud Computing, Cloud Storage, HPC, AI, Containers, DevOps, & Cloud Adoption/Migration/Transformation. • CNCF Storage Technical Advisory Group Member • AWS --- Certified Cloud Practitioner, Industry... Read More →
Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 1 | Grand Ballroom GI
  Data Processing + Storage

2:55pm MST

This Platform Goes to 11: Boost Developer Productivity with Lessons from Salesforce - Joe Kutner, Salesforce
Friday November 15, 2024 2:55pm - 3:30pm MST
Internal platforms play an essential role in boosting the productivity of developers who use cloud native technologies. That’s why Salesforce, a global leader in the cloud for more than two decades, evolved its existing collection of managed services and capabilities into a cohesive platform that delights developers. In this talk, you’ll learn how Salesforce's platform removes friction, unifies interfaces, and meets developers where they are with industry standard tooling. As you design and build your own platforms, you’ll be able to use the same principles that guided Salesforce to accelerate day-1 onboarding of new apps, increase the speed of the developer inner-loop and testing cycles, and reduce the time it takes to deliver new code to production. Our lessons learned will help you avoid missteps. Finally, you’ll learn how to measure developer satisfaction, performance, activity, collaboration, and efficiency to ensure that your platform delivers the most value for your developers.
Speakers
avatar for Joe Kutner

Joe Kutner

Software Architect, Salesforce
Joe is co-founder of the Cloud Native Buildpacks project, which aims to make containerization more secure and more developer friendly. He started the project in 2018 while working as DX Architect at Salesforce Heroku, and today is the DX Architect for Salesforce’s Hyperforce platform... Read More →
Friday November 15, 2024 2:55pm - 3:30pm MST
Salt Palace | Level 2 | 251
  Platform Engineering

4:00pm MST

Divide and Conquer: Master GPU Partitioning and Visualize Savings with OpenCost - Kaysie Yu & Ally Ford, Microsoft
Friday November 15, 2024 4:00pm - 4:35pm MST
Kubernetes is the ideal platform for running AI and ML workloads, such as LLMs. GPU nodes are often used for their parallel processing capabilities and higher performance benefits; however, they are known to be costly. Many factors impact the cost of running AI/ML workloads such as GPU utilization, GPU VM size, idle time, etc. These costs are often ignored and considered inherent in running GPU workloads. But if running workloads at scale and left unoptimized, costs will quickly spin out of control. In this talk, we leverage NVIDIA DCGM exporter with Prometheus for GPU metrics monitoring alongside OpenCost to measure the Kubernetes spend of our GPU workloads. We will provide an overview of OpenCost, highlighting its role in bridging the gap between the developer and platform teams through visibility and accountability of spend. We will demonstrate how to use the NVIDIA GPU Operator and how techniques such as partitioning can lead to significant cost savings.
Speakers
avatar for Ally Ford

Ally Ford

Product Manager, Microsoft
Ally is a Product Manager on the Azure Kubernetes Service (AKS) team at Microsoft Azure. She spends her days collaborating with customers to design features that improve the end to end operator experience for both Linux and Windows users. Formerly she was a UX designer and project... Read More →
avatar for Kaysie

Kaysie

Product Manager, Microsoft
Kaysie Yu is a Product Manager on the Azure Kubernetes Service team at Microsoft. She works on cost management and optimization and is passionate about the convergence of FinOps and GreenOps, advocating for best practices that help organizations achieve cost efficiency while contributing... Read More →
Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

4:00pm MST

Topology Aware Routing: Understanding the Tradeoffs - Rob Scott, Google
Friday November 15, 2024 4:00pm - 4:35pm MST
In Kubernetes 1.31, a new TrafficDistribution field on Services graduated to beta. This is effectively our third attempt at solving Topology Aware Routing in Kubernetes. This talk will tell the story of how we got here and what we learned along the way, outlining what exactly has made this problem so surprisingly complex. With that context, we’ll dive into exactly how Traffic Distribution works today, and when you should configure it. You’ll learn about how it’s implemented today, and how better implementations may be written in the future. We'll walk through some examples to show how it can work well, and when it may not. Finally, we’ll cover how this concept will interact with autoscaling, load balancers, Ingresses, Gateways, and Multi-Cluster Services. You should leave this talk with a clear understanding of how Topology Aware Routing works in Kubernetes, when to use it, and a broad awareness of the work that’s still in progress in this space.
Speakers
avatar for Rob Scott

Rob Scott

Software Engineer, Google
Rob is an open source enthusiast currently working on Kubernetes Networking at Google. He's been a maintainer of Gateway API since the very early days of the project and led the development of other Kubernetes networking APIs like EndpointSlices.
Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 1 | 155 EF
  Connectivity

4:00pm MST

The Node Tetris Rabbit Hole: Why Your Binpacking Might Be Underperforming - Hannah Taub, Adobe Inc.
Friday November 15, 2024 4:00pm - 4:35pm MST
Have you ever looked at your Kubernetes cluster and thought “I have a perfectly good autoscaler! Why are all my nodes at less than 50% capacity?” When a team moves to the scale of hundreds of clusters with thousands of nodes, efficient binpacking changes from a side task to a financial necessity. From inefficient client apps to long-buried cluster configs, follow the Adobe Ethos team as they track down leads on what’s causing cluster underutilization and how to fix it. You will also learn some tips for designing your clusters to avoid these issues in the first place.
Speakers
avatar for Hannah Taub

Hannah Taub

Ms., Adobe Inc.
As a senior software engineer, Hannah has been working with Adobe’s Cloud Cost Efficiency team for the past several years. After graduating from the University of Edinburgh, she went from writing content APIs at Viacom (now Paramount) to building out Adobe’s Ethos Kubernetes CI/CD... Read More →
Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

4:00pm MST

Migratory Patterns: Making Architectural Transitions with Confidence and Grace - Pete Hodgson, PartnerSlate
Friday November 15, 2024 4:00pm - 4:35pm MST
Big technical migrations - like switching databases - can feel like you're swapping out the engine of a bus while continuing to drive down the freeway (with all your users screaming in the back). However, there are ways to make these transitions safe, incremental, low-stress. In this talk we'll walk through a real-world case study of switching a production system from one database to another with no downtime, and no tears, using techniques like Expand/Contract, Dark Launch and Parallel Run. We'll also see hands-on examples of using CNCF open standards like Open Feature and Open Telemetry to manage this migration.
Speakers
avatar for Pete Hodgson

Pete Hodgson

CTO, PartnerSlate
Pete Hodgson is an independent software delivery consultant. He helps engineering teams to level up and tackle their thorniest challenges, with a focus on agile engineering practices, architectural evolution, and lean process management. Prior to going independent he spent several... Read More →
Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 2 | 255 EF
  SDLC

4:00pm MST

Why Perfect Compliance Is the Enemy of Good Kubernetes Security - Michele Chubirka, Google
Friday November 15, 2024 4:00pm - 4:35pm MST
Technology organizations often struggle over who should manage the security of their Kubernetes environment. This task usually falls to platform or cloud engineering teams, but they often feel abandoned by their security counterparts, uncertain of which requirements will deliver real security value. While published benchmarks and security guides for Kubernetes are helpful, not all recommendations work for every use-case. They may require Kubernetes alpha or beta features which could cause issues with platform stability. Our desire to prioritize “perfect” security over having a functional platform that addresses relevant risks can leave us with nothing, frustrating everyone. Kubernetes is meant to increase application delivery velocity, but when overly strict compliance prevents a team from moving forward, they will subvert security requirements. Let’s stop obsessing over the red in our security and compliance dashboards and focus on what adds real value by reducing risk.
Speakers
avatar for Michele Chubirka

Michele Chubirka

Cloud Security Advocate, Google
Michele Chubirka is a recovering Unix and network engineer currently working as a cloud security advocate for Google. She has been an architect, podcaster and freelance writer for various B2B publications such as Network Computing, Dark Reading and TechTarget. She likes long walks... Read More →
Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 2 | 255 BC
  Security

4:55pm MST

Best of Both Worlds: Integrating Slurm with Kubernetes in a Kubernetes Native Way - Eduardo Arango Gutierrez, NVIDIA & Angel Beltre, Sandia National Laboratories
Friday November 15, 2024 4:55pm - 5:30pm MST
It's not always clear which container orchestration system is best suited for a given use case. Slurm, for example, is often preferred over Kubernetes when running large-scale distributed workloads. As a result, organizations areoften faced a hard choice: do they deploy Slurm or Kubernetes to service the rising demands of their AI/ML workloads. In this talk, we introduce K-Foundry, an open-source custom controller for KCP that translates Kubernetes jobs to Slurm jobs and exposes Slurm nodes and cluster info as Kubernetes Custom Resource Definitions (CRDs). This integration combines Slurm’s robust job scheduling with Kubernetes' dynamic orchestration and API-driven ecosystem, easing the administration of both clusters through a common API. This session will end with a live demo, where attendees will see how this integration bridges the gap between cloud and HPC, facilitating resource management and optimizing performance for large-scale AI and LLM tasks.
Speakers
avatar for Eduardo Arango Gutierez DE

Eduardo Arango Gutierez DE

Senior systems software engineer, NVIDIA
Eduardo is a Senior Systems Software Engineer at NVIDIA, working on the Cloud Native Technologies team. Eduardo has focused on enabling users to build and deploy containers on distributed environments.
avatar for Angel Beltre

Angel Beltre

Senior Member of Technical Staff, Sandia National Laboratories
Angel Beltre serves as a senior member of the technical staff within the Scalable System Software department at Sandia National Laboratories. He is a contributor to the CSSE Computing-as-a-Service (CaaS) initiative, aimed at streamlining the deployment of modeling and simulation tools... Read More →
Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 2 | 250
  AI + ML

4:55pm MST

Distributed Multi-Node Model Inference Using the LeaderWorkerSet API - Abdullah Gharaibeh & Rupeng Liu, Google
Friday November 15, 2024 4:55pm - 5:30pm MST
Large Language Models have shown remarkable capabilities in various tasks, from text generation to code writing. However, the inference process for these models presents significant challenges. LLMs are computationally intensive, often requiring specialized hardware like TPUs or GPUs to achieve reasonable response times. In some cases their substantial size can strain the resources of a single machine. Specifically, models such as Gemini, Claude, and GPT4 are too large to fit on any single GPU or TPU device, let alone on any single multi-accelerator machine, necessitating what we refer to as multi-node server deployment where a single model server “backend” runs as a distributed process on multiple nodes to harness enough accelerator memory to fit and run the model. This talk presents LeaderWorkerSet, a new k8s API that enables multi-node model inference. We demonstrate its capabilities by orchestrating state of the art model servers such as vLLM and JetStream on both GPUs and TPUs.
Speakers
avatar for Abdullah Gharaibeh

Abdullah Gharaibeh

Staff Software Engineer, Google
Abdullah is a staff software engineer at Google and sig-scheduling and working group batch co-chair. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.
avatar for Rupeng Liu

Rupeng Liu

Software engineer, Google
Rupeng Liu, a software engineer from the Google's Kubernetes inference team
Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

4:55pm MST

Service Profiling Based Management and Scheduling in K8s - Jia Deng, Cong Xu & Mingmeng Luo, Bytedance
Friday November 15, 2024 4:55pm - 5:30pm MST
We present an open-source solution for the efficient management of resources and scheduling strategies in K8s. Our solution constructs workload-specific resource profiles based on their historical utilization patterns. This approach ensures that workloads receive adequate resources while optimizing overall resource utilization. To accomplish this objective, we employ a custom resource Service Profiling Description (SPD), facilitating a direct correlation between workloads and their resource usages, such as deployments and stateful sets etc. Resource utilization metrics, including CPU, disk I/O, and network I/O, are meticulously collected and aggregated. These usage indicators play a pivotal role in informing the scheduler's decisions regarding workloads allocation. This solution has been deployed within large-scale K8s clusters, addressing diverse workload demands, ranging from those requiring dedicated NUMA nodes to those capable of resource sharing among themselves.
Speakers
avatar for Mingmeng Luo

Mingmeng Luo

Software Engineer, Bytedance
Mingmeng Luo is a software engineer in the Infrastructure Department at ByteDance, where he specializes in the design and development of precision resource management technologies for large-scale Kubernetes clusters. His work focuses on optimizing resource allocation and efficiency... Read More →
avatar for Cong Xu

Cong Xu

Senior Software Engineer, Bytedance
Cong Xu is a Tech Lead and Senior Software Engineer at ByteDance, where he focuses on building and optimizing the container-based cloud platform that hosts internal products such as Douyin and TikTok. From 2016 to 2022, he served as a Staff Research Member at IBM Research, contributing... Read More →
avatar for Jia Deng

Jia Deng

Software Engineer, Bytedance
The speaker currently works for bytedance K8s orchestration team. Before that, the speaker worked for amazon EKSA and VMware Tanzu Mission Control.
Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 1 | 155 BC
  Operations + Performance

4:55pm MST

Zero Downtime Upgrades at Scale: How Okta Manages Hundreds of Clusters Daily - Jérémy Albuixech & Kahou Lei, Okta
Friday November 15, 2024 4:55pm - 5:30pm MST
How do you upgrade your K8s clusters? Perhaps a rolling update of nodes, with services moving around? Can you guarantee a zero-downtime upgrade? Will this method scale and support the velocity of production environments? Likely not. But fear not - you are not alone! At Okta, we maintain hundreds of clusters, each hosting >130 services, with node counts ranging from 20-400 and we are updating them daily. How do we do it? Without an out-of-the-box solutions we had to build our own and we want to share what we learned with all of you! In this talk Kahou and Jeremy will go over the challenges and successes, highlighting how their deployment method provides the foundational blocks to build extra features while reducing the blast radius when something goes wrong - thanks to quick rollbacks and a canary rollouts. In this session attendees will learn how we leverage open source technologies to tackle three main problems: how to scale, how to secure and how to upgrade clusters with no downtime.
Speakers
avatar for Jérémy Albuixech

Jérémy Albuixech

Staff Software Engineer, Okta
Jeremy is a Staff Software Engineer at Okta. Starting as a full stack programmer with a good foundation in Javascript, he then gravitated towards a DevOps role and later became a member of the SRE team at Cisco, picking up an IaC, observability and Kubernetes skillset. With the Okta... Read More →
avatar for Kahou Lei

Kahou Lei

Principal Software Engineer, Okta
Kahou Lei is a Principal Software Engineer with a strong background in Cloud infrastructure and Kubernetes. With 20 years of industry experience, he has held significant positions at renowned companies such as Okta and Cisco. Kahou leads critical software engineering initiatives... Read More →
Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 2 | 251
  Platform Engineering

4:55pm MST

SPIFFE the Easy Way: Universal X509 and JWT Identities Using Cert-Manager - Tim Ramlot & Ashley Davis, Venafi
Friday November 15, 2024 4:55pm - 5:30pm MST
SPIFFE is incredible. Each workload is assigned its own universal identity, simplifying the security and management of communications in distributed systems. While SPIRE (the reference SPIFFE implementation) is exceptionally powerful, it is also quite complex. Deploying SPIRE on Kubernetes requires StatefulSets, which can be challenging and frustrating. Many cloud vendors are starting to offer turnkey SPIFFE solutions, but that comes with risk of vendor lock-in. In this talk, we will demonstrate how to use the Cloud Native cert-manager solution to implement SPIFFE (x509 and JWT) with low operational overhead for all Kubernetes workloads. The session includes all you need to know to issue X.509 SVIDs, use them and validate them. Additionally, we will introduce an experimental solution to convert x509 SVIDs into JWT SVIDs. The demo will highlight how to authenticate to third-party APIs (such as AWS, GCP, Azure, and others) using these JWT SVIDs.
Speakers
avatar for Ashley Davis

Ashley Davis

Staff Software Engineer, Venafi
As a teenager, Ash taught himself to program after wondering how exactly video games were made. That led to adventures trawling through open source codebases, sparking an interest in computers spanning from bare-metal machine code right up to scalable distributed platforms like Kubernetes... Read More →
avatar for Tim Ramlot

Tim Ramlot

Senior Software Engineer - cert-manager maintainer, Venafi
Tim started working at Venafi as a software engineer after his graduation as computer science engineer at Ghent University. He learned about cert-manager and Venafi through a Google Summer of Code internship. His mission at Venafi is to advance his problem solving skills, whilst contributing... Read More →
Friday November 15, 2024 4:55pm - 5:30pm MST
Salt Palace | Level 1 | 151
  Security
 

Share Modal

Share this link via

Or copy link

Filter sessions
Apply filters to sessions.
  • 🚨 Contribfest
  • 🪧 Poster Sessions
  • AI + ML
  • Breaks
  • ⚡ Lightning Talks
  • Cloud Native Experience
  • Cloud Native Novice
  • CNCF-hosted Co-located Events
  • Connectivity
  • Data Processing + Storage
  • Emerging + Advanced
  • Experiences
  • Keynote Sessions
  • Maintainer Track
  • Observability
  • Operations + Performance
  • Platform Engineering
  • Project Opportunties
  • Registration
  • SDLC
  • Security
  • Solutions Showcase
  • Sponsor-hosted Co-located Event
  • Tutorials