In-person
November 12-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Intermediate
Wednesday, November 13
 

11:15am MST

Architecting Tomorrow: The Heterogeneous Compute Resources for New Types of Workloads - Alexander Kanevskiy, Intel Finland
Wednesday November 13, 2024 11:15am - 11:50am MST
Imagine managing a set of diverse workloads on a Kubernetes node, operating across dozens of CPU cores and several memory zones. But do you truly comprehend the difference between one CPU core and another? Are you aware of the impact that different memory zones might have on your workload's efficiency? Will optimisations for one type of workload be helpful for another? Do you think your ML workload will behave the same way as, say, Redis? This presentation delves deep into CPU internals, memory types (DRAM, HBM, CXL), and diverse cache/core types and layouts. Explore recent hardware advancements and their impact on workloads. We'll examine native compute resource allocation strategies from a hardware point of view, crucial for enhancing workload performance and optimising energy usage and cost efficiency. Join us and learn the details of modern hardware architecture that give you a framework for making more informed choices on hardware resource optimisation for your infrastructure.
Speakers
Alexander Kanevskiy

Principal Engineer, Cloud Orchestration Software, Intel Finland
Alexander is currently employed by Intel as Principal Engineer, Cloud Software, focusing on various aspects of Kubernetes: Resource Management, Device plugins for hardware accelerators, Cluster Lifecycle, and Cluster APIs. Alexander has 25+ years of experience in areas of Linux...
Wednesday November 13, 2024 11:15am - 11:50am MST
Salt Palace | Level 2 | 254 B
  Emerging + Advanced

12:10pm MST

Operationalizing High-Performance GPU Clusters in Kubernetes: Lessons Learned from Training Databricks DBRX - Will Gleich & Wai Wu, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Training large language models (LLMs) on GPUs within Kubernetes environments involves significant configuration and complexity, often leading to unique failure scenarios. This presentation will cover the lessons learned from training DBRX, a state-of-the-art LLM that we developed on a 400-node cluster with a primary workload utilizing 3,072 GPUs, and the tooling needed to measure and maintain a healthy fleet of nodes and the underlying interconnect fabric.
This will include:
  1. How we implemented GPU health detection leveraging Prometheus and DCGM Exporter 
  2. How we monitor GPU Direct Remote Direct Memory Access (GDRDMA) and the challenges of monitoring components that bypass CPU 
  3. Discussion of failure scenarios during training, and how they were addressed.
Databricks Mosaic AI Training leverages GPU clusters across many cloud providers to maximize availability; we will also discuss the variations we see across providers and how we had to engineer around them.
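The health-detection step above can be illustrated with a minimal sketch. The metric names mirror DCGM exporter fields, but the thresholds and helper functions are hypothetical, not what Databricks actually runs:

```python
# Illustrative sketch: classify GPU health from DCGM-style metrics.
# Metric names mirror DCGM exporter fields; thresholds are hypothetical.

def classify_gpu(metrics: dict) -> str:
    """Return 'healthy', 'degraded', or 'unhealthy' for one GPU."""
    if metrics.get("DCGM_FI_DEV_XID_ERRORS", 0) > 0:
        return "unhealthy"          # XID errors usually indicate a fault
    if metrics.get("DCGM_FI_DEV_ECC_DBE_VOL_TOTAL", 0) > 0:
        return "unhealthy"          # double-bit ECC errors
    if metrics.get("DCGM_FI_DEV_GPU_TEMP", 0) >= 90:
        return "degraded"           # thermal-throttling territory
    return "healthy"

def unhealthy_nodes(fleet: dict) -> list:
    """fleet maps node name -> list of per-GPU metric dicts."""
    return sorted(
        node for node, gpus in fleet.items()
        if any(classify_gpu(g) == "unhealthy" for g in gpus)
    )
```

A real pipeline would scrape these fields from the DCGM exporter's Prometheus endpoint and cordon or drain the nodes flagged unhealthy.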
Speakers

Wai Wu

Software Engineer, Databricks
Hey everyone, I was born to deploy Kubernetes ꈍ ω ꈍ

Will Gleich

Sr. DevOps Engineer, Databricks
Will Gleich is a Sr. DevOps engineer at Databricks specializing in MLOps and Site Reliability Engineering.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

12:10pm MST

Can Your Kubernetes Network Handle the Heat? Building Resilience with AI Chaos - Lior Lieberman, Google & Surya Seetharaman, Red Hat
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Kubernetes networking is complex, with many APIs, numerous configurations, and potential failure points. In the rapidly evolving world of cloud-native applications, ensuring your Kubernetes network can withstand unexpected failures is not just an advantage: it is a necessity. In this talk, Surya and Lior, who hold leadership roles in the Gateway API and NetworkPolicy API projects, will demonstrate how you can leverage AI-powered Chaos Engineering to stress test Gateways, NetworkPolicies, and Services on a live cluster! They will share their experiences and lessons learned from using Litmus and enhancing K8sGPT to design and execute AI Chaos experiments, focusing on how you can proactively find gaps and bottlenecks in the network infrastructure. This is a great opportunity to learn from real-world disruption scenarios and participate in a collaborative discussion on how we can leverage AI to build robust Kubernetes networks.
Speakers
Surya Seetharaman

Principal Software Engineer, Red Hat Inc.
Surya is an Open Source advocate and contributor, active in the Kubernetes SIG-Network working group. She works as a Principal Software Engineer at Red Hat on the OpenShift Networking team. Her areas of interest include Cloud Infrastructure and Networked Services and Systems...
Lior Lieberman

Site Reliability Engineer, Google
Lior is a site reliability engineer at Google working on Google Compute Engine. He is a leading maintainer of ingress2gateway and an active contributor to Kubernetes SIG Network, focused on Gateway API.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | 155 E
  Connectivity

12:10pm MST

Automated Multi-Cloud, Multi-Flavor Kubernetes Cluster Upgrades Using Operators - Ziyuan Chen, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Databricks manages over a thousand k8s clusters across three major cloud providers, running critical workloads in cloud regions around the world. This talk describes the system we built to upgrade nodes' operating systems, k8s versions, and other configs monthly, supporting EKS, AKS, GKE, and self-managed k8s. Our system is built on k8s operators and performs zero-downtime blue-green rolling updates; it respects contracts with services through features like PDBs, maintenance windows, deferred node draining, and custom workload-handling plugins. It enables easy rollbacks, has good observability, and incurs minimal human operational cost. This has allowed us to patch vulnerabilities and release infrastructure changes quickly and reliably across the fleet. We will also share lessons learned on building several operators that work together using the controller-runtime framework, designing the declarative interfaces between them, and achieving consistent behavior across three clouds.
Speakers

Ziyuan Chen

Staff Software Engineer, Databricks
Ziyuan Chen is a software engineer at Databricks. He has worked on Databricks' compute and OS infrastructure areas.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | 155 B
  Operations + Performance

12:10pm MST

Automated Multi-Cloud Large Scale K8s Cluster Lifecycle Management - Sourav Khandelwal, Databricks
Wednesday November 13, 2024 12:10pm - 12:45pm MST
I will present the system developed for cluster rotations across Databricks’ fleet of over a thousand cloud-managed k8s clusters on AWS, Azure, and GCP. Blue-green cluster rotations, or cluster swaps (upgrading by creating a new k8s cluster with a new version/configuration & shifting workloads from the old cluster), allow us to implement major infrastructure changes and upgrade k8s versions with low risk through staged rollouts, seamless rollbacks, zero downtime, and minimal operator intervention. Our system includes a k8s-style continuous reconciliation mechanism to manage cluster swap lifecycles, a fast and reliable cluster state change discovery system, and a k8s workload migration system. We will share methodologies and experiences in constructing this loosely coupled system that orchestrates product workloads and cloud provider APIs for automated cluster swaps. This session will explore the challenges faced, and the benefits of automating large-scale, multi-cloud k8s upgrades.
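The continuous-reconciliation idea behind cluster swaps can be reduced to a toy controller step. The state keys and action names below are invented for illustration, not Databricks' actual interfaces:

```python
# Toy reconcile step for a blue-green cluster swap: compare observed
# state to the desired end state and emit the next action. All state
# keys and action names here are made up for illustration.

def next_action(state: dict) -> str:
    """Given observed swap state, return the next step to take."""
    if not state.get("green_created"):
        return "create-green-cluster"
    if not state.get("green_healthy"):
        return "wait-for-health"
    if state.get("workloads_on", "blue") == "blue":
        return "migrate-workloads"
    if not state.get("blue_deleted"):
        return "delete-blue-cluster"
    return "done"

def reconcile_until_done(state: dict) -> list:
    """Drive the swap to completion, recording each action taken."""
    effects = {
        "create-green-cluster": ("green_created", True),
        "wait-for-health": ("green_healthy", True),
        "migrate-workloads": ("workloads_on", "green"),
        "delete-blue-cluster": ("blue_deleted", True),
    }
    actions = []
    while (action := next_action(state)) != "done":
        actions.append(action)
        key, value = effects[action]
        state[key] = value  # pretend each action succeeds immediately
    return actions
```

Rollback falls out naturally: reconciling back toward the old desired state reverses the swap.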
Speakers
Sourav Khandelwal

Sr. Software Engineer, Databricks
I am a seasoned software engineer with over 10 years of experience in designing and managing large-scale platforms in cloud-native environments. At Databricks, my significant contributions have been pivotal in launching our next-generation cloud infrastructure that helped to transition...
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 1 | Grand Ballroom H
  Platform Engineering

12:10pm MST

The Hard Truth About GitOps and Database Rollbacks - Rotem Tamir, Ariga
Wednesday November 13, 2024 12:10pm - 12:45pm MST
For two decades now, the common practice for handling rollbacks of database schema migrations has been pre-planned "down migration scripts". A closer examination of this widely accepted truth reveals critical gaps that result in teams relying on risky, manual operations to roll back schema migrations in times of crisis. In this talk, we show why our existing tools and practices cannot deliver on the GitOps promise of "declarative" and "continuously reconciled" workflows and how we can use the Operator Pattern to build a new solution for robust and safe schema rollbacks.
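The declarative alternative the talk argues for can be sketched in a few lines: instead of executing a pre-planned down script, a reconciler diffs the desired schema against the live one and converges. The table names here are made up:

```python
# Declarative schema reconciliation sketch (table names are invented):
# compute the operations needed to converge live state toward desired.

def diff_schema(current: set, desired: set) -> dict:
    """Return the operations a reconciler would apply."""
    return {
        "create": sorted(desired - current),
        "drop": sorted(current - desired),
    }

# Rolling back is just reconciling toward the previous desired state:
v2 = {"users", "orders", "audit_log"}
v1 = {"users", "orders"}
live = set(v2)  # cluster is currently at v2

rollback_plan = diff_schema(live, v1)
```

No hand-written down migration is involved; the rollback plan is derived from state, which is the continuously-reconciled behavior GitOps promises.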
Speakers

Rotem Tamir

CTO, Ariga
Rotem Tamir (39), father of two. Co-founder and CTO of Ariga, co-maintainer of Atlas and Ent. Ex-data platform architect at Nexar, infrastructure team lead at ironSource.
Wednesday November 13, 2024 12:10pm - 12:45pm MST
Salt Palace | Level 2 | 250 AD
  SDLC

2:30pm MST

Optimizing LLM Performance in Kubernetes with OpenTelemetry - Ashok Chandrasekar, Google & Liudmila Molkova, Microsoft
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Large Language Models are increasing in popularity, and their deployments on Kubernetes have steadily increased. LLM applications bring new usage patterns that the industry does not yet have expertise in. At the same time, there is a lack of observability in these deployments, which makes it difficult to debug performance issues. We will present an end-to-end walkthrough of how you can leverage client- and server-side LLM observability using OpenTelemetry, based on recent efforts in the Kubernetes and OpenTelemetry communities to standardize these signals across LLM clients and model servers. We will also demonstrate how to troubleshoot a real-world performance issue in your LLM deployment and how to optimize your LLM server setup for better performance on Kubernetes. We'll show how to use Kubernetes autoscaling based on custom model server metrics and demonstrate how they offer a superior alternative to GPU utilization metrics for such deployments.
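For context, autoscaling on a custom model-server metric (such as queue depth) ultimately feeds the stock HPA scaling rule documented by Kubernetes, desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). A small sketch, with illustrative replica bounds:

```python
import math

# The stock Kubernetes HPA scaling rule applied to a custom metric,
# e.g. a model server's request-queue depth:
#   desired = ceil(current_replicas * metric_value / target_value)
# The min/max bounds below are illustrative defaults.

def desired_replicas(current_replicas: int,
                     metric_value: float,
                     target_value: float,
                     min_r: int = 1, max_r: int = 50) -> int:
    desired = math.ceil(current_replicas * metric_value / target_value)
    return max(min_r, min(max_r, desired))
```

With a queue-depth target of 10 per replica, 4 replicas observing a depth of 30 scale to 12, regardless of what GPU utilization happens to read at that moment.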
Speakers
Liudmila Molkova

Principal Software Engineer, Microsoft
Liudmila Molkova is a Principal Software Engineer at Microsoft working on observability and Azure client libraries. She is a co-author of distributed tracing implementations across the .NET ecosystem, including HTTP client instrumentation and Azure Functions. Liudmila is an active...
Ashok Chandrasekar

Senior Software Engineer, Google
Ashok Chandrasekar is a Senior Software Engineer at Google working on the AI/ML experience for Google Kubernetes Engine. Previously he was a Staff Engineer at VMware, where he led the cluster lifecycle management area for Tanzu Mission Control. He has 7 years of Kubernetes experience working...
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 2 | 255 E
  AI + ML

2:30pm MST

AIStore as a Fast Tier Storage Solution: Enhancing Petascale Deep Learning Across Cloud Backends - Abhishek Gaikwad & Aaron Wilson, NVIDIA
Wednesday November 13, 2024 2:30pm - 3:05pm MST
As deep learning continues to evolve, the demand for handling petascale datasets efficiently becomes paramount. Current cloud storage solutions often struggle with the speed (throughput) and cost-effectiveness required for these massive datasets, particularly due to the random access needs of machine learning workloads. This talk introduces AIStore (AIS) as a fast-tier storage solution designed to overcome these challenges by offering a fast, scalable, cost-effective tier for deep learning data. AIS features linear scalability with each added storage node - in fact, with each added drive. In this presentation, we will explore the architecture and benefits of AIStore, focusing on its linear scalability and high performance. This session will feature detailed benchmarks and use cases comparing the performance of accessing cloud datasets with and without AIStore, highlighting AIS's ability to deliver high per-GPU throughput and stable latencies.
Speakers
Aaron Wilson

Senior Software Engineer, NVIDIA
I'm a developer on the AIStore project, focused mainly on our Kubernetes deployments and Python SDK for training workflows. Most recently, I've been working on the AIS K8s Operator and Helm charts, improving AIStore deployment and lifecycle management. My past experience has been...
Abhishek Gaikwad

Software Engineer, NVIDIA
Abhishek Gaikwad is a Software Engineer at NVIDIA with a Master of Science degree in Computer Science from San Jose State University. As a key developer and maintainer of AIStore, Abhishek has played a crucial role in its design, development, and management. His contributions include...
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom A
  Data Processing + Storage

2:30pm MST

Unifying Observability: Correlating Metrics, Traces, and Logs with Exemplars and OpenTelemetry - Kruthika Prasanna Simha & Charlie Le, Apple
Wednesday November 13, 2024 2:30pm - 3:05pm MST
In modern distributed systems, observability is key to understanding application performance and behavior. While metrics, traces, and logs each provide valuable insights, their true power is realized when they are correlated. This talk will dive into the practical benefits and implementation of correlating these signals with exemplars using the OpenTelemetry SDK and Collector, and showcase the results in Grafana. Attendees will learn how to leverage OpenTelemetry to create exemplars which will allow them to navigate from either logs or metrics to their traces.
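The correlation idea can be shown with a hand-rolled sketch (deliberately not the OpenTelemetry SDK API): a histogram observation remembers the trace ID that produced it, so a latency spike in a metric can be followed straight to a trace:

```python
# Hand-rolled illustration of the exemplar concept; the real mechanism
# is provided by the OpenTelemetry SDK and Collector. The trace ID and
# bucket bounds below are made-up example values.

BUCKETS = [0.05, 0.1, 0.5, 1.0, float("inf")]  # seconds

def observe(hist: dict, value: float, trace_id: str) -> None:
    """Record a latency sample and attach the current trace as exemplar."""
    for bound in BUCKETS:
        if value <= bound:
            slot = hist.setdefault(bound, {"count": 0, "exemplar": None})
            slot["count"] += 1
            slot["exemplar"] = trace_id  # latest trace seen in this bucket
            break

hist = {}
observe(hist, 0.07, trace_id="4bf92f3577b34da6a3ce929d0e0e4736")
```

A Grafana-style drill-down then navigates from the slow bucket directly to the stored trace ID.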
Speakers
Kruthika Prasanna Simha

Senior Software Engineer, Apple
Kruthika is a software engineer at Apple specializing in building ML-enabled observability solutions. She holds a Masters in Computer Engineering and has specialized in Machine Learning. In her free time, she likes to dabble with Jupyter Notebooks for running experiments with data...
Charlie Le

Senior Software Engineer, Apple
Charlie is a software engineer at Apple, specializing in building and scaling cloud native observability solutions and infrastructure. Deeply inspired by the collaborative spirit of open source, he actively contributes to projects like Cortex and OpenTelemetry, shaping the future...
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom B
  Observability

2:30pm MST

Does My K8s Application Need CPR? Performance Evaluation of a Multi-Cluster Workload Management App - Braulio Dumba & Ezra Silvera, IBM
Wednesday November 13, 2024 2:30pm - 3:05pm MST
KubeStellar (KS) is an open-source Kubernetes multi-cluster workload configuration management system that can be used to manage AI workloads in multi-cluster environments. Hence, understanding KS performance is crucial, especially when managing resource-intensive AI workloads. In this talk, we will present our experience analyzing the performance metrics of KS across several dimensions of scalability (e.g., number of bindingPolicies, workload description spaces, and number of managed remote clusters) and the challenges that arise when conducting performance experiments in a multi-cluster environment. Our insights will demonstrate the utility of benchmarking the performance of a multi-cluster Kubernetes workload management application. Additionally, we will demonstrate the usefulness of several open-source tools, such as clusterloader2, kube-burner, and kwok, for evaluating the performance of multi-cluster Kubernetes management applications.
Speakers
Ezra Silvera

Senior Technical Staff Member, IBM
Ezra Silvera is a Senior Technical Staff Member at IBM Research. His interests include distributed systems, cloud management, and cloud infrastructure. Ezra is passionate about open-source technologies and has been involved in several notable open source projects such as Docker, KubeVirt...
Braulio Dumba

Staff Research Scientist, IBM
Dr. Braulio Dumba is a Staff Research Scientist at IBM Research. In 2018, he joined IBM under the Hybrid Cloud organization. His current research is focused on edge computing and hybrid cloud computing. Dr. Dumba earned a Ph.D. in Computer Science from the University of Minnesota, Twin...
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | 155 B
  Operations + Performance

2:30pm MST

Better Pod Availability: A Survey of the Many Ways to Manage Workload Disruptions - Zach Loafman, Google
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Kubernetes Pods are ephemeral, but some are more ephemeral than others. Kubernetes provides a dizzying array of options to manage and handle Pod disruption. From PodDisruptionBudgets, to "safe-to-evict" annotations, GracefulTermination timeouts and more, it can be incredibly hard to determine the optimal solution for handling Pod disruption and how to manage gracefully terminating your application. Thankfully, due to the extensible nature of Kubernetes we can build CRDs and controllers that can simplify these complex topics for end users. In this talk, we'll present an in-depth analysis of the built-in options and how they work (or don't). While this problem is not unique to game-serving, we'll deep-dive and explain how Agones (an open-source session orchestration system layered on Kubernetes) solves this problem with a simple abstraction to hide the complexity!
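As one concrete example of the options surveyed, the core check behind a PodDisruptionBudget with minAvailable is simple to state (the real eviction API also handles maxUnavailable and percentage values):

```python
# The arithmetic a PodDisruptionBudget implies for a voluntary
# eviction: it is admitted only if the healthy pod count stays at or
# above minAvailable afterwards. Simplified to the minAvailable case.

def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    return healthy_pods - 1 >= min_available
```

With 3 healthy pods and minAvailable=2 an eviction proceeds; with only 2 healthy pods it is refused, which is exactly what makes node drains block.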
Speakers

Zach Loafman

Staff Software Engineer, Google
Zach leads Google’s GKE Games team. He was previously lead of the Kubernetes Control Plane team for GKE, lead of the GKE Cluster Lifecycle team, worked on Kubernetes prior to GA, and was one of the founding members of the Google Kubernetes Engine team.
Wednesday November 13, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Grand Ballroom H
  Platform Engineering

2:30pm MST

Tutorial: Confidential Containers 101: A Hands-on Workshop - Archana Choudhary & Suraj Deshmukh, Microsoft
Wednesday November 13, 2024 2:30pm - 4:00pm MST
Here is how you can be prepared for the workshop, to follow along: https://github.com/surajssd/kubecon-na24-workshop/blob/main/preparation.md

As traditional enterprises with stringent data protection requirements become cloud-native and migrate to Kubernetes on public clouds, they are wondering: “Is my data secure on this shared hardware? Can someone with host access snoop on my data?” Especially with the upcoming Digital Operational Resilience Act (DORA) in Europe mandating protection of data in use, it's crucial for users to familiarize themselves with solutions like Confidential Containers (CoCo), a CNCF sandbox project. In this first-of-its-kind hands-on workshop, we'll dive deep into using CoCo with k8s. We'll explore real-world challenges, such as ensuring data confidentiality from platform owners (cloud providers), and show you how to overcome them. Through practical exercises, you'll learn to set up CoCo and secure your containerized workloads, turning theory into practice. Attendees will discover streamlined practices, find robust protection mechanisms, and gain strategic insights into adopting CoCo.
Speakers
Suraj Deshmukh

Senior Software Engineer, Microsoft
Suraj has worked with Kubernetes since version 1.3. He organized the Kubernetes Bangalore meetup and helped bring Kubernetes to the masses. To make Kubernetes easier, he has worked on projects like Kompose, which converted docker-compose files to Kubernetes artifacts. He has spoken...

Archana Choudhary

Ms, Microsoft
A software engineer who has been exploring cloud-native technologies, particularly focusing on confidential containers over the past several months.
Wednesday November 13, 2024 2:30pm - 4:00pm MST
Salt Palace | Level 1 | Grand Ballroom G
  Tutorials, Security

3:25pm MST

A Tale of 2 Drivers: GPU Configuration on the Fly Using DRA - Alay Patel & Varun Ramachandra Sekar US, Nvidia
Wednesday November 13, 2024 3:25pm - 4:00pm MST
NVIDIA’s GeForceNow is a cloud gaming service that allows users to stream video games from NVIDIA's servers to a wide range of devices, including PCs, Macs, Android devices, iOS devices, and smart TVs. Under the hood, it is powered by Kubernetes running Kubevirt VMs. For a seamless user experience, GeForceNow dynamically switches GPU drivers to accommodate either passing through an entire GPU or slicing it into multiple virtual GPUs, all while keeping utilization close to 100% across the datacenter. This poses significant challenges when using the traditional device plugin API provided by Kubernetes. In this talk, we explore GeForce Now’s journey to transition away from the traditional device plugin API in favor of Dynamic Resource Allocation (DRA). We'll share valuable insights for anyone looking to perform a similar migration of their own. Join us to learn about the challenges, solutions, and best practices to help optimize your GPU-accelerated workloads in the cloud.
Speakers

Alay Patel

Senior Software Engineer, Nvidia
Alay is a Senior Software Engineer at Nvidia where he works on scaling up the workloads running on GPU Compute infrastructure. He is passionate about open source with a focus on Kubernetes and platform engineering.

Varun Ramachandra Sekar US

Senior Software Engineer, Nvidia
Developer by day, Dog whisperer by night.
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 2 | 255 B
  AI + ML

3:25pm MST

Using OpenTelemetry for Deep Observability Within Messaging Queues - Shivanshu Raj Shrivastava & Ekansh Gupta, SigNoz
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Recent changes in OpenTelemetry have introduced new semantic conventions and agent updates to better monitor messaging queues such as Kafka, RabbitMQ, and Amazon SQS. In this session, we'll discuss how these semantic conventions standardize the telemetry collected from producers, consumers, and the messaging queues themselves, and how in-depth observability can be achieved by correlating producer-to-consumer spans with the metrics collected from Kafka. Additionally, we will demonstrate how Kafka Java client instrumentation, combined with JMX metrics collected from Kafka, enables OpenTelemetry-based metrics-to-trace and trace-to-metrics correlation and helps spot the reasons for anomalies like increased consumer lag, partition failures, and time spent in the messaging queue. Surfacing the corresponding traces in time helps end users delve deeper into their infrastructure and optimize their asynchronous applications.
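As a small illustration of one anomaly signal discussed above, consumer lag is just the gap between each partition's log-end offset and the consumer group's committed offset, summed across partitions:

```python
# Consumer lag: messages produced but not yet consumed by a group.
# Partition names and offsets below are example values.

def consumer_lag(end_offsets: dict, committed: dict) -> int:
    """Sum of (log-end offset - committed offset) over partitions."""
    return sum(end - committed.get(p, 0)
               for p, end in end_offsets.items())
```

When this number grows while producer throughput is flat, the correlated producer-to-consumer spans show where the time is going.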
Speakers
Ekansh Gupta

SDE, SigNoz
Ekansh is a Software Development Engineer at SigNoz, with active involvement in various open-source and cloud native communities for upwards of two years now. He was previously an SDE Intern at SteamLabs. He has also given talks at PyCon, KubeCon, and MozFest. Ekansh...
Shivanshu Raj Shrivastava

Founding Engineer, SigNoz
Shivanshu is a Founding Engineer at SigNoz, working on building an OTel-native observability product. He has a keen interest in deep tech and OSS. He is a CNCF ambassador and a member of CNCF projects like OTel, k8s, and Istio. He has had the opportunity to mentor contributors in...
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | Grand Ballroom B
  Observability

3:25pm MST

Setting New Standards for Reliability in Cloud Native Multi-Region Applications - Trey Caliva, Global Payments
Wednesday November 13, 2024 3:25pm - 4:00pm MST
As a multinational FinTech provider, processing over 32 billion card transactions for 816 million accounts, Global Payments requires globally available architectures with quick disaster recovery while maintaining subsecond latencies. In addition, these workloads require strict adherence to compliance standards. This session will explore the high-level architectural decisions implemented in a cloud-native redesign and cloud migration of a mission critical legacy .NET application. Key cloud native tools leveraged include Kubernetes on GCP, and the use of CockroachDB as a cloud native database solution. We will explore how leveraging these cloud native technologies achieved extreme fault tolerance in a multi-region deployment, setting new standards for performance and reliability.
Speakers

Jim Hatcher

Solutions Engineer, Cockroach Labs
Trey Caliva

Principal Cloud Architect, Global Payments
Trey Caliva is an architect and engineer with 10+ years of hands-on experience planning, developing, managing, and securing deployments in Google Cloud and AWS. He is currently Principal Cloud Architect at Global Payments, a Fortune 500 company and a member of the S&P 500, focused...
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | 155 B
  Operations + Performance

3:25pm MST

Scale Job Triggering with a Distributed Scheduler - Cassie Coyle & Artur Souza, Diagrid
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Imagine scheduling thousands or millions of jobs that are persisted, triggered on time, and resilient to downtime. Some jobs might be triggered every second, while others need to be reliably triggered on the first day of the month. Achieving high throughput and reliability is critical for the performance and operational efficiency of modern distributed systems. How can traditional cron job scheduling be extended? How can distributed systems handle job scheduling with minimal downtime? What challenges arise when scaling job scheduling to thousands or millions of jobs? In this session, Artur and Cassie will delve into the design of Dapr's distributed Scheduler and how users can start using it today. You will gain a comprehensive understanding of how Dapr's Scheduler unblocks the scalability of actors and workflows while also enabling new capabilities, like delayed pubsub and the schedule job API.
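The core of timely triggering can be sketched with a min-heap keyed by next trigger time; persistence, sharding, and failover, the hard parts the session covers, are omitted, and the job names are made up:

```python
import heapq

# Minimal in-memory sketch of a trigger loop: a min-heap of
# (next_fire_time, job_name, repeat_interval) tuples. A None interval
# means a one-shot job. Job names and times are illustrative.

def pop_due(heap: list, now: float) -> list:
    """Return names of all jobs due at or before `now`,
    re-queueing recurring jobs at their next interval."""
    due = []
    while heap and heap[0][0] <= now:
        when, name, interval = heapq.heappop(heap)
        due.append(name)
        if interval:  # recurring job: schedule the next run
            heapq.heappush(heap, (when + interval, name, interval))
    return due

jobs = []
heapq.heappush(jobs, (1.0, "every-second", 1.0))
heapq.heappush(jobs, (60.0, "monthly-report", None))
```

A distributed scheduler must additionally persist this state and partition it across replicas so triggers survive restarts, which is where the design discussion begins.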
Speakers
Artur Souza

Head of Engineering, Diagrid
I have been a maintainer of Dapr since 2019, helped the project reach its 1.0 stable version, and have kept up frequent releases since then. I am currently Head of Engineering at Diagrid, leading the engineering teams building Conductor and the next generation of managed cloud native APIs via Dapr...
Cassie Coyle

Software Engineer, Diagrid
Cassie, a devoted software engineer at Diagrid, actively contributes to Dapr, focusing on Go backend development to simplify the creation of resilient, event-driven, and microservices-based apps. She is a member of the Dapr Day and AppDeveloperCon 2024 program committees. Her work...
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 2 | 250 AD
  SDLC

3:25pm MST

CEL-Ebrating Simplicity: Mastering Kubernetes Policy Enforcement - Kevin Conner, Getup Cloud & Anish Ramasekar, Microsoft
Wednesday November 13, 2024 3:25pm - 4:00pm MST
As Kubernetes deployments grow increasingly complex, robust policy enforcement is crucial. The Common Expression Language (CEL) provides a powerful solution, enabling the creation of sophisticated, human-readable expressions for Kubernetes policies. This session explores CEL's integration with Kubernetes, simplifying policy definition and enforcement. Key takeaways:
  - Fundamentals of CEL and its Kubernetes integration.
  - Practical use cases for CEL in admission control, resource management, and security.
  - Enhancing policy expressiveness and flexibility with CEL.
  - Introduction to CEL Playground for testing and validating CEL expressions.
Through live demos, learn to leverage CEL and CEL Playground for streamlined policy management in Kubernetes. Ideal for administrators, developers, and DevOps professionals, this session equips you to enhance your Kubernetes policies using CEL. Join us to discover how CEL and CEL Playground can transform your Kubernetes policy management.
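Since CEL expressions are evaluated by the API server itself, here is the logic of a typical (made-up) admission expression, `object.spec.replicas <= 5 && has(object.metadata.labels.team)`, re-expressed in Python purely to illustrate what such a policy checks:

```python
# Python stand-in for a made-up CEL admission expression:
#   object.spec.replicas <= 5 && has(object.metadata.labels.team)
# In a real cluster this string lives in a ValidatingAdmissionPolicy
# and is evaluated by the API server, not by user code.

def admit(obj: dict) -> bool:
    spec = obj.get("spec", {})
    labels = obj.get("metadata", {}).get("labels", {})
    return spec.get("replicas", 0) <= 5 and "team" in labels
```

CEL Playground lets you test the expression form directly against sample objects before deploying it.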
Speakers

Anish Ramasekar

Principal Software Engineer, Microsoft
Anish Ramasekar is a software engineer at Microsoft. He is on the Azure Container Upstream team building features for Kubernetes upstream and various CNCF projects that are part of the Azure Kubernetes Service. Anish is a maintainer of the Secrets Store CSI Driver project.
Kevin Conner

Chief Engineer, Getup Cloud
Kevin Conner is the Chief Engineer at Getup Cloud, a startup focused on Kubernetes and DevSecOps. He has worked at startups like Integrated Micro Products, Arjuna Technologies, JBoss, and Aviatrix, as well as Sun Microsystems and Red Hat, where he led teams for Cloud Enablement, Service...
Wednesday November 13, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | 151 G
  Security

4:30pm MST

Making Kubernetes Simpler for Accelerated Workloads - Susan Wu, Google; Lucy Sweet, Uber; Mitch McKenzie, Weave; Aditya Shanker, Crusoe; Rebecca Weekly, Geico
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Kubernetes and the open-source ecosystem for AI frameworks have been great for LLM innovation, empowering developers to build applications that use natural language as the interface to data. Yet many developers and cluster operators struggle to put these frameworks into production use. In this session, hear from several platform engineers responsible for designing core infrastructure supporting accelerated workloads, services, and large language model training and inference pipelines. You can expect to come away with guidance, hear of pitfalls to watch out for, and learn how they successfully abstracted away the infrastructure complexity to improve their research users' experience and velocity. Panelists include: Lucy Sweet, Senior Software Engineer (Infrastructure), Uber; Mitch McKenzie, Site Reliability Engineer - Machine Learning Operations, Weave; and Susan Wu, Outbound Product Manager, Google.
Speakers
Susan Wu

Outbound Product Manager, Google
Susan is an Outbound Product Manager for Google Cloud, focusing on GKE Networking and Network Security. She previously led product and technical marketing roles at VMware, Sun/Oracle, Canonical, Docker, Citrix, and Midokura (part of Sony Group). She is a frequent speaker at conferences...
Lucy Sweet

Senior Software Engineer, Uber
Lucy is a Senior Software Engineer at Uber Denmark who works on software infrastructure.
Aditya Shanker

Senior PM, Crusoe
Aditya is a Senior PM at Crusoe Cloud, responsible for building out managed orchestration and platform services.
Salt Palace | Level 2 | 255 B
  AI + ML

4:30pm MST

Platform Performance Optimization for AI - a Resource Management Perspective - Antti Kervinen, Intel & Dixita Narang, Google
Wednesday November 13, 2024 4:30pm - 5:05pm MST
How much can node resource management affect AI workload performance? What options are there? What is the trade-off between total throughput and low latencies? In this talk, we take a systematic approach to Platform Performance Optimization. We walk through the whole path from goal setting through data gathering, analysis, and visualization to conclusions. At each stop along the path we share our practical experiences from a case of LLM inference optimization. You will find many considerations, findings, and practical tricks to take away: for instance, how to instrument PyTorch without touching the source or a container image, how to change what we are measuring without new expensive benchmark reruns, and how much more we can learn from visualizations compared to numeric averages and percentiles. Finally, we share real results from our case: how resource management increased total token throughput per worker node by more than 3.5x over the baseline.
Speakers
Antti Kervinen

Cloud Orchestration Software Engineer, Intel
Antti Kervinen is a Cloud Orchestration Software Engineer working at Intel, whose interest in Linux and distributed systems has led him from academic research of concurrency to the world of Kubernetes. When unplugged, Antti spends his time outdoors discovering wonders of nature.
Dixita Narang

Software Engineer, Google
Dixita Narang is a Software Engineer at Google on the Kubernetes Node team. With a primary focus on resource management within Kubernetes, Dixita is deeply involved in the development and advancement of the Memory QoS feature, which is currently in the alpha stage. She is a new contributor... Read More →
Salt Palace | Level 2 | 255 E
  AI + ML

4:30pm MST

From Observability to Performance - Nadia Pinaeva, Red Hat & Antonio Ojea, Google
Wednesday November 13, 2024 4:30pm - 5:05pm MST
No matter how fast the Services on your Kubernetes cluster are, users would love them to be faster. But how do you get from a huge pile of metrics across a distributed system to real user experience improvements? There is a way, and with the right tools and the right approach, you can better understand and evaluate Service performance. In this talk, you'll learn how to identify the performance parameters that directly translate to user experience. We will explore how to collect performance metrics from running Kubernetes clusters without disrupting normal operations using tools like Prometheus, Grafana, kube-burner, and custom instrumentation. We will discuss how to translate the collected metrics and analysis into concrete actions and how to identify bottlenecks and implement optimizations to enhance Service performance. This talk is ideal for k8s networking developers, administrators, SREs, DevOps engineers, and anyone responsible for managing or optimizing Kubernetes networking.
Speakers
Antonio Ojea

Software Engineer, Google
Antonio Ojea is a Software Engineer at Google, where he works on Kubernetes. He is one of the top contributors to the Kubernetes project, with a strong presence in the areas of networking and reliability. He has vast experience in Open Source, networking and distributed systems... Read More →
Nadia Pinaeva

Senior Software Engineer, Red Hat
Nadia Pinaeva is a Senior Software Engineer at Red Hat working on OpenShift Networking. She collaborates with SIG Network Policy to improve network security for Kubernetes clusters, and works on the ovn-kubernetes network plugin.
Salt Palace | Level 1 | 155 E
  Connectivity

4:30pm MST

Experience in Designing & Implementing a Cloud Native Framework for Farm Data Analytics - Braulio Dumba, IBM & Gloire Rubambiza, Cornell University
Wednesday November 13, 2024 4:30pm - 5:05pm MST
This work is based on 17 months of experience managing a digital agriculture platform that has aggregated and processed tens of gigabytes of data on 1,500 cows on a commercial dairy farm. Significant challenges tied to multi-cluster management, fault tolerance, and privacy surfaced as the number of applications and farm management models grew. To bridge this gap, we designed and implemented a cloud native networked system for multi-cluster configuration and management of farm data analytics that leverages KubeStellar and the Software-Defined Farm paradigm. Our experience from designing, implementing, and deploying this framework showcases how Kubernetes can enable farmers and agribusinesses to leverage the power of containerization and cloud-native computing to optimize workflows and streamline agricultural operations. This work presents progress toward cloud-native, scalable, and fault-tolerant data analytics in digital farming with potential environmental, financial, and societal impacts.
Speakers
Braulio Dumba

Staff Research Scientist, IBM
Dr. Braulio Dumba is a Staff Research Scientist at IBM Research. In 2018, he joined IBM under the Hybrid Cloud organization. His current research focuses on edge computing and hybrid cloud computing. Dr. Dumba earned a Ph.D. in Computer Science from the University of Minnesota, Twin... Read More →
Gloire Rubambiza

Ph.D. Candidate, Cornell University
Dr. Gloire Rubambiza is a postdoctoral associate in CS at Cornell University, where he conducts research in hybrid cloud computing for digital agriculture with an emphasis on societal impact. At Cornell, he was a University Fellow, a fellow of NSF National Research Traineeship in... Read More →
Salt Palace | Level 2 | 254 B
  Emerging + Advanced

4:30pm MST

Perform Laser Focused Deployments by Deciding in Advance the Blast Radius - Kostis Kapelonis, Octopus Deploy
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Progressive Delivery is an advanced deployment method that allows for zero-downtime application releases. Argo Rollouts is a Kubernetes controller that lets you adopt progressive delivery in the form of blue/green and canary deployments. We see a lot of teams choose an arbitrary number of clients that access the new version of a canary. Yes, it is very easy to send only 10% of the traffic to the new version of a Kubernetes Deployment. But sometimes you want to choose WHICH 10% sees the new version. In this talk we will look at several approaches to pinning specific clients to the old or new version, and advanced scenarios for sending canary traffic only to a specific subset of users, such as internal employees or customers who have expressed interest in seeing brand-new releases as soon as possible.
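As a concrete sketch of the "which 10%" idea (hedged: the names here are invented for illustration, and setHeaderRoute assumes a supported traffic router such as Istio plus a matching managedRoutes entry), an Argo Rollouts canary can first pin a specific client group to the new version via a request header, then widen to a blind traffic percentage:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-rollout
spec:
  strategy:
    canary:
      canaryService: demo-canary      # Services split by the traffic router
      stableService: demo-stable
      trafficRouting:
        managedRoutes:
        - name: internal-users        # route reserved for header-based pinning
        istio:
          virtualService:
            name: demo-vsvc
      steps:
      - setHeaderRoute:               # pin a chosen client group, not a blind 10%
          name: internal-users
          match:
          - headerName: x-internal-user
            headerValue:
              exact: "true"
      - pause: {}                     # hold while the pinned group validates
      - setWeight: 10                 # then widen to 10% of all traffic
      - pause: {duration: 1h}
```

With this shape, only requests carrying the header reach the canary during the first step; everyone else stays on the stable version until the weight-based steps begin.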
Speakers
Kostis Kapelonis

Developer Advocate, Codefresh by Octopus Deploy
Kostis is a software engineer/technical-writer dual class character. He lives and breathes automation, good testing practices and stress-free deployments with GitOps.
Salt Palace | Level 2 | 250 AD
  SDLC

4:30pm MST

Expanding the Capabilities of Kubernetes Access Control - Jimmy Zelinskie, authzed & Lucas Käldström, Upbound
Wednesday November 13, 2024 4:30pm - 5:05pm MST
Kubernetes RBAC is an effective way of managing ACLs in one cluster. However, there are many other effective paradigms out there, such as Attribute- and Relation-based Access Control. In this talk, we'll demystify how these differ and when to use each paradigm, giving context and guidance. We'll highlight how Kubernetes access control has recently evolved toward supporting many different use cases. We take this opportunity to cover multiple perspectives: security within a single cluster (zooming in) and security within real-life production environments with external services and multiple clusters (zooming out). Just as containers became ubiquitous thanks to excellent tools like Docker, we believe the same can and will happen for access control, yielding uniform, interoperable, and understandable authorization. Finally, we'll propose future work that could be done to supercharge Kubernetes and ensure it keeps up with the ever-increasing security requirements in our industry.
Speakers
Lucas Käldström

Senior Software Engineer, Upbound
Lucas is a Kubernetes and cloud native expert who has been serving the CNCF community in lead positions for six years. He was awarded Top CNCF Ambassador 2017 alongside Sarah Novotny. Lucas was a co-lead for SIG Cluster Lifecycle, co-created kubeadm and Weave Ignite, and ported Kubernetes to... Read More →
Jimmy Zelinskie

Co-founder, authzed
Jimmy Zelinskie is a software engineer and product leader with a goal of democratizing software via open source development. He's currently CPO of authzed where he's focused on bringing hyperscaler best-practices in authorization to the industry at large. At CoreOS, he helped pioneer... Read More →
Salt Palace | Level 1 | 151 G
  Security

4:30pm MST

Tutorial: Get the Most Out of Your GPUs on Kubernetes with the GPU Operator - Eduardo Arango Gutierrez, Tariq Ibrahim, Amanda Moran & Christopher Desiniotis, NVIDIA; David Porter, Google
Wednesday November 13, 2024 4:30pm - 6:00pm MST
NVIDIA's GPU Operator has become the de facto standard for managing GPUs in Kubernetes at scale. This tutorial provides in-depth, hands-on training on the various GPU sharing techniques that are possible with the GPU Operator. Participants will learn to deploy jobs utilizing these sharing techniques and get hands-on experience with the installation and configuration of the NVIDIA GPU Operator itself. This includes an in-depth exploration of its two primary CRDs: ClusterPolicy and NVIDIADriver. These CRDs are essential for configuring GPU-accelerated nodes, enabling GPU sharing mechanisms, and performing GPU driver upgrades. The session will culminate with practical use cases, such as training an AI/ML model, giving participants firsthand experience in managing a GPU-accelerated Kubernetes cluster.
Speakers
Christopher Desiniotis

Senior Systems Software Engineer, NVIDIA
Christopher Desiniotis is a Senior Systems Software Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator, a widely used tool for managing GPUs in Kubernetes, and is focused on increasing... Read More →
David Porter

Staff Software Engineer, Google
David Porter is a Staff Software Engineer at Google on the Kubernetes node team. David's focus is on the kubelet node agent and the resource management area. He is the primary maintainer of cAdvisor, a resource monitoring library widely used in Kubernetes, a reviewer for SIG Node, and... Read More →
Eduardo Arango Gutierrez

Senior Systems Software Engineer, NVIDIA
Eduardo is a Senior Systems Software Engineer at NVIDIA, working on the Cloud Native Technologies team. Eduardo has focused on enabling users to build and deploy containers on distributed environments.
Tariq Ibrahim

Senior Software Engineer, NVIDIA
Tariq Ibrahim is a Senior Cloud Platform Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator. He has also contributed to several cloud native OSS projects like kube-state-metrics, Istio... Read More →
Amanda Moran

NVIDIA
Amanda has been working in technology since graduating from SCU in 2012 with a Master of Science in CS. Prior to this, she graduated with a BS in Biology from UW. Amanda has worked the last 12 years as a Software Engineer, a Solutions Architect, and an Engineering Manager... Read More →
Salt Palace | Level 1 | Grand Ballroom G
  Tutorials, AI + ML

5:25pm MST

Building Resilience for Large-Scale AI Training: GPU Management, Failure Detection, and Beyond - Ganeshkumar Ashokavardhanan, Microsoft & Ace Eldeib, Cohere
Wednesday November 13, 2024 5:25pm - 6:00pm MST
As AI training scales to thousands of GPUs across hundreds of machines, hardware failure becomes an expensive risk. From GPU faults to network performance degradation, undetected problems can sabotage training jobs, inflate costs, and slow development. This talk dives into failure and orchestration challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate the fallout of GPU failures. Drawing on experience from cloud providers and training large language models, we will share best practices for efficient identification, remediation, and prevention of GPU failures.
Speakers
Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this Kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine... Read More →
Ace Eldeib

Staff Software Engineer, Cohere
Ace is a Staff Software Engineer at Cohere working on training and serving infrastructure for large language models. Prior to that, he worked on Azure Kubernetes service and ran self-managed Kubernetes for other Azure services.
Salt Palace | Level 1 | 155 E
  AI + ML

5:25pm MST

Production AI at Scale: Cloudera’s Journey in Building a Robust Inference Platform - Zoram Thanga & Peter Ableda, Cloudera
Wednesday November 13, 2024 5:25pm - 6:00pm MST
In this session, we talk about Cloudera AI Inference Service, a secure, large-scale platform for generative AI and predictive inference workloads, built using state-of-the-art Kubernetes, CNCF, and Apache open source projects. We take the audience through our journey in building this platform and share the experiences we gained along the way. The platform is built with openness, security, scalability, performance, and standards compliance as guiding principles. We demonstrate that it is possible to be open and secure at the same time, and that organizations can incorporate production-grade AI inferencing into their Big Data environments. This session will cover the architecture of the platform and explain how we handle performance, scaling, authentication, fine-grained authorization, and audit logging, all of which are critical considerations for production inferencing.
Speakers
Peter Ableda

Director, Product Management, Cloudera
Peter Ableda is the Director of Product Management for Cloudera’s AI product suite, bringing over a decade of experience in data management and advanced analytics. Holding a Master of Science degree in Computer Science from the Budapest University of Technology, Peter has dedicated... Read More →
Zoram Thanga

Principal Engineer, Cloudera
Zoram is a Principal Engineer, Enterprise AI Platform, at Cloudera. He has been working in the software industry for over 23 years and has been involved in building clustering software, containers, file systems, analytical query engines, and ML/AI platforms. He is a committer in the... Read More →
Salt Palace | Level 2 | 255 E
  AI + ML

5:25pm MST

Creating Paved Paths for Platform Engineers - Ritesh Patel, Nirmata; Abby Bangser, Syntasso; Viktor Farcic, Upbound; Nicholas Morey, Akuity; Praseeda Sathaye, Amazon
Wednesday November 13, 2024 5:25pm - 6:00pm MST
The platform engineering team's role has evolved into a pivotal one as the custodian of the internal developer platform. However, these teams often find themselves in a quagmire of identifying the right components to include in their platforms, particularly in the ever-expanding CNCF landscape. This panel session discusses these challenges by exploring the concept of 'Paved Paths' as a strategic approach to guide platform teams in their journey of building an internal developer platform (IDP). 'Paved Paths' offers a solution by providing platform engineering teams with proven reference architectures (e.g. CNOE and the BACK Stack). This approach prevents them from starting from scratch and getting lost in the vast CNCF landscape. By offering proven and opinionated reference architectures, platform teams can focus on enhancing developer experiences and optimizing higher-level workflows rather than grappling with the complexities of identifying foundational components for their IDP.
Speakers
Viktor Farcic

Developer Advocate, Upbound
Viktor Farcic is a lead rapscallion at Upbound, a member of the CNCF Ambassadors, Google Developer Experts, CDF Ambassadors, and GitHub Stars groups, and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox.
Ritesh Patel

Co-Founder & VP Product, Nirmata
Ritesh Patel is Co-founder and leads Products at Nirmata, the creators of Kyverno. At Nirmata, he is responsible for commercial products for security and operations (SecOps) automation powered by policy as code. He also leads key technology partnerships. Ritesh has 20+ years of experience... Read More →
Praseeda Sathaye

Principal Specialist Solution Architect, Amazon (AWS)
Praseeda Sathaye is a Principal Specialist SA for App Modernization and Containers at Amazon Web Services based in Bay Area California. She has been focused on helping customers speed their cloud-native adoption journey by modernizing their platform infrastructure, internal architecture... Read More →
Nicholas Morey

Senior Developer Advocate, Akuity
Nicholas Morey is a Platform Engineer with a passion for DevOps practices. He is on the team at Akuity as a Developer Advocate, working with the community on anything Argo and Kargo-related. He is an experienced Argo CD operator and a Certified Kubernetes Administrator.
Abby Bangser

Principal Engineer, Syntasso
Abby is a Principal Engineer at Syntasso delivering Kratix, an open-source cloud-native framework for building internal platforms on Kubernetes. Her keen interest in supporting internal development comes from over a decade of experience in consulting and product delivery roles across... Read More →
Salt Palace | Level 1 | Grand Ballroom H
  Platform Engineering

5:25pm MST

Taming Your Application’s Environments - Marcos Lilljedahl, Dagger & Mauricio "Salaboy" Salatino, Diagrid
Wednesday November 13, 2024 5:25pm - 6:00pm MST
How coupled are your application's code and pipelines to its target cloud or on-prem environment? Kubernetes helps us abstract how we run our workloads. However, there are other aspects, like infrastructure dependencies, service configuration, the build process, and deployment descriptors, which need to be considered to make an application portable across multiple environments. Focusing on these aspects makes a big difference when migrating apps to reduce costs, meet compliance requirements, or leverage specific tech only available somewhere else. Join us to cover three techniques you can implement to level up your SDLC:
- Modularizing and enhancing your delivery pipelines to simplify complex environments (Crossplane and Dagger)
- Building consistent experiences around well-known interfaces (CloudEvents, Dapr, and OpenFeature) to minimize runtime drift
- Designing with separation of concerns to enable fast feedback loops between development and operations teams (Argo CD, Knative)
Speakers
Marcos Lilljedahl

Software Engineer, Dagger
Dad, Docker Captain, OSS lover, helmsman and wine drinker. Father of a joyful kid and wannabe surfer. I like listening to jazz music and tinker with some fun projects when possible. Avid open source contributor.
Mauricio Salatino

OSS Software Engineer, Diagrid
Mauricio works as an Open Source Software Engineer at @Diagrid, contributing to and driving initiatives for the Dapr OSS project. Mauricio also serves as a Steering Committee member for the Knative project and co-leads the Knative Functions initiative. He published a book titled... Read More →
Salt Palace | Level 2 | 250 AD
  SDLC

5:25pm MST

From Observability to Enforcement: Lessons Learned Implementing eBPF Runtime Security - Anna Kapuścińska & Kornilios Kourtis, Isovalent
Wednesday November 13, 2024 5:25pm - 6:00pm MST
eBPF is getting widely adopted in cloud native runtime security tools like Falco, KubeArmor, and Tetragon. Using eBPF we can collect relevant security events right in the kernel and pass them to Security Engineers for retroactive attack detection and response. Having reliable and complete visibility is great, but wouldn't it be even better to proactively prevent attacks in progress? This talk covers the Tetragon team’s experience moving from security observability to enforcement and lessons learned along the way: from defining security models to hardening interactions between the local kernel and distributed Kubernetes systems. It will deep dive into how eBPF-based enforcement works, why it differs from observability, and the challenges of implementing it. The audience will walk away understanding the inner workings and common pitfalls of eBPF-based runtime security.
Speakers
Kornilios Kourtis

Software Engineer, Isovalent at Cisco
I am a software engineer at Isovalent, working on cloud-native networking, security, and observability using eBPF. Before that, I worked in industrial (IBM) and academic research (ETH Zurich, NTU Athens) in systems, including operating systems, storage and network stacks, and high-performance... Read More →
Anna Kapuscinska

Software Engineer, Isovalent at Cisco
Anna is a software engineer at Isovalent, focusing on eBPF-based observability and security. Her previous roles span the industry: she wore both developer and SRE hats, and worked in AdTech, FinTech, public healthcare, end-user SaaS company and a hosting provider. On good weather... Read More →
Salt Palace | Level 1 | 151 G
  Security

6:00pm MST

🪧 Poster Session (PS01): Climatik: Cloud Native Sustainable LLM via Power Capping - Chen Wang, IBM & Vincent Hou, Bloomberg L.P.
Wednesday November 13, 2024 6:00pm - 8:00pm MST
As GenAI workloads grow, the need for advanced accelerators with higher power consumption is surging. NVIDIA GPU peak power has risen from 300W for the V100 to 1000W for the B100. However, current power infrastructure and cooling systems are not designed to handle rapid power increases, leading to challenges like limited accelerator deployment in some regions or overheating risks that could cause fire hazards. We propose Climatik, a dynamic power capping system that enables data center and cluster admins and developers to set power caps dynamically at the cluster, service namespace, and rack levels. Climatik leverages Kepler for observability and offers APIs for integration with Kubernetes control knobs, including autoscalers, schedulers, and queuing systems, to ensure power caps are maintained across all levels. We will demo how to use Climatik to configure power capping for a large language model (LLM) inference service on KServe and show how power capping influences KEDA autoscaling.
Speakers
Chen Wang

Senior Research Scientist, IBM
Chen Wang is a Staff Research Scientist at the IBM T.J. Watson Research Center. Her interests lie in Kubernetes, Container Cloud Resource Management, Cloud Native AI systems, and applying AI in Cloud system management. She is an open-source advocate, a Kubernetes contributor, and... Read More →
Vincent Hou

Senior Software Engineer, Bloomberg L.P.
Vincent Hou is a Chinese software engineer, who used to study in Belgium and is currently working in US. He has been an active open source contributor, since 2010. He used to be an active contributor to Cinder project, OpenStack block storage service, and a core committer of OpenWhisk... Read More →
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
  🪧 Poster Sessions, AI + ML

6:00pm MST

🪧 Poster Session (PS02): 0.0.0.0 Day: Exploiting Localhost APIs from the Browser - Mic McCully, Oligo
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Browser-based attacks are not new in the malicious landscape of attack patterns, and browsers remain a popular infiltration method for attackers. While seemingly local, services running on localhost are accessible to the browser through a flaw we found, exposing the ports on the localhost network interface and leaving the floodgates ajar to remote network attacks. In this live demo and attack simulation we'll unveil a zero-day vulnerability (still under responsible disclosure) in Chrome and other browsers, and show how we use the 0-day to attack developers behind firewalls. We will demonstrate remote code execution on a wildly popular open-source platform serving millions in the data engineering ecosystem that appears to run only on localhost. In our talk, we will present novel attack techniques targeting developers and employees within an organization who are behind firewalls. This will be a first-ever deep dive into this newly discovered zero-day vulnerability.
Speakers
Mic McCully

Field CTO, Oligo Security
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
  🪧 Poster Sessions, Security

6:00pm MST

🪧 Poster Session (PS03): Unleashing the Power of Init and Sidecar Containers in Kubernetes - Carlos Sanchez & Natalia Angulo, Adobe
Wednesday November 13, 2024 6:00pm - 8:00pm MST
This session dives deep into the power of init and sidecar containers, the issues they solve, and why they are so useful when managing Kubernetes workloads. We will explore real-world use cases that show how these tools can:
* Simplify complex deployments: break down intricate deployments into manageable steps.
* Enhance security: isolate security-critical tasks within your pods and ongoing security measures.
* Facilitate rapid and isolated changes: when everyone is interested in updating the same service, separation of concerns is critical for rapid development.
* Boost application functionality: utilize sidecar containers to inject essential functionalities like logging, monitoring, and networking capabilities without modifying your main application code.
Our goal is to share our experience and challenges managing thousands of environments in Kubernetes, how we manage init and sidecar containers, and what problems they solve for us.
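The pattern above can be sketched in a single Pod spec (a minimal illustration with invented names and images, not Adobe's actual setup): an init container performs one-off setup before the app starts, while a native sidecar, expressed as an init container with restartPolicy: Always (a beta feature since Kubernetes 1.29), runs for the Pod's whole lifetime to ship logs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-helpers
spec:
  initContainers:
  - name: fetch-config              # runs to completion before the app starts
    image: busybox:1.36
    command: ["sh", "-c", "wget -O /config/app.conf http://config-server/app.conf"]
    volumeMounts:
    - {name: config, mountPath: /config}
  - name: log-shipper               # native sidecar: restarts for the Pod's lifetime
    image: fluent/fluent-bit:3.0
    restartPolicy: Always
    volumeMounts:
    - {name: logs, mountPath: /logs}
  containers:
  - name: app
    image: example.com/app:1.0      # hypothetical application image
    volumeMounts:
    - {name: config, mountPath: /etc/app}
    - {name: logs, mountPath: /var/log/app}
  volumes:
  - {name: config, emptyDir: {}}
  - {name: logs, emptyDir: {}}
```

The separation of concerns is visible in the spec itself: the config-fetching and log-shipping teams can change their containers without touching the main application image.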
Speakers
Natalia Angulo

Software Developer Engineer, Adobe
Natalia Angulo is a Software Development Engineer at Adobe Experience Manager, contributing to Site Reliability tasks and the development of new features inside AEM, and specially helping with their infrastructure management. She is passionate about maths, coding puzzles and teaching... Read More →
Carlos Sanchez

Principal Scientist, Adobe
Carlos Sanchez is a Principal Scientist at Adobe Experience Manager, specializing in software automation, from build tools to Continuous Delivery and Progressive Delivery. Involved in Open Source for over 20 years, he is the author of the Jenkins Kubernetes plugin and a member of... Read More →
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session (PS04): Optimizing Pod Affinity in Kubernetes: A Mathematical Approach to Workload Placement - Jack Xue, Microsoft
Wednesday November 13, 2024 6:00pm - 8:00pm MST
A standout feature of Kubernetes is its sophisticated mechanism for pulling container images from repositories, aligning containers with the appropriate pods, and strategically deploying pods to nodes that meet their resource requirements, such as CPU, GPU, RAM, network, and storage. This process adheres to the defined affinity and anti-affinity specifications between pods and nodes. Despite these capabilities, optimally arranging a multitude of workloads, each comprising several pods within a cluster, remains an ongoing endeavor. In our research, we illustrate that a set of YAML files detailing a workload deployment request can be systematically transformed into a Binary Integer Linear Programming (BILP) model. Depending on the specific optimization goals, the objective function of the model can be tailored accordingly. With the imposition of broad conditions, it is feasible to derive an optimal solution that adheres to polynomial time complexity constraints.
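The transformation the poster describes can be illustrated with a toy model (a hedged sketch: the pod sizes, node capacities, and helper names are invented for illustration, not the authors' formulation). Each binary variable x[p][n] is 1 if pod p lands on node n; constraints enforce per-node capacity and anti-affinity, and a brute-force search over the binary assignments stands in for a real BILP solver:

```python
from itertools import product

# Toy pod-placement model (illustrative only; a real deployment would hand the
# same variables and constraints to a BILP solver rather than enumerating).
pods = {"web": 2, "db": 4, "cache": 1}      # CPU request per pod
nodes = {"n1": 4, "n2": 4}                  # CPU capacity per node
anti_affinity = [("web", "db")]             # pod pairs that must not share a node

def feasible(assign):
    # capacity constraint: total requests on each node stay within capacity
    for n, cap in nodes.items():
        if sum(pods[p] for p, a in assign.items() if a == n) > cap:
            return False
    # anti-affinity constraint: listed pairs must land on different nodes
    for a, b in anti_affinity:
        if assign[a] == assign[b]:
            return False
    return True

def best_placement():
    # objective: minimize the CPU imbalance between nodes
    best, best_cost = None, float("inf")
    for choice in product(nodes, repeat=len(pods)):
        assign = dict(zip(pods, choice))
        if not feasible(assign):
            continue
        used = [sum(pods[p] for p in assign if assign[p] == n) for n in nodes]
        cost = max(used) - min(used)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best

print(best_placement())
```

The objective function here balances load, but, as the abstract notes, it could equally be tailored to minimize node count or maximize headroom.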
Speakers
Jack Xue

Principal Cloud Solution Architect, Microsoft
PhD & MBA. Principal Cloud Solution Architect, Microsoft
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session (PS06): What's Happening with SPIFFE and WIMSE? - Daniel Feldman, Qusaic
Wednesday November 13, 2024 6:00pm - 8:00pm MST
This session will be a very brief overview of what's going on with the SPIFFE and WIMSE identity standards projects. SPIFFE is a CNCF effort to standardize workload identity implementations. That is, a SPIFFE implementation can grant services unique identities and credentials. WIMSE is an IETF effort to build on the SPIFFE foundation. In particular, it adds a new, unique token format that allows securely recording multi-hop identity information. Implementors will be able to use this token format to build complete, end-to-end, cryptographically auditable identity records.
Speakers
Daniel Feldman

Founder, Qusaic
Daniel Feldman has worked with many companies, large and small, to deploy SPIFFE and SPIRE zero-trust identity.
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
  🪧 Poster Sessions, Security

6:00pm MST

🪧 Poster Session (PS07): Unleashing the Power of Prediction to Proactively Scale Control Plane Components - Anubhav Aeron & Ryan Tay, Intuit
Wednesday November 13, 2024 6:00pm - 8:00pm MST
At Intuit, our control plane components such as IstioD are responsible for hundreds of applications per cluster: IstioD configures the data plane and injects the istio-proxy container. An increase in application traffic means an increase in application pods, which requires the control plane to scale up. For critical control planes such as IstioD, it is wise to scale proactively rather than as a reaction to increased load. With traditional approaches to scaling in advance, like tuning HPA thresholds, we might pre-scale even when not required due to outliers, which can be wasteful. At Intuit, a novel deep learning forecasting model called N-HiTS was employed to solve this issue. This session will discuss and demo how we train N-HiTS, our most important model features, and how we deploy our service on a per-cluster basis to provide contextualized predictions for cost-effective and on-time auto-scaling.
Speakers

Anubhav Aeron

Senior Staff SE, Coupang
Anubhav is a seasoned software engineer in the field of Cloud Native Technologies and has been working with Kubernetes and Service Mesh since 2016. He developed Redis Cluster as a Service and a Templating Engine while working at Yahoo! He is the lead maintainer of Admiral, which is an...

Ryan Tay

Software Engineer, Intuit Inc.
As a software engineer on the Service Mesh team at Intuit, Ryan works to support Intuit's extensive Istio deployment through contributions to projects like Admiral. He has previously worked to reduce costs of cloud development environments for the Intuit API Gateway team. His main...
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session (PS09): Kubernetes as a Geographically Distributed System - Chris Friesen, Wind River Systems
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Kubernetes was designed to be the best container orchestration platform on top of a cloud infrastructure in a single data center. What do you do when you want to grow your deployment across multiple geographical locations while still keeping it part of one system? You will have to face complexity and figure out infrastructure management at massive scale, and neither of these is easy to tackle. However, you don't have to go back to the drawing board, because a platform that delivers on these requirements and expectations already exists: StarlingX. The StarlingX project is a fully integrated, open source cloud platform running in production at large telecom operators, who rely on its distributed cloud architecture along with next-level container orchestration support provided by Kubernetes. This talk will introduce the StarlingX platform, share highlights from its latest release, and show how it takes Kubernetes to the next level!
Speakers

Chris Friesen

Member of Technical Staff, Wind River Systems
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase

6:00pm MST

🪧 Poster Session (PS10): Accepting Mortality: Strategies for Ultra-Long Running Stateful Workloads in K8s - Sebastian Beyvers & Maria Hansen, Giessen University
Wednesday November 13, 2024 6:00pm - 8:00pm MST
"Pods are mortal" is a well-known quote in the official Kubernetes documentation. For ultra-long running stateful workloads that take months to complete, this mortality comes with its own challenges. How do you react to hardware failures? What resource quotas are appropriate? What if the workload has no built-in persistence and does all its work in memory? For such workloads, failures can be fatal, potentially wiping out months of work. This session will show that despite all the obstacles, Kubernetes can still be a reasonable choice for running stateful workloads that take months to complete. Using real-world examples based on production workflows, we will show how we design, configure, run, and operate such workloads using K8s and Argo workflows. We will also show how intelligent checkpointing using CRIU can help us deal with failures and enables us to avoid some problems even before they occur.
Speakers

Sebastian Beyvers

Distributed Systems Researcher, Giessen University
Sebastian Beyvers is a distributed systems researcher in bioinformatics and a cloud-native Rust developer at Giessen University. Sebastian's current work focuses on cloud-native data storage and processing solutions that try to harmonize existing national and international data ecosystems...

Maria Hansen

Research Associate, Giessen University
Maria Hansen is a research assistant in the field of (bio)informatics at Justus Liebig University Giessen. She is currently working on a cloud-native data orchestration system that aims to unite existing national and international data ecosystems.
Wednesday November 13, 2024 6:00pm - 8:00pm MST
Salt Palace | Level 1 | Halls A-C + 1-5 | Solutions Showcase
 
