KubeCon + CloudNativeCon North America 2024
In-person | November 12-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Salt Palace | Level 2 | 255 E
Thursday, November 14
 

11:00am MST

From Vectors to Pods: Integrating AI with Cloud Native - Rajas Kakodkar, Broadcom; Kevin Klues, NVIDIA; Joseph Sandoval, Adobe; Ricardo Rocha, CERN; Dawn Chen, Google
Thursday November 14, 2024 11:00am - 11:35am MST
The rise of AI is challenging long-standing assumptions about running cloud native workloads. AI demands hardware accelerators, vast data, efficient scheduling and exceptional scalability. Although Kubernetes remains the de facto choice, feedback from end users and collaboration with researchers and academia are essential to drive innovation, address gaps and integrate AI in cloud native. This panel features end users, AI infra researchers and leads of the CNCF AI and Kubernetes device management working groups, focused on:
  • Expanding beyond LLMs to explore AI for cloud native workload management, memory usage and debugging
  • Challenges with scheduling and scaling of AI workloads from the end user perspective
  • OSS projects and innovation in AI and cloud native in the CNCF landscape
  • Improving resource utilization and performance of AI workloads
The next decade of Kubernetes will be shaped by AI. We don’t yet know what this will look like, so come join us and discover it together.
Speakers

Dawn Chen

Principal Software Engineer, Google
Dawn Chen is a principal software engineer at Google. Dawn has worked on Kubernetes and Google Container Engine (GKE) since before the project was founded, and she has been one of the tech leads in both Kubernetes and GKE. Prior to Kubernetes, she was one of the tech leads for Google's internal...

Ricardo Rocha

Lead Platforms Infrastructure, CERN
Ricardo leads the Platform Infrastructure team at CERN with a strong focus on cloud native deployments and machine learning. For several years he has led the internal effort to transition services and workloads to cloud native technologies, as well as dissemination and training...

Kevin Klues

Distinguished Engineer, NVIDIA
Kevin Klues is a distinguished engineer on the NVIDIA Cloud Native team. Kevin has been involved in the design and implementation of a number of Kubernetes technologies, including the Topology Manager, the Kubernetes stack for Multi-Instance GPUs, and Dynamic Resource Allocation (DRA)...

Joseph Sandoval

Principal Product Manager, Adobe
Joseph Sandoval is a seasoned tech expert with 25 years in various roles running distributed systems and infrastructure platforms, and he thrives on empowering developers to scale their applications. An advocate for open source software, he harnesses its transformative power to champion change...

Rajas Kakodkar

Staff Software Engineer | Tech Lead, CNCF TAG Runtime, Broadcom
Rajas is a staff software engineer at Broadcom and a tech lead of the CNCF Technical Advisory Group, Runtime. He is actively involved in the AI working group in the CNCF. He is a Kubernetes contributor and has been a maintainer of the Kube Proxy Next Gen Project. He has also served...
Salt Palace | Level 2 | 255 E
  AI + ML
  • Content Experience Level Any

11:55am MST

Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet - Andrey Velichkevich, Apple & Yuki Iwai, CyberAgent, Inc.
Thursday November 14, 2024 11:55am - 12:30pm MST
Running model training on Kubernetes is challenging due to the complexity of AI/ML models, large training datasets, and various distributed strategies like data and model parallelism. It is crucial to configure failure handling, success criteria, and gang-scheduling for large-scale distributed training to ensure fault tolerance and elasticity. This talk will introduce the new Kubeflow TrainJob API, which democratizes distributed training and LLM fine-tuning on Kubernetes. The speakers will demonstrate how TrainJob integrates with Kubernetes JobSet to ensure scalable and efficient AI model training with a simplified Python experience for data scientists. Additionally, they will explain the innovative concept of reusable and extendable training runtimes within TrainJob. The speakers will highlight how these capabilities empower data scientists to rapidly iterate on their ML development, making Kubernetes more accessible and beneficial for the entire ML ecosystem.
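To make the building blocks concrete, here is a minimal sketch of submitting a distributed training job as a Kubernetes JobSet from Python using the official kubernetes client. This is not the TrainJob API presented in the talk, whose schema may differ; the image name, replica counts, and GPU request below are illustrative assumptions.

```python
# Minimal sketch: create a JobSet for distributed training via the Kubernetes
# Python client. Assumptions: the JobSet controller (jobset.x-k8s.io/v1alpha2)
# is installed, and the image/resource values are placeholders.
from kubernetes import client, config

config.load_kube_config()

jobset = {
    "apiVersion": "jobset.x-k8s.io/v1alpha2",
    "kind": "JobSet",
    "metadata": {"name": "pytorch-train", "namespace": "default"},
    "spec": {
        "replicatedJobs": [
            {
                "name": "worker",
                "replicas": 1,
                "template": {  # a batch/v1 Job template
                    "spec": {
                        "parallelism": 4,
                        "completions": 4,
                        "completionMode": "Indexed",
                        "template": {
                            "spec": {
                                "restartPolicy": "Never",
                                "containers": [
                                    {
                                        "name": "trainer",
                                        "image": "ghcr.io/example/trainer:latest",  # hypothetical image
                                        "resources": {"limits": {"nvidia.com/gpu": "1"}},
                                    }
                                ],
                            }
                        },
                    }
                },
            }
        ]
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="jobset.x-k8s.io",
    version="v1alpha2",
    namespace="default",
    plural="jobsets",
    body=jobset,
)
```

The TrainJob API described in the session layers a data-scientist-friendly Python interface and reusable training runtimes on top of primitives like this one.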
Speakers

Andrey Velichkevich

Senior Software Engineer, Apple
Andrey Velichkevich is a Senior Software Engineer at Apple and is a key contributor to the Kubeflow open-source project. He is a member of the Kubeflow Steering Committee and a co-chair of the Kubeflow AutoML and Training WG. Additionally, Andrey is an active member of the CNCF WG AI. He...

Yuki Iwai

Software Engineer, CyberAgent, Inc.
Yuki is a Software Engineer at CyberAgent, Inc. He works on the internal platform for machine-learning applications and high-performance computing. He is currently a Technical Lead for Kubeflow WG AutoML / Training. He is also an active member of Kubernetes WG Batch, a Job API reviewer...
Salt Palace | Level 2 | 255 E
  AI + ML
  • Content Experience Level Any

2:30pm MST

Unlocking Potential of Large Models in Production - Yuan Tang, Red Hat & Adam Tetelman, NVIDIA
Thursday November 14, 2024 2:30pm - 3:05pm MST
The recent paradigm shift from traditional ML to GenAI and LLMs has brought with it a new set of non-trivial LLMOps challenges around deployment, scaling, and operations that make building an inference platform to meet all business requirements an unsolved problem. This talk highlights these new challenges along with best practices and solutions for building out large, scalable, and reliable inference platforms on top of cloud native technologies such as Kubernetes, Kubeflow, KServe, and Knative. Which tools help effectively benchmark and assess the quality of an LLM? What type of storage and caching solutions enable quick auto-scaling and model downloads? How can you ensure your model is optimized for the specialized accelerators running in your cluster? How can A/B testing or rolling upgrades be accomplished with limited compute? What exactly do you monitor in an LLM? In this session we will use KServe as a case study to answer these questions and more.
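As a concrete starting point, here is a minimal sketch of deploying a model with a KServe InferenceService from Python. The group/version (serving.kserve.io/v1beta1) and predictor fields follow KServe's documented InferenceService schema as I understand it; the model format, storage URI, and GPU count are placeholder assumptions, not recommendations from the talk.

```python
# Minimal sketch: create a KServe InferenceService via the Kubernetes Python
# client. Assumptions: KServe is installed; the model format, storageUri, and
# GPU limit below are placeholders.
from kubernetes import client, config

config.load_kube_config()

isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llm-demo", "namespace": "default"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},  # assumed serving runtime/format
                "storageUri": "s3://example-bucket/models/llm-demo",  # hypothetical location
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="default",
    plural="inferenceservices",
    body=isvc,
)
```

The questions the session raises, such as autoscaling, caching of model weights, and A/B rollouts, sit on top of this basic resource.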
Speakers

Yuan Tang

Principal Software Engineer, Red Hat
Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he has led AI infrastructure and platform teams at various companies. He holds leadership positions in open source projects, including Argo, Kubeflow, and Kubernetes. He's also a maintainer and...

Adam Tetelman

Principal Product Architect, NVIDIA
Adam Tetelman is a principal architect at NVIDIA leading cloud native initiatives and CNCF engagements across the company, building inference platforms for NVIDIA AI Enterprise and DGX Cloud. He has degrees in computational robotics, computer & systems engineering, and cognitive science...
Salt Palace | Level 2 | 255 E
  AI + ML

3:25pm MST

Unlocking the Future of GPU Scheduling in Kubernetes with Reinforcement Learning - Nikunj Goyal, Adobe Systems & Aditi Gupta, Disney Plus Hotstar
Thursday November 14, 2024 3:25pm - 4:00pm MST
Scaling up multi-GPU setups with Kubernetes for large-scale ML projects has been a hot topic in both the AI and cloud communities. While Kubernetes can provide computing power by scheduling GPU nodes, issues such as resource fragmentation and low utilization hurt performance and drive up costs. Why reinforcement learning (RL) in particular? Unlike other algorithms, RL shines in its unique ability to continuously adapt to changing environments and efficiently handle complex, multi-dimensional objectives, making it particularly suitable for the dynamic and heterogeneous nature of Kubernetes clusters. In this talk, we will explore the current landscape of GPU scheduling and some state-of-the-art RL algorithms proposed for scheduling, dive deep into their current impact on Kubernetes and the possible use of RLHF, and help the audience gain insights into these new ways of scheduling GPUs on Kubernetes.
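As a toy illustration of the idea (not an algorithm from the talk), the sketch below treats node-pool selection as a simple epsilon-greedy bandit that learns to avoid fragmented GPU placements; the node classes, capacities, and reward function are made-up assumptions.

```python
import random
from collections import defaultdict

# Toy illustration only: an epsilon-greedy "scheduler" that learns which GPU
# node class to place a job on, rewarding placements that leave little stranded
# capacity. This is not a real Kubernetes scheduler plugin.

NODE_CLASSES = ["a100-8gpu", "a100-4gpu", "t4-2gpu"]  # hypothetical node pools
CAPACITY = {"a100-8gpu": 8, "a100-4gpu": 4, "t4-2gpu": 2}

q_values = defaultdict(float)  # running value estimate per node class
counts = defaultdict(int)
EPSILON = 0.1                  # exploration rate

def reward(node_class: str, requested_gpus: int) -> float:
    """Hypothetical reward: penalize infeasible placements, favor low fragmentation."""
    capacity = CAPACITY[node_class]
    if requested_gpus > capacity:
        return -1.0
    return 1.0 - (capacity - requested_gpus) / capacity

def place(requested_gpus: int) -> str:
    if random.random() < EPSILON:
        choice = random.choice(NODE_CLASSES)                   # explore
    else:
        choice = max(NODE_CLASSES, key=lambda n: q_values[n])  # exploit
    r = reward(choice, requested_gpus)
    counts[choice] += 1
    q_values[choice] += (r - q_values[choice]) / counts[choice]  # incremental mean
    return choice

for job_gpus in [1, 2, 4, 8, 2, 1, 4]:
    print(job_gpus, "GPU(s) ->", place(job_gpus))
```

Real RL schedulers would use a much richer state (cluster topology, queue contents, job history) and reward signal, which is where the algorithms surveyed in the talk come in.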
Speakers

Aditi Gupta

Software Developer Engineer
I'm Aditi Gupta, a Software Developer Engineer. A graduate of Asia's largest tech college for women, Indira Gandhi Delhi Technical University, I've been deeply immersed in cloud-native technologies and AI/ML advancements. Skilled in containerisation, micro-service architecture, and...

Nikunj Goyal

Developer, Adobe Systems
Hi, I am Nikunj Goyal, a developer at Adobe and a Maths major from IIT Roorkee. I have been working with AI and Machine Learning for some time, mainly with Generative AI and graph-based methods. I am a core part of the text-to-vector generation team at my org and previously worked...
Salt Palace | Level 2 | 255 E
  AI + ML

4:30pm MST

Which GPU Sharing Strategy Is Right for You? A Comprehensive Benchmark Study Using DRA - Kevin Klues & Yuan Chen, NVIDIA
Thursday November 14, 2024 4:30pm - 5:05pm MST
Dynamic Resource Allocation (DRA) is one of the most anticipated features to ever make its way into Kubernetes. It promises to revolutionize the way hardware devices are consumed and shared between workloads. In particular, DRA unlocks the ability to manage heterogeneous GPUs in a unified and configurable manner without the need for awkward solutions shoehorned on top of the existing device plugin API. In this talk, we use DRA to benchmark various GPU sharing strategies including Multi-Instance GPUs, Multi-Process Service (MPS), and CUDA Time-Slicing. As part of this, we provide guidance on the class of applications that can benefit from each strategy as well as how to combine different strategies in order to achieve optimal performance. The talk concludes with a discussion of potential challenges, future enhancements, and a live demo showcasing the use of each GPU sharing strategy with real-world applications.
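For orientation, here is a rough sketch of what requesting a GPU through DRA can look like. DRA is still evolving, so the API group/version (resource.k8s.io/v1beta1), the device class name, and the pod wiring below are assumptions that may not match a given Kubernetes release or GPU driver; check the APIs installed in your cluster before relying on anything like this.

```python
# Rough sketch (assumptions throughout): a ResourceClaim asking a DRA driver
# for one GPU, and a Pod that consumes that claim. Field names follow the
# structured-parameters DRA design around Kubernetes 1.31/1.32 and may differ
# on your cluster; "gpu.nvidia.com" is the device class name used by NVIDIA's
# example DRA driver.
resource_claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "single-gpu"},
    "spec": {
        "devices": {
            "requests": [
                {"name": "gpu", "deviceClassName": "gpu.nvidia.com"},
            ]
        }
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-workload"},
    "spec": {
        "resourceClaims": [
            {"name": "gpu", "resourceClaimName": "single-gpu"},
        ],
        "containers": [
            {
                "name": "app",
                "image": "ghcr.io/example/cuda-app:latest",  # hypothetical image
                "resources": {"claims": [{"name": "gpu"}]},
            }
        ],
        "restartPolicy": "Never",
    },
}
```

The sharing strategies compared in the talk (MIG, MPS, time-slicing) are then selected through configuration attached to claims and device classes like these rather than through the legacy device plugin API.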
Speakers

Kevin Klues

Distinguished Engineer, NVIDIA
Kevin Klues is a distinguished engineer on the NVIDIA Cloud Native team. Kevin has been involved in the design and implementation of a number of Kubernetes technologies, including the Topology Manager, the Kubernetes stack for Multi-Instance GPUs, and Dynamic Resource Allocation (DRA)...

Yuan Chen

Principal Software Engineer, NVIDIA
Yuan Chen is a Principal Software Engineer at NVIDIA, working on building NVIDIA GPU Cloud for AI. He served as a Staff Software Engineer at Apple from 2019 to 2024, where he contributed to the development of Apple's Kubernetes infrastructure. Yuan has been an active code contributor...
Salt Palace | Level 2 | 255 E
  AI + ML
  • Content Experience Level Any

5:25pm MST

Managing and Distributing AI Models Using OCI Standards and Harbor - Steven Zou & Steven Ren, VMware by Broadcom
Thursday November 14, 2024 5:25pm - 6:00pm MST
Just as container images are vital to cloud-native technology, AI models are crucial to AI technology. Effectively, conveniently, and safely managing, maintaining, and distributing AI models is critical for supporting workflows like AI model training, inference, and application deployment. This presentation explores AI model management based on OCI standards and the Harbor project: standardizing AI model structures and characteristics using OCI specifications and extension mechanisms such as OCI References to link datasets and dependencies. When large models require efficient loading or raise privacy considerations, model replication or proxying with upstream repositories like Hugging Face becomes essential. Enhancing model distribution security through signing, vulnerability scanning, and policy-based governance is often necessary. Additionally, introducing acceleration mechanisms such as P2P can significantly improve the efficiency of large model loading.
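As a small, hedged example of the "models as OCI artifacts" idea, the sketch below pushes and pulls a model file with the ORAS CLI from Python. The registry, project, artifact type, and file names are hypothetical, and the flags should be checked against the installed ORAS version; Harbor's replication, proxy-cache, signing, and scanning features then operate on the resulting artifact like any other OCI content.

```python
# Minimal sketch (assumptions: the `oras` CLI is installed and you are already
# logged in to the Harbor project; names and media types are placeholders).
import subprocess

ref = "harbor.example.com/ml-models/llm-demo:v1"  # hypothetical Harbor project/repo

# Push the model weights as an OCI artifact.
subprocess.run(
    [
        "oras", "push", ref,
        "--artifact-type", "application/vnd.example.ai.model",
        "model.safetensors:application/octet-stream",
    ],
    check=True,
)

# Pull the same artifact on a training or inference node.
subprocess.run(["oras", "pull", ref, "-o", "./models"], check=True)
```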
Speakers

Steven Ren

Senior Manager, Broadcom

Steven Zou

Staff II Engineer, VMware by Broadcom
Steven Zou is a senior engineer with years of experience in cloud computing and cloud-native technology. He is currently working as a Staff II engineer at VMware, focusing on cloud-native and Kubernetes-related platform services. In addition, he is a core maintainer of the CNCF open-source...
Salt Palace | Level 2 | 255 E
  AI + ML
  • Content Experience Level Any
 
