In-person
November 12-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
AI + ML
Thursday, November 14
 

11:55am MST

Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet - Andrey Velichkevich, Apple & Yuki Iwai, CyberAgent, Inc.
Thursday November 14, 2024 11:55am - 12:30pm MST
Running model training on Kubernetes is challenging due to the complexity of AI/ML models, large training datasets, and various distributed strategies like data and model parallelism. It is crucial to configure failure handling, success criteria, and gang-scheduling for large-scale distributed training to ensure fault tolerance and elasticity. This talk will introduce the new Kubeflow TrainJob API, which democratizes distributed training and LLM fine-tuning on Kubernetes. The speakers will demonstrate how TrainJob integrates with Kubernetes JobSet to ensure scalable and efficient AI model training with simplified Python experience for Data Scientists. Additionally, they will explain the innovative concept of reusable and extendable training runtimes within TrainJob. The speakers will highlight how these capabilities empower data scientists to rapidly iterate on their ML development, making Kubernetes more accessible and beneficial for the entire ML ecosystem.
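The gang semantics the abstract describes can be pictured as plain data. The sketch below builds a dict loosely shaped like a JobSet manifest, showing the two knobs highlighted above: gang size (replicas that must be admitted together) and a failure policy bounding restarts. The field names approximate the `jobset.x-k8s.io` API but are illustrative, not the authoritative schema.

```python
# Illustrative sketch only -- not the exact JobSet schema.
def make_train_gang(name, num_nodes, image, max_restarts=3):
    """Build a JobSet-like spec for a gang of training workers."""
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {
            # Bound restarts so a crash-looping trainer fails fast.
            "failurePolicy": {"maxRestarts": max_restarts},
            "replicatedJobs": [{
                "name": "trainer",
                "replicas": num_nodes,  # the gang: all-or-nothing admission
                "template": {"spec": {"template": {"spec": {
                    "containers": [{"name": "trainer", "image": image}],
                }}}},
            }],
        },
    }
```

TrainJob's runtimes aim to hide this plumbing behind a Python call, so a data scientist supplies only the training function and node count.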
Speakers
Andrey Velichkevich
Senior Software Engineer, Apple
Andrey Velichkevich is a Senior Software Engineer at Apple and a key contributor to the Kubeflow open-source project. He is a member of the Kubeflow Steering Committee and a co-chair of the Kubeflow AutoML and Training WG. Additionally, Andrey is an active member of the CNCF WG AI. He…
Yuki Iwai
Software Engineer, CyberAgent, Inc.
Yuki is a Software Engineer at CyberAgent, Inc. He works on the internal platform for machine-learning applications and high-performance computing. He is currently a Technical Lead for the Kubeflow AutoML / Training WG. He is also an active member of the Kubernetes WG Batch and a Kubernetes…
Thursday November 14, 2024 11:55am - 12:30pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

2:30pm MST

Unlocking Potential of Large Models in Production - Yuan Tang, Red Hat & Adam Tetelman, NVIDIA
Thursday November 14, 2024 2:30pm - 3:05pm MST
The recent paradigm shift from traditional ML to GenAI and LLMs has brought with it a new set of non-trivial LLMOps challenges around deployment, scaling, and operations that make building an inference platform to meet all business requirements an unsolved problem. This talk highlights these new challenges along with best practices and solutions for building out large, scalable, and reliable inference platforms on top of cloud native technologies such as Kubernetes, Kubeflow, KServe, and Knative. Which tools help effectively benchmark and assess the quality of an LLM? What types of storage and caching solutions enable quick auto-scaling and model downloads? How can you ensure your model is optimized for the specialized accelerators running in your cluster? How can A/B testing or rolling upgrades be accomplished with limited compute? What exactly do you monitor in an LLM? In this session we will use KServe as a case study to answer these questions and more.
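The A/B testing question above boils down to traffic splitting between model revisions. As a toy illustration of the idea (in KServe the serving layer handles this declaratively, not application code), a weighted router can be sketched as:

```python
import random

def route_request(canary_percent, rng=random.random):
    """Send roughly canary_percent of traffic to the new model revision.
    A toy sketch of canary traffic splitting; real rollouts also gate on
    error rates and latency before shifting more weight."""
    return "canary" if rng() * 100 < canary_percent else "stable"
```

Starting at a small canary weight lets a new LLM revision be validated on real traffic without doubling the GPU footprint.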
Speakers
Yuan Tang
Principal Software Engineer, Red Hat
Yuan is a Principal Software Engineer at Red Hat, working on OpenShift AI. Previously, he led AI infrastructure and platform teams at various companies. He holds leadership positions in open source projects, including Argo, Kubeflow, and Kubernetes. He's also a maintainer and…
Adam Tetelman
Principal Product Architect, NVIDIA
Adam Tetelman is a principal architect at NVIDIA leading cloud native initiatives and CNCF engagements across the company, building inference platforms for NVIDIA AI Enterprise and DGX Cloud. He has degrees in computational robotics, computer & systems engineering, and cognitive science…
Thursday November 14, 2024 2:30pm - 3:05pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

3:25pm MST

Unlocking the Future of GPU Scheduling in Kubernetes with Reinforcement Learning - Nikunj Goyal, Adobe Systems & Aditi Gupta, Disney Plus Hotstar
Thursday November 14, 2024 3:25pm - 4:00pm MST
Scaling multi-GPU setups with Kubernetes for large-scale ML projects is a hot topic in both the AI and cloud communities. While Kubernetes can provide computing power by scheduling GPU nodes, issues like resource fragmentation and low utilization hurt performance and drive up costs. Why reinforcement learning (RL) in particular? Unlike other algorithms, RL shines in its unique ability to continuously adapt to changing environments and to handle complex, multi-dimensional objectives, making it particularly suitable for the dynamic and heterogeneous nature of Kubernetes clusters. In this talk, we will explore the current landscape of GPU scheduling and some state-of-the-art RL algorithms proposed for scheduling, and dive deep into their impact on Kubernetes and the possible use of RLHF. We hope the audience gains insights into these new ways of scheduling GPUs on Kubernetes.
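To make the RL framing concrete, here is a minimal epsilon-greedy sketch of the core decision: pick a GPU for a job, usually exploiting a learned per-GPU score, occasionally exploring. This is an illustration of the idea only; it is not from the talk, and real RL schedulers learn over far richer state.

```python
import random

def pick_gpu(free_mem, job_mem, q_values, epsilon=0.1, rng=random.random):
    """Toy epsilon-greedy GPU choice.
    free_mem: free memory per GPU; q_values: learned score per GPU."""
    feasible = [i for i, m in enumerate(free_mem) if m >= job_mem]
    if not feasible:
        return None  # fragmentation: no single GPU can fit the job
    if rng() < epsilon:
        return random.choice(feasible)               # explore
    return max(feasible, key=lambda i: q_values[i])  # exploit
```

The reward signal (e.g. utilization achieved, fragmentation avoided) is what distinguishes this from a static bin-packing heuristic: the q_values adapt as the cluster's workload mix changes.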
Speakers
Aditi Gupta
Software Developer, Disney+ Hotstar
I'm Aditi Gupta, a Software Development Engineer at Disney+ Hotstar. A graduate of Asia's largest tech college for women, Indira Gandhi Delhi Technical University, I've been deeply immersed in cloud-native technologies and AI/ML advancements. Skilled in containerisation, micro-service…
Nikunj Goyal
Developer, AI and Machine Learning Specialist, Adobe Systems
Hi, I am Nikunj Goyal, a developer at Adobe and a Maths major from IIT Roorkee. I have been working with AI and Machine Learning for some time, mainly with Generative AI and graph-based methods. I am a core part of the Text-to-vector generation team at my org and previously worked…
Thursday November 14, 2024 3:25pm - 4:00pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML

4:30pm MST

Which GPU Sharing Strategy Is Right for You? a Comprehensive Benchmark Study Using DRA - Kevin Klues & Yuan Chen, NVIDIA
Thursday November 14, 2024 4:30pm - 5:05pm MST
Dynamic Resource Allocation (DRA) is one of the most anticipated features to ever make its way into Kubernetes. It promises to revolutionize the way hardware devices are consumed and shared between workloads. In particular, DRA unlocks the ability to manage heterogeneous GPUs in a unified and configurable manner without the need for awkward solutions shoehorned on top of the existing device plugin API. In this talk, we use DRA to benchmark various GPU sharing strategies including Multi-Instance GPUs, Multi-Process Service (MPS), and CUDA Time-Slicing. As part of this, we provide guidance on the class of applications that can benefit from each strategy as well as how to combine different strategies in order to achieve optimal performance. The talk concludes with a discussion of potential challenges, future enhancements, and a live demo showcasing the use of each GPU sharing strategy with real-world applications.
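The three strategies named above trade isolation against utilization in different ways. As a rough rule-of-thumb sketch (paraphrasing common guidance, not the benchmark conclusions of this talk), the choice can be framed as a simple decision function:

```python
def suggest_sharing_strategy(needs_memory_isolation, latency_sensitive, bursty):
    """Heuristic mapping from workload traits to a GPU sharing strategy.
    Illustrative only; a real choice should be driven by benchmarks."""
    if needs_memory_isolation:
        return "MIG"           # hardware partitions: strict fault/memory isolation
    if latency_sensitive:
        return "MPS"           # concurrent kernels, avoids context-switch stalls
    if bursty:
        return "time-slicing"  # cheap oversubscription for tolerant, spiky jobs
    return "exclusive"         # default: one workload per GPU
```

DRA makes it practical to mix these per-device in one cluster, rather than committing a whole node pool to a single sharing mode.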
Speakers
Kevin Klues
Distinguished Engineer, NVIDIA
Kevin Klues is a distinguished engineer on the NVIDIA Cloud Native team. Kevin has been involved in the design and implementation of a number of Kubernetes technologies, including the Topology Manager, the Kubernetes stack for Multi-Instance GPUs, and Dynamic Resource Allocation (DRA…
Yuan Chen
Principal Software Engineer, NVIDIA
Yuan Chen is a Principal Software Engineer at NVIDIA, working on building NVIDIA GPU Cloud for AI. He served as a Staff Software Engineer at Apple from 2019 to 2024, where he contributed to the development of Apple's Kubernetes infrastructure. Yuan has been an active code contributor…
Thursday November 14, 2024 4:30pm - 5:05pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

5:25pm MST

Managing and Distributing AI Models Using OCI Standards and Harbor - Steven Zou & Steven Ren, VMware by Broadcom
Thursday November 14, 2024 5:25pm - 6:00pm MST
Just as container images are vital to cloud-native technology, AI models are crucial to AI technology. Effectively, conveniently, and safely managing, maintaining, and distributing AI models is critical for supporting workflows like AI model training, inference, and application deployment. This presentation explores AI model management based on OCI standards and the Harbor project: standardizing AI model structures and characteristics using OCI specifications, and using extension mechanisms such as OCI References to link datasets and dependencies. When large models require efficient loading or raise privacy considerations, replicating models or proxying upstream repositories like Hugging Face becomes essential. Enhancing model distribution security through signing, vulnerability scanning, and policy-based governance is often necessary, and acceleration mechanisms such as P2P can significantly improve the efficiency of loading large models.
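What makes OCI artifacts work for models is the same property that makes them work for images: content addressing. A minimal sketch of the idea, computing a `sha256:<hex>` digest of a manifest-like object (real registries digest the exact serialized bytes; the canonical-JSON step here is a simplification for illustration):

```python
import hashlib
import json

def content_digest(manifest):
    """Content-address an object the way OCI registries such as Harbor
    address manifests and blobs: sha256 over its serialized form.
    Simplified sketch -- real digests cover the exact stored bytes."""
    data = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return "sha256:" + hashlib.sha256(data).hexdigest()
```

Because the digest is derived from content, replicas, proxies, and P2P peers can all verify a model blob independently of where it was fetched from, which is what makes the replication and acceleration schemes above safe.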
Speakers
Steven Ren
Senior Manager, Broadcom
Steven Zou
Staff II Engineer, VMware by Broadcom
Steven Zou is a senior engineer with years of experience in cloud computing and cloud-native technology. He is currently working as a Staff II engineer at VMware, focusing on cloud-native and Kubernetes-related platform services. In addition, he is a core maintainer of the CNCF open-source…
Thursday November 14, 2024 5:25pm - 6:00pm MST
Salt Palace | Level 1 | Hall DE
  AI + ML
  • Content Experience Level Any

5:25pm MST

Navigating Failures in Pods with Devices: Challenges and Solutions - Sergey Kanzhelev, Google & Mrunal Patel, Red Hat
Thursday November 14, 2024 5:25pm - 6:00pm MST
Pods no longer run with just CPU and memory. We provision GPUs and network cards, and request special placement of those devices and of allocated memory. The more efficient or effective you want your setup to be, the more complicated those device requirements become, and the more likely you are to hit an edge case Kubernetes has not accounted for yet. Come to this talk to learn from Node maintainers about some of those shortcomings in Kubernetes. If you are just starting with AI/ML and devices, you will learn what to expect; if you have lots of experience, you may still learn new things. With the increased focus on AI/ML workloads, highlighting these scenarios is important, and as Kubernetes plans to fix these problems, you can give feedback on what would work best for you.
Speakers
Sergey Kanzhelev
Staff Software Engineer, Google
Sergey Kanzhelev is a seasoned open source and cloud native maintainer working actively on Kubernetes. Sergey serves as co-chair of SIG Node and is one of the founders of OpenTelemetry. He works on the engineering aspects of software and their practical application. He is contributing…
Mrunal Patel
Distinguished Engineer, Red Hat
Mrunal Patel is a Senior Principal Software Engineer at Red Hat working on containers for OpenShift. He is a maintainer of runc/libcontainer and the OCI runtime specification. He started the CRI-O runtime. He is a SIG Node chair and tech lead.
Thursday November 14, 2024 5:25pm - 6:00pm MST
Salt Palace | Level 2 | 250
  AI + ML
  • Content Experience Level Any
 
