Name: Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines on K8s - Meenakshi Kaushik & Shiva Krishna Merla, NVIDIA
Start: 2024-11-15T16:00:00-0700
End: 2024-11-15T16:35:00-0700

In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

Friday November 15, 2024 4:00pm - 4:35pm MST

Salt Palace | Level 2 | 250 AD

In this session, we'll cover best practices for deploying, scaling, and managing LLM inference pipelines on Kubernetes (K8s). We'll explore common patterns like inference, retrieval-augmented generation (RAG), and fine-tuning. Key challenges addressed include: [1]. Minimizing initial inference latency with model caching [2] Optimizing GPU usage with efficient scheduling, multi-GPU/node handling, and auto-quantization [3] Enhancing security and management with RBAC, monitoring, auto-scaling, and support for air-gapped clusters We'll also demonstrate building customizable pipelines for inference, RAG, and fine-tuning, and managing them post-deployment. Solutions include [1] a lightweight standalone tool built using operator pattern and [2] KServe, a robust open-source AI inference platform. This session will equip you to effectively manage LLM inference pipelines on K8s, improving performance, efficiency, and security

Speakers

Meenakshi Kaushik

Product Management, Nvidia

Meenakshi Kaushik leads product management for NIM Operator and KServe.. Meenakshi is interested in the AI and ML space and is excited to see how the technology can enhance human well-being and productivity.

Shiva Krishna Merla

Senior Software Engineer, NVIDIA

Shiva Krishna Merla is a senior software engineer on the NVIDIA Cloud Native team where he works on GPU cloud infrastructure, orchestration and monitoring. He is focused on enabling GPU-accelerated DL and AI workloads in container orchestration systems such as Kubernetes and OpenShift... Read More →

best practices llm inference rag finetuning pdf

Friday November 15, 2024 4:00pm - 4:35pm MST
Salt Palace | Level 2 | 250 AD

AI + ML

Content Experience Level Beginner

KubeCon + CloudNativeCon North America 2024

Meenakshi Kaushik

Shiva Krishna Merla

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!