Name: Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster - Yuichiro Ueno & Toru Komatsu, Preferred Networks, Inc.
Start: 2024-11-14T14:30:00-0700
End: 2024-11-14T15:05:00-0700

In-person
November 12-15


Thursday November 14, 2024 2:30pm - 3:05pm MST

Salt Palace | Level 1 | Grand Ballroom A

Today, storage technologies play a fundamental role in AI/ML. Read performance is essential for moving datasets quickly from storage to AI accelerators. However, accelerator performance is improving faster than I/O performance, so I/O often becomes the bottleneck in training. Because Kubernetes schedules pods across multiple nodes, using node-local storage effectively is a challenge. To address this, we introduce a distributed cache system built on top of node-local storage and designed for AI/ML workloads. This cache system has been deployed on our on-premises, multi-tenant Kubernetes cluster with 1024+ GPUs. Over two years of operating this cache system, we have overcome numerous hurdles across several components, including the I/O library, load balancers, and the storage backend. We will share the challenges and the solutions we implemented, which led to a system delivering 50+ GB/s throughput and less than 2 ms latency.

Speakers

Toru Komatsu

Engineer, Preferred Networks, Inc.

Toru is a machine learning platform engineer at Preferred Networks in Japan. He is the creator and lead developer of youki, an OCI runtime written in Rust, and a maintainer of the OCI Runtime Specification. He also serves as a reviewer for runwasi and is working toward a world that combines containers and Wasm. He is a member of the Kubernetes org and is especially interested in...

Yuichiro Ueno

Engineer, Preferred Networks, Inc.

He is currently a machine learning platform engineer at Preferred Networks in Japan. His research and engineering interests include a range of high-performance computing (distributed deep learning, networking/RDMA, and storage technologies), performance engineering, and Kubernete...

KubeCon NA 2024 Distributed Cache Empowers AI ML Workloads on Kubernetes Cluster pdf

Data Processing + Storage

Content Experience Level Intermediate

KubeCon + CloudNativeCon North America 2024
