Attending this event?
In-person
November 12-15
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Friday November 15, 2024 11:00am - 11:35am MST
Bloomberg provides an on-premises Data Science Platform using cloud-native software to support internal AI model training. It runs on Kubernetes spanning multiple data centers and featuring a diverse range of GPU types. However, managing such a large-scale and heterogeneous GPU environment poses many challenges, such as improving resource utilization, reducing operational costs, and scheduling workloads across different GPU types. In collaboration with the Karmada community, Bloomberg's Data Science Platform team has aimed to tackle these challenges by addressing multi-cluster batch job management problems. This talk will delve into the approaches the team has adopted, including:
- Intelligently scheduling GPU workloads across multiple clusters
- Using Karmada's resource interpreter to support Custom Resource Definitions (CRDs) on top of a multi-cluster architecture
- Building a highly available Karmada control plane
- Establishing a consistent training job submission interface
Speakers
Yifan Zhang

Software Engineer, Bloomberg
Yifan Zhang is a Software Engineer on Bloomberg’s Data Science Platform engineering team, which is focused on building a reliable machine learning platform to support the company’s internal model training in an interactive environment based on Jupyter notebooks. Yifan received...
Wei-Cheng Lai

Software Engineer, Bloomberg
Wei-Cheng Lai is a software engineer on Bloomberg's Data Science Platform Engineering team. With a background in machine learning and a Master of Engineering degree in Electrical and Computer Engineering from UIUC, he is now focusing on building ML training platforms on Kubernetes...
Hyatt Regency | Level 4 | Regency Ballroom B
