In-person
November 12-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Mountain Standard Time (UTC -7). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Maintainer Track
Friday, November 15
 

11:00am MST

Bloomberg's Journey to Manage a Multi-Cluster Training Application with Karmada - Yifan Zhang & Wei-Cheng Lai, Bloomberg
Friday November 15, 2024 11:00am - 11:35am MST
Bloomberg provides an on-premises Data Science Platform using cloud-native software to support internal AI model training. It runs on Kubernetes spanning multiple data centers and featuring a diverse range of GPU types. However, managing such a large-scale and heterogeneous GPU environment poses many challenges, such as improving resource utilization, reducing operational costs, and scheduling workloads across different GPU types. In collaboration with the Karmada community, Bloomberg's Data Science Platform team has aimed to tackle these challenges by addressing multi-cluster batch job management problems. This talk will delve into the approaches the team has adopted, including:
- Intelligently scheduling GPU workloads across multiple clusters
- Using Karmada's resource interpreter to support Custom Resource Definitions (CRDs) on top of a multi-cluster architecture
- Building a highly available Karmada control plane
- Establishing a consistent training job submission interface
Speakers
Yifan Zhang

Software Engineer, Bloomberg
Yifan Zhang is a Software Engineer on Bloomberg’s Data Science Platform engineering team, which is focused on building a reliable machine learning platform to support the company’s internal model training in an interactive environment based on Jupyter notebooks. Yifan received...
Wei-Cheng Lai

Software Engineer, Bloomberg
Wei-Cheng Lai is a software engineer on Bloomberg's Data Science Platform Engineering team. With a background in machine learning and a Master of Engineering degree in Electrical and Computer Engineering from UIUC, he is now focusing on building ML training platforms on Kubernetes...
Hyatt Regency | Level 4 | Regency Ballroom B
 
