In-person, November 12-15

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: Times in this schedule are shown in Mountain Standard Time (UTC -7). The schedule is subject to change and session seating is available on a first-come, first-served basis.
⚡ Lightning Talks
Tuesday, November 12
 

5:45pm MST

⚡ Lightning Talk: Evaluating Scheduler Efficiency for AI/ML Jobs Using Custom Resource Metrics - Dmitry Shmulevich, NVIDIA
Tuesday November 12, 2024 5:45pm - 5:50pm MST
Kubernetes deployments frequently utilize custom resources beyond just CPU and memory, such as GPUs, which are essential for AI/ML workloads. While the Metrics API offers insights into CPU and memory usage at both the pod and node levels, it does not provide similar information for custom resources. Although resource requests for custom resources are specified in the pod spec, there is no visibility into how efficiently these resources are utilized at the node and cluster levels. To address this gap, we developed a Prometheus Node Resource Exporter tailored to monitor custom resources. Our case study focuses on evaluating the efficiency of Kubernetes schedulers when handling a high volume of AI/ML jobs, using GPU occupancy on the nodes as the primary indicator. In this lightning talk, we will present a comparative analysis of several scheduling frameworks based on the metrics collected by our custom exporter.
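The occupancy indicator described in the abstract can be illustrated with a minimal sketch. It assumes occupancy means the sum of pod requests for a custom resource on a node divided by that node's allocatable amount. The `Node` and `Pod` classes, the function names, and the use of `nvidia.com/gpu` as the resource name are hypothetical choices for illustration, not the speaker's exporter.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed resource name for illustration; any custom resource name works.
GPU_RESOURCE = "nvidia.com/gpu"


@dataclass
class Node:
    name: str
    allocatable: Dict[str, int]  # resource name -> allocatable quantity


@dataclass
class Pod:
    name: str
    node: str                    # name of the node the pod is scheduled on
    requests: Dict[str, int]     # resource name -> requested quantity


def node_occupancy(nodes: List[Node], pods: List[Pod],
                   resource: str = GPU_RESOURCE) -> Dict[str, float]:
    """Return requested/allocatable for the resource on each node."""
    requested = {n.name: 0 for n in nodes}
    for pod in pods:
        if pod.node in requested:
            requested[pod.node] += pod.requests.get(resource, 0)
    result = {}
    for n in nodes:
        capacity = n.allocatable.get(resource, 0)
        # Nodes without the resource report 0.0 instead of dividing by zero.
        result[n.name] = requested[n.name] / capacity if capacity else 0.0
    return result


def cluster_occupancy(nodes: List[Node], pods: List[Pod],
                      resource: str = GPU_RESOURCE) -> float:
    """Return total requested/total allocatable across all listed nodes."""
    names = {n.name for n in nodes}
    total = sum(n.allocatable.get(resource, 0) for n in nodes)
    used = sum(p.requests.get(resource, 0) for p in pods if p.node in names)
    return used / total if total else 0.0
```

A scheduler that packs jobs more tightly would show higher per-node values from `node_occupancy` for the same workload, which is the kind of comparison the abstract describes.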
Speakers
Dmitry Shmulevich

Software Engineer, NVIDIA
Dmitry is a software engineer at NVIDIA with over 25 years of experience in software development, specializing in cloud computing for the past eight years. Throughout his career, he has made significant contributions to various systems and projects across the cloud stack. He is also...
Hyatt Regency | Level 4 | Regency Ballroom BCD
  ⚡ Lightning Talks, Observability
  • Content Experience Level Any

6:05pm MST

⚡ Lightning Talk: Minimizing Data Loss Within the OpenTelemetry (OTel) Collector - Alex Kats, Capital One
Tuesday November 12, 2024 6:05pm - 6:10pm MST
The OTel Collector is meant to serve as a reliable and highly performant data pipeline. However, as a single component in a wider observability architecture, it is only as reliable as the downstream platforms and services it exports data to. The Collector has several built-in mechanisms that aim to minimize the impact of unhealthy downstream exporters, including an out-of-the-box sending queue with an additional configuration parameter for persistent queueing. A new component in the OTel contrib distribution, the Failover Connector, allows for dynamic routing or “failover” of telemetry data based on downstream exporter health. This significantly improves the data resiliency of the Collector, as telemetry data can be continuously exported to a set of stable secondary locations while the issues with the primary are resolved.
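The failover idea in the abstract can be sketched conceptually as priority-ordered routing. This is a simplified illustration, not the Failover Connector's implementation or configuration: the `FailoverRouter` class and its `export` method are hypothetical names, and the sketch retries the primary on every call rather than on a timed schedule.

```python
from typing import Any, Callable, List, Sequence

# An exporter is modeled as a callable that raises an exception on failure.
Exporter = Callable[[Any], None]


class FailoverRouter:
    """Send each item to the highest-priority exporter that accepts it."""

    def __init__(self, levels: Sequence[Exporter]) -> None:
        # levels[0] is the primary; later entries are fallbacks in order.
        self.levels: List[Exporter] = list(levels)

    def export(self, item: Any) -> int:
        """Export the item and return the index of the level that took it."""
        for index, exporter in enumerate(self.levels):
            try:
                exporter(item)
                return index
            except Exception:
                # This level is unhealthy; fall through to the next one.
                continue
        raise RuntimeError("all exporters failed")
```

Because the primary is tried first on each call, data returns to it as soon as it recovers, while items sent during the outage land in a secondary location instead of being dropped.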
Speakers
Alex Kats

Software Engineer, Capital One
Alex is a software engineer at Capital One. Alex has significant experience within the Observability space, with an emphasis on OpenTelemetry (OTel). Alex is a member of the OpenTelemetry community and has been contributing to various components within the OTel toolset.
Hyatt Regency | Level 4 | Regency Ballroom BCD
 
