Auto-instrumentation for GPU performance using eBPF

K.4.201 | Day 1 | 16:00 - 16:20 | Speakers: Annanay Agarwal, Marc Tuduri

Abstract

Modern AI workloads rely on large GPU fleets, and because GPUs are expensive, using them efficiently is crucial. However, gathering telemetry from these workloads to optimise performance is challenging: it typically requires manual instrumentation, which adds performance overhead. Further, the resulting telemetry is not in a standardised format that commonly used monitoring tools such as Prometheus can consume.

This talk explores using eBPF to capture the CUDA calls that applications make to GPUs, including kernel launches and memory allocations. Data from these probes can be exported as Prometheus metrics, enabling detailed analysis of kernel launch patterns and the associated memory usage. The approach has two main benefits: eBPF imposes minimal overhead, and it requires no intrusive instrumentation of the workload. Our implementation is open source and available on GitHub.
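To make the probe-to-metrics step concrete, here is a minimal sketch of the userspace side only. It is not the speakers' implementation. It assumes that probes on CUDA runtime calls such as `cudaLaunchKernel` and `cudaMalloc` deliver one record per call. The `CudaEvent` shape, the helper functions, and the metric names `cuda_kernel_launches_total` and `cuda_malloc_bytes_total` are all hypothetical, chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CudaEvent:
    """One record per probed CUDA call (hypothetical shape, not a real ABI)."""
    pid: int
    call: str             # e.g. "cudaLaunchKernel" or "cudaMalloc"
    kernel: str = ""      # resolved kernel symbol, set for launches only
    size_bytes: int = 0   # requested size, set for allocations only


def aggregate(events):
    """Fold raw events into launch counts per (pid, kernel) and bytes per pid."""
    launches = Counter()
    alloc_bytes = Counter()
    for ev in events:
        if ev.call == "cudaLaunchKernel":
            launches[(ev.pid, ev.kernel)] += 1
        elif ev.call == "cudaMalloc":
            alloc_bytes[ev.pid] += ev.size_bytes
    return launches, alloc_bytes


def render_prometheus(launches, alloc_bytes):
    """Render the counters in the Prometheus text exposition format."""
    lines = ["# TYPE cuda_kernel_launches_total counter"]
    for (pid, kernel), n in sorted(launches.items()):
        lines.append(
            f'cuda_kernel_launches_total{{pid="{pid}",kernel="{kernel}"}} {n}'
        )
    lines.append("# TYPE cuda_malloc_bytes_total counter")
    for pid, total in sorted(alloc_bytes.items()):
        lines.append(f'cuda_malloc_bytes_total{{pid="{pid}"}} {total}')
    return "\n".join(lines) + "\n"
```

In a real deployment the events would arrive from an eBPF ring buffer or perf buffer rather than a Python list, and the rendered text would be served on an HTTP endpoint for Prometheus to scrape. Both parts are omitted here to keep the sketch self-contained.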

Speakers

Annanay Agarwal
Marc Tuduri
