Lessons from scaling BPF to detect RDMA Device Drivers Bugs in real time
H.1308 (Rolin) | Day 1 | 11:00 - 11:30 | Speakers: Prankur Gupta, Maksim Samoilov
Abstract
Training large models requires significant resources, and the failure of any GPU or host can significantly prolong training times. At Meta, we observed that 17% of our jobs fail due to RDMA-related syscall errors, which arise from bugs in the RDMA driver code. Unlike other parts of the kernel, RDMA-related syscalls are opaque, and the errors create a mismatch between the application's and the kernel's view of hardware resources. Because of this opacity and mismatch, existing observability tools provided limited visibility, and DevOps found it challenging to triage failures. We required a new scalable framework to analyze kernel state and identify the cause of this mismatch.
Direct approaches, such as tracing the kernel calls and capturing the associated metadata, turned out to be prohibitively expensive. In this talk, we will describe the set of optimizations used to scale the tracking of kernel state and the map-based systems designed to efficiently export relevant state without impacting production workloads.
Speakers
Seasoned software engineer with 12+ years of experience, specializing in robust, scalable solutions with a focus on reliability and observability. Prankur holds a master’s degree from Stony Brook University, where his academic focus was on Distributed and Parallel computing. Outside of work, Prankur enjoys mentoring and pursuing his passions for gaming and sports. He looks forward to sharing insights and engaging with fellow attendees on large-scale system reliability and AI-driven environments.
Production Engineering Manager / TLM at Meta, Network Infrastructure.
Previously led the kernel and host infrastructure team at Yandex.
