Track Energy & Emissions of User Jobs on HPC/AI Platforms using CEEMS
H.1308 (Rolin) | Day 2 | 11:30 - 11:55 | Speakers: Mahendra Paipuri
Abstract
With the rapid acceleration of ML/AI research in the last couple of years, the already energy-hungry HPC platforms have become even more demanding. A major part of this energy consumption is due to users’ workloads, so the overall energy consumption of the platforms can only be reduced with the participation of end users. However, most HPC platforms do not provide energy consumption metrics or performance metrics out of the box, which gives end users little incentive to optimize their workloads.
The Compute Energy & Emissions Monitoring Stack (CEEMS) has been designed to address this issue. CEEMS can report the energy consumption and equivalent emissions of user workloads in real time for SLURM (HPC), OpenStack (Cloud) and Kubernetes platforms alike. It leverages the Linux perf subsystem and eBPF to monitor the performance metrics of applications, which helps end users quickly identify bottlenecks in their workflows and optimize them to reduce their energy and carbon footprint. CEEMS supports eBPF-based continuous profiling and is the first monitoring stack to support continuous profiling on HPC platforms. Another advantage of CEEMS is that it can systematically monitor all the jobs on the platform without end users having to modify their workflows or codes.
Besides CPU energy usage, CEEMS supports reporting the energy usage and performance metrics of workloads on NVIDIA and AMD GPU accelerators. It is built around prominent open-source tools in the observability ecosystem, such as Prometheus and Grafana. It is designed to be extensible and allows HPC center operators to easily define the energy estimation rules for user workloads based on the underlying hardware. CEEMS monitors I/O and network metrics in a file-system-agnostic manner, allowing it to work with any parallel file system used by HPC platforms. The talk will conclude by showing how CEEMS monitoring is used on the Jean-Zay HPC platform, which has more than 2000 nodes and a daily job churn rate of around 20k jobs.
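The abstract does not spell out what an energy estimation rule looks like. As a rough illustration only, the sketch below assumes one simple rule: the energy measured for a node over an interval is split across the jobs on that node in proportion to their CPU time, and then converted to equivalent emissions with a grid emission factor. The names `JobSample`, `attribute_energy` and `emissions_grams`, and the proportional rule itself, are hypothetical and are not taken from CEEMS.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    """CPU time used by one job on a node during one interval (hypothetical type)."""
    job_id: str
    cpu_seconds: float


def attribute_energy(node_energy_joules: float, samples: list[JobSample]) -> dict[str, float]:
    """Split node energy across jobs in proportion to CPU time.

    Assumed rule for illustration; a real rule would depend on the hardware.
    """
    total_cpu = sum(s.cpu_seconds for s in samples)
    if total_cpu == 0:
        # No job used CPU time in this interval, so nothing is attributed.
        return {s.job_id: 0.0 for s in samples}
    return {
        s.job_id: node_energy_joules * s.cpu_seconds / total_cpu
        for s in samples
    }


def emissions_grams(energy_joules: float, factor_g_per_kwh: float) -> float:
    """Convert energy to equivalent emissions (1 kWh = 3.6e6 J)."""
    return energy_joules / 3.6e6 * factor_g_per_kwh


if __name__ == "__main__":
    samples = [JobSample("job-a", 30.0), JobSample("job-b", 10.0)]
    per_job = attribute_energy(3.6e6, samples)  # node used 1 kWh in the interval
    for job_id, joules in per_job.items():
        # The emission factor of 50 gCO2e/kWh is an arbitrary example value.
        print(job_id, joules, emissions_grams(joules, 50.0))
```

A proportional split is the simplest possible choice; operators would typically want different rules per hardware type, for example to account for GPU power or idle node power.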
Speakers
Mahendra has a doctorate in applied mathematics from Universidade de Lisboa, Portugal, and an M.Sc. in computational sciences from Universitat Politècnica de Catalunya, Barcelona.
After his doctorate, he was a postdoctoral researcher at Université Gustave Eiffel, working on an ERC project focused on macroscopic modelling of urban transportation networks. Later, he worked for INRIA as a research engineer within SKAO on software-hardware co-design activities for SDP.
Since the beginning of 2022, he has been working for CNRS as a permanent research engineer. He spent more than 3 years at the national HPC center of CNRS as a system/solutions architect. Mahendra joined CDSP in October 2025 to lead the digital projects team.
