RLP: Power Management Based on a Latency-Aware Roofline Model

21/12/2023


Modern high-performance computing (HPC) clusters consume energy at a very large scale. This demand strains the surrounding infrastructure, such as cooling systems and power supply, and regulating power dissipation requires advanced infrastructure orchestration. Operating large-scale HPC clusters over the long term therefore calls for efficient power management strategies that reduce power consumption and, in turn, the carbon footprint.

A paper recently accepted at the prestigious 37th IEEE International Parallel and Distributed Processing Symposium (IPDPS) proposes a new performance model that tackles this challenge at node-level granularity. The model extends the well-known Roofline model by accounting for the cost of memory access latency, which allows power consumption to be managed more efficiently. A runtime built on the model identifies whether a workload is bandwidth-, latency-, or compute-bound and applies a corresponding energy optimization policy.
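To make the three-way classification concrete, the following is a minimal sketch and not the model from the paper. It uses the classic Roofline bound and adds one assumed heuristic: a workload below the ridge point counts as bandwidth-bound only if it actually saturates memory bandwidth, and as latency-bound otherwise. The names `NodeSpec`, `Sample`, `classify`, and the `bw_saturation` threshold are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Peak capabilities of one node (illustrative values, not measured)."""
    peak_flops: float      # FLOP/s
    peak_bandwidth: float  # bytes/s


@dataclass
class Sample:
    """Counter-derived measurements over one sampling interval (hypothetical)."""
    flops: float        # floating-point operations executed
    bytes_moved: float  # bytes transferred to/from main memory
    seconds: float      # length of the interval


def roofline_bound(intensity: float, node: NodeSpec) -> float:
    """Classic Roofline: attainable FLOP/s for a given arithmetic intensity."""
    return min(node.peak_flops, intensity * node.peak_bandwidth)


def classify(sample: Sample, node: NodeSpec, bw_saturation: float = 0.7) -> str:
    """Label an interval as compute-, bandwidth-, or latency-bound.

    Assumption of this sketch: below the ridge point, a workload that does
    not reach `bw_saturation` of peak bandwidth is treated as latency-bound.
    """
    if sample.bytes_moved == 0:
        return "compute-bound"  # no memory traffic at all

    intensity = sample.flops / sample.bytes_moved   # FLOP per byte
    ridge = node.peak_flops / node.peak_bandwidth   # ridge point of the roofline
    if intensity >= ridge:
        return "compute-bound"

    achieved_bw = sample.bytes_moved / sample.seconds
    if achieved_bw >= bw_saturation * node.peak_bandwidth:
        return "bandwidth-bound"
    return "latency-bound"


if __name__ == "__main__":
    node = NodeSpec(peak_flops=1e12, peak_bandwidth=1e11)
    interval = Sample(flops=1e10, bytes_moved=1e10, seconds=1.0)
    print(classify(interval, node))
```

A runtime could call such a classifier once per sampling interval and select a power policy from the returned label; which policy belongs to which class is left open here because the article does not specify it.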

The power management model is portable because it relies only on generic performance counters available on most HPC systems. It constructs the latency-aware roofline model (RLP) dynamically at runtime, which allows on-the-fly analysis and power management. The study evaluated real-world HPC workloads on server CPUs and a GPU in two scenarios: optimization with and without power capping. Compared to the system-default settings, RLP reduces energy-to-solution by up to 22%, and by up to 14.7% under power capping. Additionally, RLP outperforms the current state of the art in generality and effectiveness.

Parts of this work have been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 446185093 (H2M project).

The paper is available on IEEE Xplore.

Further information about the project can be found on the project website.