Topics
Suitable Performance Optimization Approaches for the SPMD IR
In high-performance computing (HPC), compute cluster systems are getting increasingly larger and their architectures more and more heterogeneous (multiple compute nodes with CPUs, GPUs, …). The ever-growing computational demands (for example, from large simulations or AI models) that drive this trend have also led to a multitude of parallel programming models. These programming models are used in addition to programming languages to implement software that uses the fast and parallel HPC hardware efficiently and effectively. To ease the development of tools and compiler passes for parallel programming models following the single program, multiple data (SPMD) principle, the SPMD IR (intermediate representation) was introduced. It addresses the problem that tools often support only one model or implement the necessary abstraction internally, which limits extensibility and reusability. The SPMD IR's prototype is implemented in MLIR/LLVM and supports MPI, NCCL, SHMEM, and NVSHMEM. While its usefulness has been shown for the verification of collective communication and for data race detection, the question remains which performance optimization approaches and compiler passes in modern compiler systems are suitable and can make use of it, and which are inapplicable by design.
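To give a flavor of the verification use case mentioned above, the following toy Python sketch checks that all ranks issue a matching sequence of collectives. The trace format and operation names are invented for illustration; they are not the SPMD IR's actual representation, which operates on compiler IR rather than runtime traces.

```python
# Hypothetical sketch: check that all ranks issue the same sequence of
# collectives, the kind of cross-rank analysis a unified SPMD IR enables.
from itertools import zip_longest

def verify_collectives(traces):
    """traces: one list of (operation, element_count) pairs per rank.
    Returns the index of the first mismatching step, or None if all agree."""
    for i, ops in enumerate(zip_longest(*traces)):
        if len(set(ops)) != 1:   # ranks disagree (or one rank ran out of ops)
            return i
    return None

ranks = [
    [("bcast", 4), ("allreduce", 8)],   # rank 0
    [("bcast", 4), ("allreduce", 8)],   # rank 1
    [("bcast", 4), ("reduce", 8)],      # rank 2: mismatching collective
]
print(verify_collectives(ranks))  # → 1
```

Because the SPMD IR abstracts over MPI, NCCL, SHMEM, and NVSHMEM, an analysis of this shape would only need to be written once instead of once per model.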
This seminar thesis is supposed to conduct a systematic literature review of approaches that perform performance optimization in the context of any of the supported programming models, or that are part of LLVM or GCC as compiler passes, and that could leverage the additional information provided by the SPMD IR. After analyzing and understanding the SPMD IR, the student is supposed to give an overview of the approaches found and discuss their applicability to the SPMD IR.
Kind of topic: overview
Supervisor: Semih Burak
Emerging MLIR Dialects and their Suitability for the SPMD IR
In high-performance computing (HPC), compute cluster systems are getting increasingly larger and their architectures more and more heterogeneous (multiple compute nodes with CPUs, GPUs, …). The ever-growing computational demands (for example, from large simulations or AI models) that drive this trend have also led to a multitude of parallel programming models. These programming models are used in addition to programming languages to implement software that uses the fast and parallel HPC hardware efficiently and effectively. To ease the development of tools and compiler passes for parallel programming models following the single program, multiple data (SPMD) principle, the SPMD IR (intermediate representation) was introduced. It addresses the problem that tools often support only one model or implement the necessary abstraction internally, which limits extensibility and reusability. The prototype of the SPMD IR is implemented on top of MLIR/LLVM and supports MPI, NCCL, SHMEM, and NVSHMEM. MLIR is evolving rapidly, with new dialects continuously emerging and existing ones being extended on a regular basis. In addition to the core upstream dialects, the SPMD IR introduces its own dedicated SPMD dialect.
This seminar thesis aims to conduct a systematic literature review of existing MLIR dialects. The student will analyze their usage and assess their applicability in combination with the SPMD dialect. As a representative example of a recently introduced dialect, the OpenSHMEM dialect should be examined in detail, and it should be discussed whether its concepts can be integrated into or inspire extensions of the SPMD IR.
Kind of topic: overview
Supervisor: Semih Burak
Quantifying Energy Consumption and Carbon Footprint of Large Language Model Inference
Large Language Models (LLMs) have shown remarkable capabilities across a wide spectrum of tasks, attracting many users worldwide in the private sector as well as in academia and industry. While proprietary offerings from OpenAI, Anthropic, and Google deliver state-of-the-art performance and cutting-edge features, many public-sector organizations, universities, research institutes, and sensitive domains such as healthcare are constrained by data-privacy regulations that prohibit the use of commercial APIs for personal or patient data. To meet these stricter requirements, an increasing number of institutions are deploying locally hosted LLMs based on open-source or open-weight models, thereby providing a data-sovereign alternative to commercial services. While inference speed and throughput remain important, the energy consumption and resulting carbon emissions of these deployments have emerged as key sustainability issues. Further, inefficient prompting, such as overly long or poorly structured queries, can markedly raise computational load, driving up both energy consumption and operating costs. Implementing per-prompt and regular (weekly or monthly) reporting of energy consumption, emissions, and costs can make users aware of the environmental and monetary impact of their LLM usage and encourage more sustainable prompting behavior.
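The per-prompt reporting described above can be as simple as multiplying measured generation time by power draw and applying a carbon-intensity and price factor. The following back-of-the-envelope sketch illustrates this; all constants (GPU power, PUE, carbon intensity, electricity price) are illustrative assumptions, not measured values.

```python
# Hypothetical per-prompt report; all default numbers are invented for
# illustration and would be measured or configured in a real deployment.
def prompt_report(gen_seconds, gpu_watts=350.0, pue=1.4,
                  g_co2_per_kwh=380.0, eur_per_kwh=0.30):
    """Estimate energy, emissions, and cost of one inference request."""
    kwh = gpu_watts * pue * gen_seconds / 3.6e6   # W * s -> kWh, incl. cooling overhead
    return {
        "energy_kWh": kwh,
        "co2_g": kwh * g_co2_per_kwh,
        "cost_eur": kwh * eur_per_kwh,
    }

r = prompt_report(gen_seconds=12.0)
print(f"{r['energy_kWh'] * 1000:.2f} Wh, {r['co2_g']:.2f} g CO2, "
      f"{r['cost_eur'] * 100:.3f} ct")
```

Real methodologies surveyed in the thesis refine each factor here, e.g., measuring power per device via NVML instead of assuming a constant draw, or using time-varying grid carbon intensity.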
This seminar thesis provides a systematic survey comparing several existing approaches and methodologies for quantifying the energy consumption and carbon footprint of various LLMs. It evaluates reported inference performance, energy demand, and emissions, and discusses the main factors influencing these metrics such as model size, hardware configuration, batch size, etc.
Kind of topic: overview
Supervisor: Jannis Klinkenberg
Improving Fine-Grained Task Parallelism and Dynamic Load Balancing on Multi-Socket Many-Core Systems
The continuous growth of core counts in modern cloud and HPC systems, which span across sockets and Non-Uniform Memory Access (NUMA) domains, poses increasing challenges for efficient application parallelization. To address this, shared-memory programming paradigms have been developed to simplify parallel programming by abstracting low-level details. A prominent example is OpenMP, which supports both work-sharing and task-based parallelism. In particular, task-based models are well suited for irregular, recursive, or complex workloads that can exploit fine-grained parallelism. However, as system complexity and core counts grow, managing synchronization and shared-memory access across many threads becomes increasingly difficult. Task creation, queuing, and scheduling introduce additional overheads, especially in fine-grained execution scenarios. Traditional OpenMP runtime systems such as GNU OpenMP and LLVM OpenMP often struggle with scalability due to their reliance on synchronization mechanisms like locks. Recent research has therefore explored alternative approaches, including lock-free and lock-less synchronization techniques as well as NUMA-aware scheduling strategies to improve efficiency.
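The decentralized queue designs mentioned above can be illustrated with a minimal work-stealing scheme: each worker owns a deque, pops tasks from its hot (LIFO) end, and steals from the cold (FIFO) end of a victim's deque when its own runs empty. This Python sketch is a toy illustration of the concept, not a production OpenMP runtime; real implementations use lock-free deques and NUMA-aware victim selection.

```python
# Toy work-stealing scheduler: one deque per worker, local LIFO pops,
# FIFO steals. Tasks are plain callables and do not spawn new tasks.
import threading
from collections import deque

def run(num_workers, tasks):
    queues = [deque() for _ in range(num_workers)]
    for i, t in enumerate(tasks):                 # round-robin initial distribution
        queues[i % num_workers].append(t)
    results, lock = [], threading.Lock()

    def worker(wid):
        while True:
            task = None
            try:
                task = queues[wid].pop()          # hot local end (LIFO)
            except IndexError:
                for victim in range(num_workers): # steal from cold end (FIFO)
                    try:
                        task = queues[victim].popleft()
                        break
                    except IndexError:
                        continue
            if task is None:
                return                            # all queues empty: done
            with lock:
                results.append(task())

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

out = run(4, [lambda i=i: i * i for i in range(100)])
print(sorted(out) == [i * i for i in range(100)])  # → True
```

The design point to note is that contention only arises on steals, which are rare when work is well balanced, whereas a single shared queue is contended on every task operation.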
This seminar thesis aims to provide an in-depth overview of advancements in OpenMP task scheduling, including key technical concepts and implementation strategies. Furthermore, it will critically evaluate and compare different task scheduling approaches and policies, particularly those used in GOMP and LLVM OpenMP, with respect to their performance and scalability.
Kind of topic: dive-in
Supervisor: Jannis Klinkenberg
Parallel File System Parameter Tuning
Modern high-performance computing systems need to provide highly performant file systems to their users. For this reason, parallel file systems are used to enable concurrent accesses by a multitude of users and processes. These file systems typically support a wide range of configuration parameters. Since different use cases require different parameter settings for optimal performance, determining the best configuration for a given system is not trivial.
For this seminar topic, the student will have to perform an extensive literature survey of different parameter tuning mechanisms. Several heuristics and automatic tuning processes have been proposed in previous research, and the student should evaluate these mechanisms, especially with an eye to their applicability to a university cluster like RWTH Aachen's CLAIX. There is an opportunity to try out these mechanisms with the ad-hoc file system BeeOND.
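One of the simplest automatic tuning mechanisms of the kind surveyed here is hill climbing over a single parameter, such as the stripe count. The sketch below uses an invented synthetic throughput curve in place of a real I/O benchmark run; in practice, measure() would execute something like IOR against the configured file system.

```python
# Hypothetical hill-climbing tuner for one parameter (stripe count).
# measure() is a stand-in for a real benchmark; its curve is invented.
def measure(stripe_count):
    # Synthetic throughput model: gains from striping, then contention.
    return 100 * stripe_count / (1 + 0.02 * stripe_count ** 2)

def hill_climb(candidates, start):
    best = start
    while True:
        idx = candidates.index(best)
        neighbors = candidates[max(0, idx - 1):idx + 2]  # self + direct neighbors
        nxt = max(neighbors, key=measure)
        if nxt == best:
            return best                                  # local optimum reached
        best = nxt

stripes = [1, 2, 4, 8, 16, 32, 64]
print(hill_climb(stripes, start=1))  # → 8
```

More sophisticated approaches from the literature replace this greedy walk with Bayesian optimization or model-based search to cope with noisy measurements and multiple interacting parameters.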
Kind of topic: overview
Supervisor: Philipp Martin
Utilisation of GPU Direct Storage in High Performance Computing
On modern systems, HPC workloads using GPUs have seen a dramatic rise. A majority of these workloads, particularly in the Artificial Intelligence and Big Data segments, also require access to large amounts of data. Traditionally, this data has to be loaded into system memory by the CPU and then transferred to GPU memory to be utilised. GPU Direct Storage circumvents this additional data path by loading the data directly into GPU memory, bypassing the system memory.
In this seminar thesis, the student should evaluate the possible advantages of GPU Direct Storage and recent advancements in this area. This involves some literature review and a deep dive into the HPC storage stack. There may be an opportunity to try out GPU Direct Storage on the RWTH Aachen CLAIX cluster and to compare it to the traditional approach of transferring data to the GPUs.
Kind of topic: dive-in
Supervisor: Philipp Martin
Evaluating Checkpointing Mechanisms for Enhanced Fault Tolerance in ML/DL Training on HPC Systems
Checkpointing is a critical technique in distributed training of machine learning (ML) and deep learning (DL) models, aimed at recovering from failures that may occur during long-running computations. While frequent checkpointing allows for quick recovery, it can lead to significant performance overhead due to the generation of numerous checkpoints. Recent advancements such as differential checkpointing have shown potential in reducing these costs, making them more relevant for use on computation-time-constrained systems like shared HPC clusters.
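The core idea of differential checkpointing can be sketched in a few lines: write a full snapshot periodically and, in between, store only the entries that changed. The dict-based state below is a deliberate simplification; real frameworks operate on tensors, optimizer state, and data-loader positions.

```python
# Sketch of differential checkpointing on a toy dict-based model state.
def make_delta(prev, curr):
    """Keep only the entries that differ from the previous checkpoint."""
    return {k: v for k, v in curr.items() if prev.get(k) != v}

def restore(full, deltas):
    """Rebuild the latest state from a full snapshot plus a chain of deltas."""
    state = dict(full)
    for d in deltas:
        state.update(d)
    return state

step0 = {"w1": 0.5, "w2": -1.2, "epoch": 0}
step1 = {"w1": 0.4, "w2": -1.2, "epoch": 1}   # only w1 and epoch changed
delta = make_delta(step0, step1)
print(delta)                                   # → {'w1': 0.4, 'epoch': 1}
print(restore(step0, [delta]) == step1)        # → True
```

The trade-off visible even in this sketch is central to the thesis topic: deltas shrink I/O volume per checkpoint, but recovery cost grows with the length of the delta chain since the last full snapshot.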
The seminar thesis will provide an overview of existing checkpointing mechanisms and evaluate their effectiveness in improving fault tolerance while minimizing computational overhead. It will compare traditional frequent checkpointing strategies with innovative approaches such as differential checkpointing. The thesis will discuss implementation considerations for deploying these mechanisms in shared HPC environments.
Kind of topic: overview
Supervisor: Dominik Viehhauser
Assessing Mixed-Precision Benchmarking for State-Of-The-Art GPU Architectures
The rapid growth of machine-learning workloads on large-scale systems has driven a shift toward accelerator hardware that is optimized for low-precision arithmetic (16-bit floating-point and below). Traditional HPC benchmarks such as the High-Performance Linpack (HPL) evaluate only double-precision performance and therefore no longer reflect the demands of modern applications. To address this gap, mixed-precision variants, most notably HPL-MxP and HPG-MxP, have been introduced to benchmark low-precision workloads.
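The computational approach at the heart of HPL-MxP is iterative refinement: solve the system cheaply in low precision, then correct the solution using residuals computed in high precision. The toy sketch below emulates half precision with the struct module and uses an invented 2x2 system; the real benchmark applies the same idea to a large LU factorization.

```python
# Toy iterative refinement, the core idea behind mixed-precision benchmarks:
# low-precision solve, double-precision residual correction.
import struct

def fp16(x):
    """Round to IEEE half precision, emulating a low-precision unit."""
    return struct.unpack('e', struct.pack('e', x))[0]

def solve2(A, rhs, prec=lambda v: v):
    """Direct 2x2 solve (Cramer's rule); prec rounds each result component."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [prec((rhs[0] * a22 - a12 * rhs[1]) / det),
            prec((a11 * rhs[1] - rhs[0] * a21) / det)]

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = solve2(A, b, prec=fp16)                    # cheap low-precision solve
for _ in range(3):                             # refinement in double precision
    r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
    s = max(abs(v) for v in r) or 1.0          # scale to avoid fp16 underflow
    d = solve2(A, [v / s for v in r], prec=fp16)
    x = [x[i] + s * d[i] for i in range(2)]
exact = solve2(A, b)                           # double-precision reference
print(max(abs(x[i] - exact[i]) for i in range(2)) < 1e-8)  # → True
```

The sketch shows why such benchmarks are representative of modern accelerators: the bulk of the arithmetic runs at low precision, yet the final accuracy matches a double-precision solve.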
This thesis investigates the current state of low-precision benchmarks and evaluates how their different computational approaches affect their representativeness for HPC and ML applications. This also includes examining whether these benchmarks succeed in capturing the increasing memory-boundedness of state-of-the-art accelerators.
Kind of topic: dive-in
Supervisor: Dominik Viehhauser
Evaluating Job Scheduler Modifications for HPC Sustainability Research
With the growing computational demands of scientific applications, the energy consumption and the resulting carbon emissions of HPC clusters are steadily increasing. Operators of HPC clusters seek to reduce their energy bill and carbon emissions by exploiting the natural variability of energy prices and carbon intensities of the energy mix. Energy prices and carbon intensities vary throughout the day due to the naturally fluctuating energy generation from renewable sources like solar and wind. To align cluster usage with these daily patterns, modifications to the job scheduler are necessary. Since the job scheduler provides the sole access point for users to submit work to the HPC cluster, any configuration changes need to be carefully evaluated before they can be deployed in production to ensure stable operation.
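A minimal scheduling experiment of the kind evaluated in this topic can be sketched by replaying a job list against an hourly carbon-intensity profile and comparing immediate submission with carbon-aware start times. All numbers below are invented, and the sketch deliberately ignores node capacity and queue interactions, which real evaluation methodologies must model.

```python
# Toy carbon-aware scheduling experiment on an invented intensity profile.
def emissions(schedule, intensity, power_kw=100.0):
    """schedule: list of (start_hour, duration_h); returns kg CO2."""
    total = 0.0
    for start, dur in schedule:
        for h in range(start, start + dur):
            total += power_kw * intensity[h % 24] / 1000.0  # g/kWh -> kg
    return total

def best_start(duration, intensity):
    """Start hour minimizing total carbon intensity over the job's runtime."""
    return min(range(24),
               key=lambda s: sum(intensity[(s + h) % 24] for h in range(duration)))

# Invented grid profile: cheap nights, midday solar dip, evening peak.
intensity = [300] * 6 + [450] * 6 + [250] * 4 + [500] * 8   # g CO2/kWh per hour
jobs = [3, 2]                                               # durations in hours
baseline = emissions([(8, d) for d in jobs], intensity)     # all start at 08:00
shifted = emissions([(best_start(d, intensity), d) for d in jobs], intensity)
print(baseline, shifted)  # → 225.0 125.0
```

Even this toy model surfaces the evaluation questions listed below: the result depends entirely on the assumed intensity profile, power model, and the absence of capacity constraints.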
Therefore, this thesis should compare different implementations and methodologies from related research for evaluating scheduler modifications. The discussion in this paper needs to address the following key aspects: Which modeling assumptions are made by each method? How accurately is the real system modeled and how does that influence the result accuracy? What data is required to employ the presented approaches? How quickly can scheduling experiments be repeated with modified inputs or configuration settings?
Kind of topic: overview
Supervisor: Christian Wassermann
Enhancing the Observability of HPC Applications with eBPF
The complexity and scale of today’s HPC systems challenge application developers and performance engineers alike. To assess the utilization achieved by a given application, runtime measurements form the basis of the typical HPC performance analysis workflow. For maximum utility, the data collection should be as accurate as possible while not distorting the execution of the analyzed application. A recent addition to the Linux toolbelt is eBPF, allowing sandboxed execution of user-defined programs within kernel space. Through the interception of kernel events, eBPF enables previously infeasible forms of observability.
In this thesis, the student should investigate the potential for eBPF in HPC environments specifically considering use cases related to performance analysis and system monitoring. By diving into a few selected papers, benefits and drawbacks of eBPF compared to traditional tools should be highlighted and critically assessed. Throughout the evaluation, technical details should be included to illustrate important low-level eBPF-specific concepts.
Kind of topic: dive-in
Supervisor: Christian Wassermann
Supervisors & Organization
Semih Burak
Jannis Klinkenberg
Philipp Martin
Ben Thärigen
Dominik Viehhauser
Christian Wassermann