Seminar Current Topics in High-Performance Computing

Content

High-performance computing is applied to speed up long-running scientific applications, for instance simulations in computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different forms: from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel computing with, e.g., MPI, OpenMP, or CUDA must be applied.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the classification of the topic in the overall context. This includes the appropriate preparation of concepts, approaches, and results of the given topic (also with respect to formalities and time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to demonstrate independent work by looking beyond the immediate scope of their own topic.

Schedule

This seminar belongs to the area of applied computer science. The topics are assigned during the introductory event. Then, the students work out the topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or in the exam period. Attendance is compulsory for the introductory event and the presentation block.
Furthermore, we will introduce the students to "Scientific Writing" and "Scientific Presenting" in computer science. Attendance at these two events is also compulsory.

The compulsory introductory event (kickoff) is scheduled for Wednesday, October 14, 3 p.m. - 5 p.m. The dates of the further compulsory meetings are TBA.

Furthermore, the seminar is an in-person event. That means that you need to be personally present for all compulsory parts of the seminar.

Registration/Application

Seats for this seminar are distributed exclusively through the global registration process of the computer science department. We appreciate it if you state your interest in HPC, as well as your prior knowledge in HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section during the registration process.

Requirements

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of the presentation and its authors, keeping track of the speaker time and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Attending the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

Usually, the mandatory meetings will be held in English. We prefer and encourage students to also write the report and give the presentation in English. However, German is also possible.

Types of Topics

We provide two flavors of seminar topics depending on the particular topic: (a) overview topics, and (b) dive-in topics. As the names suggest, overview topics survey an area broadly, while dive-in topics examine a specific subject in depth. Nevertheless, this categorization does not necessarily imply a strict "either-or" but rather provides a guideline for addressing the topic. In general, both types of topics are equally difficult to work on. However, they pose different challenges. In the topic list below, you can also find the corresponding categorization for each seminar topic.

Topics

Comparing CPU Offloading Methods for LLM Inference

Current Large Language Model (LLM) serving systems typically employ CPUs for request orchestration, scheduling, and data pre- and postprocessing, while the computationally intensive inference workload is executed on GPUs. However, the continued growth of models and increasingly diverse requests have substantially increased memory requirements. As a result, model weights and Key-Value (KV) caches often no longer fit entirely within GPU memory and are increasingly placed in the comparatively larger memory available on the CPU side. While this increases the effective memory capacity, it constrains performance because of the limited bandwidth and added latency of PCIe. At the same time, the most recent CPU architectures feature matrix acceleration hardware, similar to NVIDIA's Tensor Cores, which can deliver substantially improved LLM inference performance. Together, these developments suggest that CPUs may become a significant part of LLM inference: offloading parts of the computation to them makes better use of their compute capabilities and avoids the PCIe bottleneck. Emerging closely and tightly coupled hardware, such as the NVIDIA GH200, provides new opportunities for jointly optimizing memory placement and computation across CPUs and GPUs.
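As a rough illustration of the PCIe trade-off described above, the following sketch compares two ways of handling weights that live in CPU memory during one memory-bound decode step: copying them to the GPU over PCIe, or reading them in place on the CPU. The function names and all bandwidth figures are illustrative assumptions, not measurements of any particular system, and compute time is ignored on both sides.

```python
def transfer_time_s(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Time in seconds to move num_bytes at a sustained bandwidth given in GB/s."""
    return num_bytes / (bandwidth_gb_s * 1e9)


def cheaper_location(offloaded_bytes: float,
                     pcie_gb_s: float,
                     cpu_mem_gb_s: float) -> str:
    """Pick the cheaper place to process offloaded weights for one decode step.

    'gpu': copy the weights over PCIe first (GPU compute time is ignored,
           so this is a lower bound for the GPU path).
    'cpu': read the weights once from CPU memory and compute in place
           (assumes the step is memory-bound on the CPU).
    """
    gpu_path = transfer_time_s(offloaded_bytes, pcie_gb_s)
    cpu_path = transfer_time_s(offloaded_bytes, cpu_mem_gb_s)
    return "cpu" if cpu_path < gpu_path else "gpu"


# Hypothetical numbers: 10 GB of offloaded weights,
# 32 GB/s over PCIe, 200 GB/s CPU memory bandwidth.
weights = 10e9
print(transfer_time_s(weights, 32.0))    # PCIe path, seconds
print(transfer_time_s(weights, 200.0))   # CPU-memory path, seconds
print(cheaper_location(weights, 32.0, 200.0))
```

Such a model is deliberately crude; a seminar thesis would refine it with measured bandwidths and the actual compute throughput of both devices.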

In this seminar thesis, the student shall give a structured overview, categorization, and critical comparison/evaluation of the current state of offloading LLM inference to CPUs. They should highlight the differences and trade-offs between approaches, as well as identify open questions and research gaps. Ideally, a comparison to offloading methods and potentials in other deep learning architectures, such as diffusion models or mixture of experts (MoE), is made. In addition, a perspective on the opportunities and limitations of offloading on novel closely coupled hardware (Grace Hopper) should also be given.
Optionally, the student may also model the potential performance on the Grace Hopper and measure it with benchmarks (FLOP/s, memory bandwidth, ...). If desired, this can be extended to implementing and evaluating a small prototype offloading solution.

Kind of topic: overview
Supervisor: Tom Hilgers

Emerging Lossless Compression Methods for LLM Inference

The rapid scaling of Large Language Models (LLMs) has caused many models to exceed the memory capacity of a single GPU, outpacing hardware improvements. In practical deployment settings, this issue is further amplified by the use of Key-Value (KV) caching, which is essential for efficient autoregressive inference because it avoids redundant recomputation of attention states. While KV caching significantly reduces computational overhead, it introduces substantial additional pressure on memory capacity and bandwidth during inference, making the token generation (decode phase) memory-bound and thereby limiting throughput and scalability. Existing approaches to mitigate these constraints predominantly rely on cache eviction strategies or lossy compression techniques such as quantization and sparsification, which inevitably reduce the output quality. In contrast, lossless compression methods for model weights and KV caches are comparatively underexplored, with many open questions, challenges, and opportunities remaining.
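To make the memory pressure of KV caching concrete, the sketch below estimates the cache size from a model's shape using the common back-of-the-envelope formula (one key and one value tensor per layer). The function name and the example configuration are illustrative assumptions and do not refer to a specific model.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   seq_len: int,
                   batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """Estimate the KV cache size in bytes.

    The factor 2 accounts for storing both keys and values in every layer.
    bytes_per_value=2 corresponds to a 16-bit format such as FP16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration: 32 layers, 32 KV heads of dimension 128,
# a 4096-token context, batch size 1, 16-bit values.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=1)
print(size / 2**30, "GiB")
```

Because the size grows linearly with both sequence length and batch size, the cache can quickly rival the model weights in memory footprint.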

In this seminar thesis, the student shall give a structured overview, categorization, and critical comparison/evaluation of lossless compression methods for LLM inference, with a particular emphasis on KV cache compression. These methods should be placed within the broader context of memory management for LLM inference (cache management, eviction, and reuse; lossy compression; weight compression). Ideally, a comparison to compression methods and potentials in other deep learning architectures, such as diffusion models, is made. The thesis should focus in particular on highlighting the limits of existing approaches and on identifying underexplored research directions in the literature, such as the utilization of otherwise unused hardware components (e.g., video compression engines), other numerical formats (FP4, FP8, INT, ...), compression algorithms, and more.
A number of practical experiments are also possible in this thesis. These include, but are not limited to, setting up a testing pipeline, reimplementing methods (for example, where code is not publicly available), benchmarking hardware units, and investigating the student's own novel ideas.

Kind of topic: dive-in
Supervisor: Tom Hilgers

Detecting Bugs in High‑Performance Computing (HPC) Applications with Deep Learning and Large Language Models

Software‑engineering research shows that transformer‑based neural networks can support tasks such as bug localisation, program repair, and test‑case generation. Large language models (LLMs) like CodeBERT, Codex, GPT‑4, or Claude have demonstrated strong abilities to understand and generate source code, opening the possibility of automated bug detection. In the HPC domain, bugs (data races, deadlocks, incorrect message‑passing, off‑by‑one errors, etc.) are especially costly because they may remain hidden until a program runs at scale. Existing work tackles specific sub‑problems (e.g., data‑race detection in OpenMP, MPI misuse detection).

This seminar thesis will provide a critical, comparative survey of machine‑learning (ML) approaches that have been applied to detect programming errors in parallel/HPC applications, and evaluate how their reported performance stacks up against traditional static‑ and dynamic‑analysis tools.

Kind of topic: overview
Supervisor: Joachim Jenke

Evaluation of Modern C++ Interfaces for MPI - A Comparative Study

The original C++ bindings of the MPI (Message‑Passing Interface) standard were deprecated in the MPI‑2.2 specification (2009) and removed in MPI‑3.0 (2012). Since then, a variety of third‑party projects have introduced their own C++ interfaces to enable more idiomatic, type‑safe, and expressive usage of MPI from C++. These proposals differ widely in scope, design philosophy, and the extent to which they exploit recent C++ language features (e.g., templates, concepts, constexpr, ranges).

The thesis will survey, classify and evaluate the most relevant C++ APIs for MPI that have been published or released after the deprecation of the official bindings. The thesis will discuss advantages and disadvantages of the different proposals.
Possible questions to address in the thesis are: Which design goals do the individual APIs pursue (e.g., type safety, automatic deduction, zero‑overhead abstraction, integration of C++20 concepts, support for heterogeneous hardware)? How do the APIs differ in terms of usability (API surface, learning curve, documentation) and correctness guarantees (compile‑time checks, exception safety, RAII)? Are there measurable performance differences (latency, bandwidth, overhead of abstraction) when the same communication pattern is expressed with the different APIs? What recommendations can be derived for developers who need a C++ interface to MPI in various scenarios (high‑performance scientific codes, pedagogical examples, rapid prototyping)?

Kind of topic: overview
Supervisor: Joachim Jenke

Exploiting Task Graphs for Efficient GPU Offloading in Heterogeneous HPC Systems

The increasing adoption of heterogeneous computing architectures poses significant challenges for modern high-performance computing (HPC). While directive-based programming models like OpenMP simplify parallelization, their ability to fully exploit the performance potential of GPUs remains limited due to architectural mismatches and runtime overheads. Emerging execution models, such as task graphs, offer a promising solution by enabling fine-grained, dependency-aware scheduling of GPU workloads, thereby reducing synchronization bottlenecks and improving resource utilization. However, integrating such models into existing programming frameworks while maintaining portability and performance remains an open research challenge.

This seminar thesis aims to investigate the role of task graphs in enhancing the efficiency of GPU-accelerated applications. It will explore how task graphs can be leveraged to optimize unstructured parallel workloads, where traditional execution models struggle with load imbalance and synchronization overhead. The work will analyze core concepts such as graph construction, dependency management, and runtime scheduling, with a focus on OpenMP-based implementations. Furthermore, it will critically assess the trade-offs between structured and unstructured parallelism, evaluating performance gains in both scenarios. A comparative evaluation of scheduling strategies and their impact on GPU utilization will be conducted, supported by empirical performance data where available.
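The idea of dependency-aware scheduling mentioned above can be sketched in a few lines: given a task graph, a scheduler repeatedly launches every task whose predecessors have finished. The example below uses Python's standard `graphlib` module; the task names and the wave-by-wave execution are illustrative assumptions and do not reflect how a particular OpenMP or GPU runtime is implemented.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group tasks into waves that could run concurrently.

    dependencies maps each task to the set of tasks it depends on.
    A task becomes ready as soon as all of its predecessors are done.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # also raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks without pending deps
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished
    return waves


# Hypothetical offloading pipeline: two independent kernels
# between a host-to-device copy and a final reduction.
graph = {
    "copy_in": set(),
    "kernel_a": {"copy_in"},
    "kernel_b": {"copy_in"},
    "reduce": {"kernel_a", "kernel_b"},
    "copy_out": {"reduce"},
}
for wave in execution_waves(graph):
    print(wave)
```

Here `kernel_a` and `kernel_b` land in the same wave and could overlap on the device. A real runtime would typically release successors as soon as individual tasks complete instead of synchronizing after each wave.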

Kind of topic: overview
Supervisor: Jan Kraus

GPU-Initiated Communication for Scalable Distributed-Memory Applications on Heterogeneous Systems

The continued growth of GPU performance in modern high-performance computing (HPC) systems has shifted many applications from being computation-bound to communication-bound. While traditional communication frameworks such as MPI have enabled scalable distributed computing for decades, their CPU-centric design introduces synchronization overheads and latency that increasingly limit the performance of GPU-resident applications. In particular, applications with fine-grained communication patterns and strong-scaling requirements often suffer from reduced overlap between computation and communication, leading to underutilized accelerator resources and diminishing scalability.

This seminar thesis aims to investigate emerging GPU-initiated communication paradigms and their potential to address the limitations of conventional CPU-driven communication models. The work will explore the principles of GPU-resident communication, including one-sided communication mechanisms, remote memory access, and fine-grained synchronization techniques. Particular attention will be given to communication-computation overlap, dependency management, and the role of modern interconnect technologies in enabling low-latency data exchange between accelerators. Furthermore, the seminar will analyze how GPU-initiated communication can reduce control-path overheads and improve strong-scaling efficiency in distributed applications. A comparative discussion of traditional and GPU-driven communication approaches will be provided, highlighting their advantages, challenges, and applicability to future exascale and post-exascale HPC systems.

Kind of topic: overview
Supervisor: Jan Kraus

Challenges in Dynamic Load Balancing

Modern HPC systems have become increasingly heterogeneous, consisting of CPUs and GPUs or other specialized hardware accelerators.
At the same time, many real-world applications exhibit highly dynamic behavior, where computational demands shift significantly over time and are difficult to predict in advance.
Static load balancing strategies are therefore insufficient to efficiently utilize the available heterogeneous hardware resources, as they cannot adapt to runtime variations in task granularity or hardware availability. Dynamic load balancing addresses this shortcoming by continuously monitoring and redistributing workload at runtime, but introduces its own set of challenges, including overhead from task migration, synchronization costs, and the difficulty of making timely balancing decisions with incomplete global knowledge. Solving dynamic load balancing issues at the application level is time-consuming and requires domain-specific knowledge. Thus, existing approaches, such as DLB, SABO, or DCB, aim to solve the problem in general by dynamically redistributing available hardware resources between processes and threads.
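The gap between static and dynamic strategies can be illustrated with a toy model. The sketch below compares a static block partition with a simple dynamic scheme in which each task goes to whichever worker becomes idle first. It is a generic illustration under simplifying assumptions (known task costs, identical workers, no migration or synchronization overhead) and does not model DLB, SABO, or DCB; all names and numbers are hypothetical.

```python
import heapq


def static_makespan(task_costs, num_workers):
    """Completion time when contiguous blocks of tasks are assigned up front."""
    n = len(task_costs)
    block = -(-n // num_workers)  # ceiling division
    loads = [sum(task_costs[i:i + block]) for i in range(0, n, block)]
    return max(loads)


def dynamic_makespan(task_costs, num_workers):
    """Completion time when each task goes to the worker that is idle first."""
    finish_times = [0] * num_workers
    heapq.heapify(finish_times)
    for cost in task_costs:
        earliest = heapq.heappop(finish_times)   # next worker to become idle
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical workload: a few expensive tasks followed by cheap ones.
costs = [8, 7, 6, 5, 1, 1, 1, 1]
print(static_makespan(costs, 2))   # first worker gets all expensive tasks
print(dynamic_makespan(costs, 2))  # work is spread as workers become idle
```

In this example the static partition finishes after 26 time units, whereas the dynamic scheme reaches 15, which is the ideal for a total cost of 30 on two workers. In practice, the overheads ignored here reduce that advantage.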

In this seminar thesis, the student is expected to survey existing dynamic load balancing approaches and to critically discuss their respective limitations and open challenges.

Kind of topic: overview
Supervisor: Fabian Orland

Analysis of Noise on HPC Systems

"Noise" refers to activities on a system that do not originate in the actual (scientific) application running on the system. It can be caused by, e.g., background processes, events, or interrupts. The most prominent kind is system noise (also known as interference or jitter), usually due to activities of the operating system (OS) itself. Network-induced interrupts can also cause noise. Noise on an HPC system may result in high run-to-run variation of the actual scientific application (depending on the number of interrupts during the run), or in generally poor performance of an application. The latter may be caused, e.g., by noise-induced stalls in collective MPI operations and usually gets worse when scaling up the number of compute nodes used.
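Why noise tends to hurt more at scale can be seen from a simple probabilistic argument: a collective operation completes only when the slowest process arrives, so a single delayed process stalls all others. The sketch below assumes that each process is hit by a noise event independently with the same probability per step; this independence assumption, the function name, and the chosen probability are illustrative and not taken from the literature.

```python
def prob_step_delayed(p_noise: float, num_processes: int) -> float:
    """Probability that at least one process is delayed before a collective.

    Assumes each process is hit independently with probability p_noise,
    so the step is undisturbed only if all processes are undisturbed.
    """
    return 1.0 - (1.0 - p_noise) ** num_processes


# Hypothetical per-process noise probability of 0.1 % per step.
for n in (1, 64, 4096):
    print(n, prob_step_delayed(0.001, n))
```

Even a rare per-process disturbance therefore becomes an almost certain delay once enough processes take part in the same collective.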

The seminar thesis shall give an overview of different sources of noise occurring in a (Linux-based) HPC system (single node, multiple nodes, network, etc.). Furthermore, it shall summarize how noise (and its sources) can be detected, as well as the (quantitative) impact of noise (as found in the literature) on performance or other metrics (e.g., certain hardware counters). Moreover, it is desirable to work with measurements from CLAIX-2023 and/or CLAIX-2025 from some noise-related synthetic benchmarks (e.g., Netgauge, osnoise, OSU micro benchmarks, as explained in the start paper), analyze the results, and compare them to the results of the HPC system "El Capitan" (no. 1 system in Top500 11/2025) (see start paper) and possibly to other systems found in related work. Measurements from CLAIX-2023/2025 can be done by the student or provided by the supervisor. If, for any reason, CLAIX measurements are not available for a comparative analysis, the seminar thesis shall additionally cover possible solutions to mitigate system noise on HPC systems and look at the noise sensitivity of applications in more detail.

Kind of topic: dive-in
Supervisor: Sandra Wienke

More topics will follow soon...

Supervisors & Organization

Tom Hilgers
Joachim Jenke
Jan Kraus
Fabian Orland
Ben Thärigen
Sandra Wienke

Contact

he/him

Ben Thärigen