Topics
Exploring the Impact of HBM-Enabled CPUs
High-performance computing (HPC) continues to demand advanced memory solutions to handle increasingly complex and data-intensive workloads. While high-bandwidth memory (HBM) has already been employed extensively in GPU accelerators, its performance impact on CPU-based HPC systems has only recently been explored.
This seminar thesis investigates how HBM architectures influence the performance of demanding HPC workloads and compares the trade-offs between HBM and conventional DDR memory, including various memory modes. In addition, the evaluation will be partially conducted on the CLAIX supercomputer, offering practical insights into both the advantages and limitations of next-generation memory technologies in real-world HPC environments.
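For intuition, memory-bandwidth kernels in the style of the STREAM triad are the usual starting point for such a comparison. Below is a minimal, illustrative sketch (not the official STREAM benchmark); the array size is an arbitrary placeholder, and on a real system one would pin threads, bind the allocations to the HBM or DDR NUMA domains (e.g., via numactl), and repeat the measurement.

```cpp
// Minimal STREAM-triad-style bandwidth probe (illustrative sketch only).
// Run with memory bound to HBM vs. DDR NUMA domains to compare sustained bandwidth.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1ull << 27;            // ~128M doubles per array (~1 GiB each), placeholder size
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    const double scalar = 3.0;

    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + scalar * b[i];             // triad: two loads and one store per element
    auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double gib = 3.0 * n * sizeof(double) / (1024.0 * 1024.0 * 1024.0);
    std::printf("checksum %.1f, triad bandwidth: %.1f GiB/s\n", c[n / 2], gib / seconds);
}
```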
Kind of topic: overview
Supervisor: Tim Cramer
Governing OpenMP Task Scheduling Policies
OpenMP runtime implementations do not always choose the optimal strategy for queueing and scheduling tasks, and different implementations behave differently. A proposal suggests adding new hints that let the application developer guide the runtime behavior.
This seminar thesis will provide an overview of the different documented (or empirically observed) queueing and scheduling strategies of OpenMP runtime implementations. It will survey OpenMP tasking-related publications and summarize strategies to improve task scheduling behavior with existing techniques as well as proposals for extending the OpenMP API.
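As background, the OpenMP specification already contains one such scheduling hint, the priority clause on the task construct, which the runtime is free to honor or ignore; the extensions surveyed in this thesis go beyond it. A minimal sketch:

```cpp
// Sketch: the existing OpenMP `priority` clause is already a scheduling hint;
// the proposals discussed in this topic go further (e.g., hints on queueing strategy).
#include <cstdio>
#include <omp.h>

void process(int i) { std::printf("task %d on thread %d\n", i, omp_get_thread_num()); }

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < 8; ++i) {
            // Higher priority values *suggest* earlier execution; the runtime
            // may ignore the hint, which is exactly the behavior gap the
            // proposed extensions try to close.
            #pragma omp task priority(i) firstprivate(i)
            process(i);
        }
        #pragma omp taskwait
    }
}
```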
Kind of topic: overview
Supervisor: Joachim Jenke
Mitigating Load Imbalances in Hybrid MPI+OpenMP codes
Load imbalances are one limiting factor for parallel scalability. In cases where avoiding the load imbalance is difficult, thread malleability can help by temporarily increasing the resources available to a process with a higher workload. Different frameworks have been presented that provide such malleability to applications.
The seminar thesis will provide an overview of the available works and compare their approaches. It will further discuss the performance gains reported in the related works.
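To make the idea concrete, the following sketch only varies the thread count inside a single process based on a naive workload heuristic; the surveyed malleability frameworks instead negotiate such changes between processes, so that cores idle on a lightly loaded rank can temporarily be used by a heavily loaded one. All names and the heuristic are illustrative assumptions.

```cpp
// Conceptual sketch only: plain OpenMP can merely vary the thread count within
// one process; real malleability frameworks coordinate this across MPI ranks.
#include <cstddef>

// Stand-in for the application's per-item work (assumption for illustration).
static void compute_chunk(std::size_t) { /* ... */ }

void process_rank_local_work(std::size_t my_items, std::size_t avg_items,
                             int base_threads) {
    // Ask for more threads when this rank has above-average work; a real
    // framework would only grant threads that are actually free on the node.
    int threads = (my_items > avg_items) ? 2 * base_threads : base_threads;

    #pragma omp parallel for num_threads(threads) schedule(dynamic)
    for (std::size_t i = 0; i < my_items; ++i)
        compute_chunk(i);
}

int main() {
    process_rank_local_work(/*my_items=*/1000, /*avg_items=*/600, /*base_threads=*/4);
}
```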
Kind of topic: overview
Supervisor: Joachim Jenke
Evaluating Task-Based Distributed Computing for Heterogeneous Architectures
Maximizing performance in HPC applications targeting heterogeneous architectures across multiple nodes has traditionally relied on manual workload distribution and resource tuning. These systems, composed of diverse compute units such as CPUs, GPUs, and specialized accelerators, present significant challenges in achieving balanced execution. Manual approaches often struggle to adapt to dynamic workloads or evolving resource availability, especially at scale.
To address these challenges, task-based programming models have emerged as a flexible alternative. By abstracting computation into fine-grained tasks with explicit dependencies, task-based runtimes can dynamically schedule and balance workloads across heterogeneous resources and distributed memory nodes. This seminar thesis should investigate the capabilities and conceptual foundations of task-based runtimes for heterogeneous, distributed systems. It should further compare selected frameworks in terms of scheduling, load balancing, and dependency management, and present performance measurements to assess their effectiveness in real-world scenarios.
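As a single-node illustration of the underlying abstraction, the sketch below uses OpenMP tasks with explicit depend clauses; distributed, heterogeneous runtimes (for example StarPU, Legion, HPX, or PaRSEC) build on the same task-and-dependency idea but schedule tasks across GPUs and nodes. The example is purely conceptual and not tied to any particular framework to be compared.

```cpp
// Conceptual task-and-dependency example: the runtime derives the execution
// order from the declared data dependencies instead of a manual schedule.
#include <cstdio>

int main() {
    double a = 0.0, b = 0.0, c = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1.0;                       // producer task
        #pragma omp task depend(out: b)
        b = 2.0;                       // independent task, may run concurrently
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                     // consumer: starts only after both inputs are ready
        #pragma omp taskwait
        std::printf("c = %f\n", c);
    }
}
```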
Kind of topic: overview
Supervisor: Jan Kraus
Investigating the Performance of Stdpar Offloading Implementations
With the introduction of C++17, the C++ standard library gained support for parallel algorithms through execution policies (e.g., std::execution::par), enabling more expressive and efficient parallelism on CPUs. While originally limited to CPU execution, recent compiler developments have extended this capability to support GPU offloading, making it possible to run standard parallel algorithms on heterogeneous systems without abandoning familiar C++ abstractions.
This seminar thesis should examine the current state of GPU-enabled stdpar implementations, focusing in particular on the support provided by NVIDIA's nvc++ compiler and AdaptiveCpp. The student should analyze how each implementation maps standard parallel algorithms to the GPU, and evaluate their strengths and limitations in terms of programmability and performance. Comparative benchmarks on representative workloads can be conducted optionally to assess the efficiency and practicality of using stdpar as a unified parallel programming model for both CPU and GPU execution.
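A minimal example of the programming model in question is shown below: a standard algorithm invoked with a parallel execution policy. With NVIDIA's nvc++ and -stdpar=gpu (and, analogously, with AdaptiveCpp's stdpar support) such code can be offloaded to a GPU; with any conforming C++17 toolchain it runs in parallel on the CPU. Sizes and values are placeholders.

```cpp
// Standard parallel algorithms ("stdpar"): the same source runs on CPU or,
// with an offloading compiler such as nvc++ -stdpar=gpu, on a GPU.
#include <algorithm>
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> x(1 << 24);
    std::iota(x.begin(), x.end(), 0.0);

    // SAXPY-like update expressed as a standard parallel algorithm.
    std::transform(std::execution::par_unseq, x.begin(), x.end(), x.begin(),
                   [](double v) { return 2.0 * v + 1.0; });

    const double sum = std::reduce(std::execution::par_unseq, x.begin(), x.end(), 0.0);
    std::printf("sum = %f\n", sum);
}
```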
Kind of topic: dive-in
Supervisor: Jan Kraus
HPC Correctness Checking with AI: Challenges and Opportunities
Recently, AI methods have been used to tackle different research problems, mostly in the form of (fine-tuned) Large Language Models (LLMs) or other neural network approaches. Correctness checking for parallel programs (using MPI or OpenMP) typically relies on static analyses (data flow, control flow) or dynamic analyses (state tracking at runtime, post-mortem analysis of logs). The quality of a correctness checking algorithm depends in particular on its accuracy, e.g., whether an error in an incorrect program is detected (true positive) or whether the tool does not report an error on a correct program (true negative). Further, the tool report should be reproducible, i.e., when given the same code, it should always report the same result. At first glance, AI methods do not seem well-suited to an area where any inaccuracy, especially falsely reported errors (false positives), or flaky results may significantly reduce the user acceptance of such a tool. However, some researchers have still achieved acceptable results for certain verification problems using fine-tuned LLMs and Graph Neural Networks (GNNs).
The seminar thesis should discuss whether AI methods are suited as a replacement for or addition to classical correctness checking methods. This includes a systematic literature review of past research efforts on HPC correctness checking with AI methods. In that context, the thesis should discuss the challenges, such as limited training data, reproducibility of results, or sources of inaccuracies, and compare them with opportunities such as generalizability, accessibility, and scalability. Optionally, the student may perform their own classification quality studies, e.g., evaluating the detection quality of a given LLM on a set of benchmarks.
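For the optional classification-quality study, the usual metrics follow directly from the true/false positive and negative counts described above. A small sketch with placeholder counts (not results from any real tool):

```cpp
// Detection-quality metrics from a confusion matrix; all counts are fictional.
#include <cstdio>

int main() {
    const double tp = 40;   // erroneous programs correctly flagged
    const double fn = 10;   // erroneous programs missed
    const double tn = 45;   // correct programs correctly not flagged
    const double fp = 5;    // correct programs falsely flagged (hurts user acceptance most)

    const double precision = tp / (tp + fp);
    const double recall    = tp / (tp + fn);
    const double accuracy  = (tp + tn) / (tp + tn + fp + fn);
    const double f1        = 2.0 * precision * recall / (precision + recall);

    std::printf("precision %.2f  recall %.2f  accuracy %.2f  F1 %.2f\n",
                precision, recall, accuracy, f1);
}
```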
Kind of topic: overview
Supervisor: Simon Schwitanski
Data Race Detection in GPU Programs
GPU programming has become a standard approach for highly data-parallel computations such as matrix-matrix multiplications. A GPU comprises thousands of cores that can access different levels of shared memory concurrently, i.e., the hardware is inherently designed for concurrent memory accesses. However, programmers have to coordinate accesses to the memory appropriately. Two or more concurrent accesses to the same memory location by different threads, with at least one of them being a write and without proper synchronization, constitute a so-called "data race". A data race leads to undefined behavior of the program, i.e., anything can happen during the execution. This nondeterministic nature makes data races difficult to detect, since they might be hidden at development time and only become visible in production. Since data races are a typical problem in computer science, much effort has been spent on mature data race detection algorithms and tools such as ThreadSanitizer. However, most of these tools have been designed to detect data races on CPUs. Due to the different architecture of GPUs (significantly more cores, shared control flow via warps, less available main memory), existing data race detection tools for CPUs cannot be used for GPUs. For GPUs, only a few approaches have been proposed in the past: Some rely on detecting GPU data races by source code inspection (statically), others run the program in a simulator, while more recent (dynamic) approaches run natively on the GPU together with the program under analysis.
This seminar thesis should present the difficulties of data race detection on GPUs and give an overview of the different classes of data race detectors for GPU programs. The student should in particular focus on the recently proposed approach "HiRace" which performs a source code instrumentation of the GPU code and uses a state machine with constant memory overhead to perform the race detection at runtime. The seminar thesis should explain the race detection algorithm of HiRace, how it differs from previous work, and outline limitations of the approach. Optionally, some own experiments might be performed to reproduce the results of the HiRace authors.
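To illustrate the kind of defect such detectors target, the following sketch contains a deliberate GPU data race, written with OpenMP target offloading to stay in portable C++ rather than a specific GPU dialect; it is an illustrative assumption and not taken from the HiRace paper.

```cpp
// Deliberate GPU data race: all device threads update the same scalar without
// synchronization. Adding reduction(+: sum) or an atomic update removes the race.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double sum = 0.0;

    #pragma omp target teams distribute parallel for map(tofrom: sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0;                    // RACE: concurrent read-modify-write on sum

    std::printf("sum = %f (expected %d)\n", sum, n);   // result is nondeterministic
}
```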
Kind of topic: dive-in
Supervisor: Simon Schwitanski
GPU Stream Semantics for MPI
Modern HPC systems make extensive use of compute accelerators. Recent communication libraries, such as the collective communication libraries NCCL and RCCL, define stream-based semantics to better support GPU-accelerated applications. Although MPI is the de facto standard for distributed-memory communication in HPC, as of MPI 5.0 the standard still does not define GPU support, e.g., in the form of stream semantics.
In this thesis, the student should conduct a literature review on approaches to address GPU support in MPI programs. Optionally, the student may evaluate the status of proposed prototypes on CLAIX-2023.
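To illustrate the gap, the sketch below shows the common workaround used today: the host synchronizes the CUDA stream before handing a device buffer to MPI, which serializes GPU work and communication. It assumes a CUDA-aware MPI implementation; error handling is omitted and the memset merely stands in for an application kernel enqueued on the stream.

```cpp
// Status quo without stream semantics in MPI: synchronize, then communicate.
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double* d_buf = nullptr;
    cudaMalloc((void**)&d_buf, n * sizeof(double));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemsetAsync(d_buf, 0, n * sizeof(double), stream);   // stands in for a compute kernel on `stream`

    // Today: block the host until the stream has finished, then call MPI.
    // Stream-aware proposals instead let MPI operations be enqueued on the
    // stream itself, as NCCL/RCCL collectives already are.
    cudaStreamSynchronize(stream);
    if (rank == 0)      MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    MPI_Finalize();
}
```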
Kind of topic: overview
Supervisor: Felix Tomski
Collective Contracts for Message-Passing Parallel Programs
Extensive research exists on correctness checking of MPI programs, mainly focusing on dynamic approaches. One static approach to program verification is procedure contracts, which are widely used outside the HPC world, e.g., for serial C or Java programs.
In this thesis, the student should present the proposed contract theory for collective message-passing procedures and explain how it can be employed to verify the correctness of MPI programs. Optionally, the student may conduct their own experiments by evaluating the proposed approach on test cases from commonly used benchmark suites for correctness checking.
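For readers unfamiliar with procedure contracts, the following is a conventional serial contract in ACSL (the annotation language of Frama-C) on a small function written in the C subset of C++; it only illustrates the general requires/ensures idea and is not the collective contract theory that is the subject of this thesis.

```cpp
/* Serial procedure contract in ACSL: the precondition states what the caller
 * must guarantee, the postcondition what the procedure promises in return. */
#include <stddef.h>

/*@ requires n > 0;
  @ requires \valid_read(a + (0 .. n-1));
  @ ensures \forall integer i; 0 <= i < n ==> \result >= a[i];
  @*/
double max_of(const double* a, size_t n) {
    double m = a[0];
    for (size_t i = 1; i < n; ++i)
        if (a[i] > m) m = a[i];
    return m;
}
```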
Kind of topic: dive-in
Supervisor: Felix Tomski
Evaluation Methodologies of HPC Application Benchmarks for HPC Procurements
Benchmarking is an essential part of the procurement of HPC systems to ensure that the requested features of the new cluster will be, and afterwards actually were, delivered by the vendor as promised. In particular, before the request for proposals (RFP) of an HPC procurement is published, rigorous benchmarking on current HPC clusters and new testbed hardware must be carried out to "predict" the performance and energy consumption of the benchmarks on the potential hardware that vendors will deliver in the future. For that, the measured results must be evaluated and conclusions drawn for the HPC tender document. An additional consideration is which benchmarks to include in the HPC tender: HPC application benchmarks shall represent the workload on the cluster and are the focus of this work.
This seminar thesis shall investigate and compare different evaluation methodologies of HPC application benchmarks with respect to their usage in HPC procurements and acceptance tests. "Methodologies" here means the various approaches to compare and predict benchmark evaluation results. The figure of merit for evaluating benchmarks shall be considered in terms of performance (runtime, bandwidth, efficiency, ...), energy consumption, and categories that represent the cluster workload, e.g., so-called motifs or simply hardware-dominant behavior (compute, memory, IO, ...). HPC benchmark suites that are used in procurements and shall (at least) be investigated in this seminar thesis include, e.g., the JUPITER Benchmark Suite, the PRACE UEABS, the NERSC-10 Benchmark Suite, and the CORAL-2 benchmarks.
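One frequently used way to condense such measurements into a single figure of merit is a (weighted) geometric mean of per-benchmark speedups over a reference system; the sketch below shows the arithmetic with purely fictional benchmark names, numbers, and weights.

```cpp
// Weighted geometric mean of per-benchmark speedups as a procurement figure of
// merit; all values are placeholders, not measurements.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    struct Bench { const char* name; double ref_time; double new_time; double weight; };
    const std::vector<Bench> suite = {
        {"cfd_app",  120.0, 60.0, 0.4},   // compute-bound motif
        {"genomics",  90.0, 75.0, 0.3},   // memory-bound motif
        {"io_heavy",  50.0, 45.0, 0.3},   // IO-bound motif
    };

    double log_sum = 0.0, weight_sum = 0.0;
    for (const auto& b : suite) {
        const double speedup = b.ref_time / b.new_time;   // >1 means faster than the reference
        log_sum    += b.weight * std::log(speedup);
        weight_sum += b.weight;
    }
    const double fom = std::exp(log_sum / weight_sum);    // weighted geometric mean of speedups
    std::printf("aggregate figure of merit: %.2fx\n", fom);
}
```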
Kind of topic: overview
Supervisor: Sandra Wienke
More topics following soon...
Supervisors & Organization
Tim Cramer
Joachim Jenke
Jan Kraus
Simon Schwitanski
Ben Thärigen
Felix Tomski
Sandra Wienke