Seminar: Current Topics in High-Performance Computing

Content

High-performance computing is applied to speed up long-running scientific applications, for instance the simulation of computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different forms: ranging from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). Leveraging these systems requires parallel programming with, e.g., MPI, OpenMP, or CUDA.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the classification of that topic in the overall context. This includes the appropriate preparation of the concepts, approaches, and results of the given topic (also with respect to formalities and the time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to work independently and to look beyond the immediate scope of their own topic.

Schedule

This seminar belongs to the area of applied computer science. The topics are assigned during the introductory event. The students then work out their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or during the exam period. Attendance is compulsory for the introductory event and the presentation block.
Furthermore, we will introduce the students to "Scientific Writing" and "Scientific Presenting" in computer science. Attendance at these two events is also compulsory.

The compulsory introductory event (kickoff) is scheduled for Wednesday, October 15, 9 a.m. - 11 a.m. The dates of the next compulsory meetings are TBA.

Furthermore, the seminar is an in-person event. This means that you must be personally present for all compulsory parts of the seminar.

Registration / Application

Seats for this seminar are distributed only via the global registration process of the computer science department. We appreciate it if you state your interest in HPC, as well as your prior knowledge in HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section of the registration process.

Requisites

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as a session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of each presentation and its author, keeping track of the speaking time, and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Attendance of the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English; however, German is also possible.

Types of Topics

We provide two flavors of seminar topics: (a) overview topics and (b) dive-in topics. They work as the names suggest. Nevertheless, this categorization does not necessarily imply a strict "either-or" but rather provides a guideline for addressing the topic. In general, both types of topics are equally difficult to work on; however, they pose different challenges. In the topic list below, you can also find the corresponding categorization for each seminar topic.

Topics

Exploring the Impact of HBM-Enabled CPUs

High-performance computing (HPC) continues to demand advanced memory solutions to handle increasingly complex and data-intensive workloads. While high-bandwidth memory (HBM) has already been employed extensively in GPU accelerators, its performance impact on CPU-based HPC systems has only recently been explored.

This seminar thesis investigates how HBM architectures influence the performance of demanding HPC workloads and compares the trade-offs between HBM and conventional DDR memory, including various memory modes. In addition, the evaluation will be partially conducted on the CLAIX supercomputer, offering practical insights into both the advantages and limitations of next-generation memory technologies in real-world HPC environments.
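To illustrate the kind of measurement involved, the following is a minimal, hypothetical STREAM-triad-style sketch in C++ with OpenMP (not part of the official topic material): the bandwidth achieved by such a kernel is a simple proxy for the difference between HBM and DDR, or between memory modes; on a real system one would additionally bind the arrays to the desired memory, e.g., via numactl.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Minimal STREAM-triad-style bandwidth probe (illustrative sketch only).
    // Compile with OpenMP enabled (e.g., -fopenmp); the flag depends on the compiler.
    int main() {
        const std::size_t n = 1ull << 25;                 // ~256 MiB per array
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
        const double scalar = 3.0;

        auto t0 = std::chrono::steady_clock::now();
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] + scalar * b[i];                  // 2 loads + 1 store per element
        auto t1 = std::chrono::steady_clock::now();

        const double sec = std::chrono::duration<double>(t1 - t0).count();
        const double gbytes = 3.0 * n * sizeof(double) / 1e9;
        std::printf("triad bandwidth: %.2f GB/s\n", gbytes / sec);
        return (c[n / 2] == 7.0) ? 0 : 1;                 // sanity check
    }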

Kind of topic: overview
Supervisor: Tim Cramer

Governing OpenMP Task Scheduling Policies

OpenMP runtime implementations do not always choose the optimal strategy for queueing and scheduling tasks, and different implementations behave differently. A proposal suggests adding new hints that allow the application developer to guide the runtime behavior.
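As a minimal illustration of an existing hint, the OpenMP priority clause already lets the developer suggest an ordering, but the specification leaves the actual queueing and scheduling policy to the runtime. A small sketch (assuming a C++ compiler with OpenMP support):

    #include <cstdio>
    #include <omp.h>

    // The priority clause is only a hint: a runtime may (or may not) use it to
    // order its task queues; the OpenMP specification mandates no particular policy.
    int main() {
        std::printf("max task priority honored by this runtime: %d\n",
                    omp_get_max_task_priority());
        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < 8; ++i) {
                #pragma omp task priority(i) firstprivate(i)
                std::printf("task %d executed by thread %d\n", i, omp_get_thread_num());
            }
        }   // implicit barrier: all tasks finish before the parallel region ends
        return 0;
    }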

This seminar thesis will provide an overview of different documented (or empirically observed) queueing and scheduling strategies of OpenMP runtime implementations. It will survey OpenMP tasking-related publications and give an overview of strategies to improve task scheduling behavior with existing techniques as well as proposals for extending the OpenMP API.

Kind of topic: overview
Supervisor: Joachim Jenke

Mitigating Load Imbalances in Hybrid MPI+OpenMP codes

Load imbalances are one limiting factor for parallel scalability. In cases where avoiding the load imbalance is difficult, thread malleability can help to temporarily increase the available resources for a process with a higher workload. Different frameworks have been presented that provide such malleability to applications.
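The following hypothetical C++ sketch is not taken from any of the surveyed frameworks; it merely illustrates the problem: each MPI rank runs a fixed team of OpenMP threads, so the rank with the largest workload dictates the overall runtime while the threads of the faster ranks sit idle, which is exactly where malleability could reassign resources.

    #include <cmath>
    #include <cstdio>
    #include <mpi.h>

    // Deliberately imbalanced hybrid MPI+OpenMP example (illustrative sketch only).
    int main(int argc, char** argv) {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 10000000L * (rank + 1);   // higher ranks get more work
        double sum = 0.0, t0 = MPI_Wtime();
        #pragma omp parallel for reduction(+ : sum)
        for (long i = 0; i < n; ++i)
            sum += std::sin(i * 1e-6);

        double local = MPI_Wtime() - t0, slowest = 0.0;
        MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("slowest rank took %.2f s; faster ranks idle for this time (sum=%g)\n",
                        slowest, sum);
        MPI_Finalize();
        return 0;
    }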

The seminar thesis will provide an overview of the available frameworks and compare their approaches. It will further discuss the performance gains reported in the related work.

Kind of topic: overview
Supervisor: Joachim Jenke

Evaluating Task-Based Distributed Computing for Heterogeneous Architectures

Maximizing performance in HPC applications targeting heterogeneous architectures across multiple nodes has traditionally relied on manual workload distribution and resource tuning. These systems, composed of diverse compute units such as CPUs, GPUs, and specialized accelerators, present significant challenges in achieving balanced execution. Manual approaches often struggle to adapt to dynamic workloads or evolving resource availability, especially at scale.

To address these challenges, task-based programming models have emerged as a flexible alternative. By abstracting computation into fine-grained tasks with explicit dependencies, task-based runtimes can dynamically schedule and balance workloads across heterogeneous resources and distributed memory nodes. This seminar thesis should investigate the capabilities and conceptual foundations of task-based runtimes for heterogeneous, distributed systems. It should further compare selected frameworks in terms of scheduling, load balancing, and dependency management, and present performance measurements to assess their effectiveness in real-world scenarios.

Kind of topic: overview
Supervisor: Jan Kraus

Investigating the Performance of Stdpar Offloading Implementations

With the introduction of C++17, the C++ standard library gained support for parallel algorithms through execution policies (e.g., std::execution::par), enabling more expressive and efficient parallelism on CPUs. While originally limited to CPU execution, recent compiler developments have extended this capability to support GPU offloading, making it possible to run standard parallel algorithms on heterogeneous systems without abandoning familiar C++ abstractions.
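As a minimal sketch (assuming a toolchain with parallel-algorithm support), the same standard C++ code below runs in parallel on the CPU with, e.g., g++ -std=c++17 ... -ltbb, and can be offloaded to a GPU with, e.g., nvc++ -stdpar=gpu; the exact flags are compiler-specific.

    #include <algorithm>
    #include <cstdio>
    #include <execution>
    #include <numeric>
    #include <vector>

    // Standard C++17 parallel algorithms; whether they run on CPU threads or a GPU
    // is decided by the compiler/runtime, not by the source code.
    int main() {
        std::vector<double> x(1 << 24);
        std::iota(x.begin(), x.end(), 0.0);

        std::transform(std::execution::par_unseq, x.begin(), x.end(), x.begin(),
                       [](double v) { return v * v; });

        const double sum = std::reduce(std::execution::par_unseq, x.begin(), x.end());
        std::printf("sum of squares = %.6e\n", sum);
        return 0;
    }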

This seminar thesis should examine the current state of GPU-enabled stdpar implementations, focusing in particular on the support provided by NVIDIA's nvc++ compiler and AdaptiveCpp. The student should analyze how each implementation maps standard parallel algorithms to the GPU, and evaluate their strengths and limitations in terms of programmability and performance. Comparative benchmarks on representative workloads can be conducted optionally to assess the efficiency and practicality of using stdpar as a unified parallel programming model for both CPU and GPU execution.

Kind of topic: dive-in
Supervisor: Jan Kraus

HPC Correctness Checking with AI: Challenges and Opportunities

Recently, AI methods have been used to tackle different research problems, mostly in the form of (fine-tuned) Large Language Models (LLMs) or other neural network approaches. Correctness checking for parallel programs (using MPI or OpenMP) typically relies on static analyses (data flow, control flow) or dynamic analyses (state tracking at runtime, post-mortem analysis of logs). The quality of a correctness checking algorithm depends in particular on its accuracy, e.g., whether an error in an incorrect program is detected (true positive) or whether the tool does not report an error on a correct program (true negative). Further, the tool report should be reproducible, i.e., when given the same code, it should always report the same result. At first glance, AI methods do not seem to be well suited to an area where any inaccuracy, especially falsely reported errors (false positives), or flaky results may significantly reduce the user acceptance of such a tool. However, some researchers have still achieved acceptable results for certain verification problems using fine-tuned LLMs and Graph Neural Networks (GNNs).

The seminar thesis should discuss the question whether AI methods are suited as a replacement for or addition to classical correctness checking methods. This includes a systematic literature review of past research efforts on HPC correctness checking with AI methods. In that context, the thesis should discuss the challenges, such as limited training data, reproducibility of results, or sources of inaccuracies, and compare them with opportunities such as generalizability, accessibility, and scalability. Optionally, the student may perform their own classification quality studies, e.g., evaluating the detection quality of a given LLM on a set of benchmarks.

Kind of topic: overview
Supervisor: Simon Schwitanski

Data Race Detection in GPU Programs

GPU programming has become a standard approach for highly data-parallel computations such as matrix-matrix multiplications. A GPU comprises thousands of cores that can access different levels of shared memory concurrently, i.e., the hardware is inherently designed for concurrent memory accesses. However, programmers have to coordinate accesses to the memory appropriately. Two or more concurrent accesses to the same memory location by different threads, with at least one of them being a write and without proper synchronization, lead to a so-called "data race". A data race leads to undefined behavior of the program, i.e., anything can happen during the execution. This nondeterministic nature makes data races difficult to detect, since they might be hidden at development time and may only become visible in production. Since data races are a typical problem in computer science, much effort has been spent on mature data race detection algorithms and tools such as ThreadSanitizer. However, most of these tools have been designed to detect data races on CPUs. Due to the different architecture of GPUs (significantly more cores, shared control flow via warps, less available main memory), existing data race detection tools for CPUs cannot be used for GPUs. For GPUs, only a few approaches have been proposed in the past: some rely on detecting GPU data races by source code inspection (statically), others run the program in a simulator, while more recent (dynamic) approaches run natively on the GPU together with the program to be analyzed.
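For illustration only (not the CUDA-level setting that tools such as HiRace target), the following C++/OpenMP target-offload sketch contains a GPU data race: all device threads update the same counter without synchronization, so the printed result is typically wrong and varies between runs; adding reduction(+ : count) to the pragma would remove the race.

    #include <cstdio>

    // Illustrative GPU data race via OpenMP target offloading (needs an
    // offloading-capable compiler; otherwise the region falls back to the host).
    int main() {
        const int n = 1 << 20;
        int count = 0;

        #pragma omp target teams distribute parallel for map(tofrom: count)
        for (int i = 0; i < n; ++i)
            count += 1;                 // unsynchronized read-modify-write: data race

        std::printf("count = %d (expected %d)\n", count, n);
        return 0;
    }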

This seminar thesis should present the difficulties of data race detection on GPUs and give an overview of the different classes of data race detectors for GPU programs. The student should in particular focus on the recently proposed approach "HiRace", which performs a source code instrumentation of the GPU code and uses a state machine with constant memory overhead to perform the race detection at runtime. The seminar thesis should explain the race detection algorithm of HiRace, how it differs from previous work, and outline the limitations of the approach. Optionally, the student may perform their own experiments to reproduce the results of the HiRace authors.

Kind of topic: dive-in
Supervisor: Simon Schwitanski

GPU Stream Semantics for MPI

Modern HPC systems make extensive use of compute accelerators. Recent communication libraries, including the collective communication libraries NCCL and RCCL, define stream-based semantics to better support GPU-accelerated applications. Although MPI is the de facto standard for distributed-memory communication in HPC, as of MPI 5.0 the MPI standard still does not define GPU support, e.g., in the form of stream semantics.
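The difference can be sketched with a hypothetical, error-handling-free example (assuming one GPU per MPI rank): NCCL enqueues the collective on a CUDA stream, so it is ordered with kernels and copies on that stream, whereas MPI_Allreduce has no stream argument and must be synchronized with the host instead.

    #include <cuda_runtime.h>
    #include <mpi.h>
    #include <nccl.h>

    // Minimal NCCL all-reduce with stream semantics (illustrative sketch only;
    // all error checking omitted, one GPU per rank assumed).
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        cudaSetDevice(rank);                        // simplistic rank-to-GPU mapping

        ncclUniqueId id;
        if (rank == 0) ncclGetUniqueId(&id);
        MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
        ncclComm_t comm;
        ncclCommInitRank(&comm, size, id, rank);

        const size_t n = 1 << 20;
        float* d_buf;
        cudaMalloc(&d_buf, n * sizeof(float));
        cudaMemset(d_buf, 0, n * sizeof(float));
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // The collective is just another operation enqueued on 'stream'.
        ncclAllReduce(d_buf, d_buf, n, ncclFloat, ncclSum, comm, stream);
        cudaStreamSynchronize(stream);

        ncclCommDestroy(comm);
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }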

In this thesis, the student should conduct a literature review on approaches to address GPU support in MPI programs. Optionally, the student may evaluate the status of proposed prototypes on CLAIX-2023.

Kind of topic: overview
Supervisor: Felix Tomski

Collective Contracts for Message-Passing Parallel Programs

Extensive research exists on correctness checking MPI programs, mainly focusing on dynamic approaches. One static approach to program verification is procedure contracts, which are widely used outside the HPC world, e.g., for serial C or Java programs.
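For readers unfamiliar with contracts, the following hypothetical sketch annotates a serial C/C++ function with an ACSL-like contract (as used by deductive verifiers such as Frama-C); the collective contracts discussed in this topic lift the same requires/ensures idea to message-passing procedures such as MPI collectives.

    /*@ requires n > 0 && \valid_read(a + (0 .. n-1));
        assigns \nothing;
        ensures \forall integer i; 0 <= i < n ==> \result >= a[i];
    */
    int max_of(const int *a, int n) {
        int m = a[0];
        for (int i = 1; i < n; ++i)
            if (a[i] > m)
                m = a[i];
        return m;
    }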

In this thesis, the student should present the proposed contract theory for collective message-passing procedures and explain how it can be employed to verify the correctness of MPI programs. Optionally, the student may conduct their own experiments by evaluating the proposed approach on test cases from commonly used benchmark suites for correctness checking.

Kind of topic: dive-in
Supervisor: Felix Tomski

Evaluation Methodologies of HPC Application Benchmarks for HPC Procurements

Benchmarking is an essential part of the procurement of HPC systems to ensure that the requested features of the new cluster will be delivered by the vendor as promised. In particular, before the request for proposals (RFP) of an HPC procurement is published, rigorous benchmarking on current HPC clusters and new testbed hardware must be carried out in order to "predict" the performance and energy consumption of the benchmarks on the potential hardware that vendors will deliver in the future. For that, the measured results must be evaluated and conclusions drawn for the HPC tender document. An additional consideration is which benchmarks to include in the HPC tender: HPC application benchmarks shall represent the workload on the cluster and are the focus of this work.

This seminar thesis shall investigate and compare different evaluation methodologies of HPC application benchmarks with respect to their usage in HPC procurements and acceptance tests. "Methodologies" here means the various approaches used to compare and predict benchmark evaluation results. The figures of merit for evaluating benchmarks shall be performance (runtime, bandwidth, efficiency, ...), energy consumption, and categories that represent the cluster workload, e.g., so-called motifs or simply hardware-dominant behavior (compute, memory, IO, ...). HPC benchmark suites that are used in procurements and shall (at least) be investigated in this seminar thesis include, e.g., the JUPITER Benchmark Suite, the PRACE UEABS, the NERSC-10 Benchmark Suite, and the CORAL-2 benchmarks.

Kind of topic: overview
Supervisor: Sandra Wienke

More topics following soon...

Supervisors & Organization

Tim Cramer
Joachim Jenke
Jan Kraus
Simon Schwitanski
Ben Thärigen
Felix Tomski
Sandra Wienke

Contact

Ben Thärigen