Topics
Exploring the Impact of HBM-Enabled CPUs
High-performance computing (HPC) continues to demand advanced memory solutions to handle increasingly complex and data-intensive workloads. While high-bandwidth memory (HBM) has already been employed extensively in GPU accelerators, its performance impact on CPU-based HPC systems has only recently been explored.
This seminar thesis investigates how HBM architectures influence the performance of demanding HPC workloads and compares the trade-offs between HBM and conventional DDR memory, including various memory modes. In addition, the evaluation will be partially conducted on the CLAIX supercomputer, offering practical insights into both the advantages and limitations of next-generation memory technologies in real-world HPC environments.
Kind of topic: overview
Supervisor: Tim Cramer
Investigating the Influence of Write-Allocate Evasion on HPC Workloads
Modern processors rely heavily on caches to enhance performance by keeping recently accessed data close to the execution units. In conventional write-allocate cache architectures, however, the required read-before-write can introduce unnecessary overhead in certain scenarios. To mitigate this, many systems offer non-temporal (or “streaming”) stores that bypass the cache. Starting with the Ice Lake microarchitecture, Intel introduced a hardware optimization called Write-Allocate Evasion to reduce redundant memory traffic.
This seminar thesis requires the student to explain the Write-Allocate Evasion mechanism and analyze its impact on typical HPC workloads. An in-depth study of a selected benchmark (e.g., CloverLeaf) is expected. The student will have the opportunity to evaluate the effects on the CLAIX supercomputer.
Kind of topic: in-depth
Supervisor: Tim Cramer
Governing OpenMP Task Scheduling Policies
OpenMP runtime implementations don't always choose the optimal strategy for queueing and scheduling tasks, and different implementations behave differently. A proposal suggests adding new hints that allow the application developer to guide the runtime towards better behavior.
This seminar thesis will provide an overview of the documented (or empirically found) queueing and scheduling strategies of OpenMP runtime implementations. It will survey OpenMP tasking-related publications and summarize strategies for improving task scheduling behavior with existing techniques as well as proposals for extending the OpenMP API.
Kind of topic: overview
Supervisor: Joachim Jenke
Mitigating Load Imbalances in Hybrid MPI+OpenMP Codes
Load imbalances are one limiting factor for parallel scalability. In cases where avoiding the load imbalance is difficult, thread malleability can help to temporarily increase the resources available to a process with a higher workload. Several frameworks have been presented that provide such malleability to applications.
The seminar thesis will provide an overview of the available work and compare the approaches. It will further discuss the performance gains reported in the related work.
Kind of topic: overview
Supervisor: Joachim Jenke
Evaluating Task-Based Distributed Computing for Heterogeneous Architectures
Maximizing performance in HPC applications targeting heterogeneous architectures across multiple nodes has traditionally relied on manual workload distribution and resource tuning. These systems, composed of diverse compute units such as CPUs, GPUs, and specialized accelerators, present significant challenges in achieving balanced execution. Manual approaches often struggle to adapt to dynamic workloads or evolving resource availability, especially at scale.
To address these challenges, task-based programming models have emerged as a flexible alternative. By abstracting computation into fine-grained tasks with explicit dependencies, task-based runtimes can dynamically schedule and balance workloads across heterogeneous resources and distributed memory nodes. This seminar thesis should investigate the capabilities and conceptual foundations of task-based runtimes for heterogeneous, distributed systems. It should further compare selected frameworks in terms of scheduling, load balancing, and dependency management, and present performance measurements to assess their effectiveness in real-world scenarios.
Kind of topic: overview
Supervisor: Jan Kraus
Investigating the Performance of Stdpar Offloading Implementations
With the introduction of C++17, the C++ standard library gained support for parallel algorithms through execution policies in the std::execution namespace, enabling more expressive and efficient parallelism on CPUs. While originally limited to CPU execution, recent compiler developments have extended this capability to support GPU offloading, making it possible to run standard parallel algorithms on heterogeneous systems without abandoning familiar C++ abstractions.
This seminar thesis should examine the current state of GPU-enabled stdpar implementations, focusing in particular on the support provided by NVIDIA's nvc++ compiler and AdaptiveCpp. The student should analyze how each implementation maps standard parallel algorithms to the GPU, and evaluate their strengths and limitations in terms of programmability and performance. Comparative benchmarks on representative workloads can be conducted optionally to assess the efficiency and practicality of using stdpar as a unified parallel programming model for both CPU and GPU execution.
Kind of topic: dive-in
Supervisor: Jan Kraus
Comparison of Performance Analysis Tools for (OpenMP) Task-Based Applications
Modern applications often have complex dependencies between work packages, and standard bulk-synchronous parallelization approaches introduce unnecessary waiting times. Tasking allows work packages and their dependencies to be defined directly; the resulting tasks are then scheduled onto threads. This fine-grained synchronization can bring performance benefits. While the concept of tasking has existed in OpenMP for more than 15 years, common performance analysis tools still provide very few task-specific analysis options. Task-specific tools were proposed more than 10 years ago, but most of them are no longer developed or no longer exist at all. However, there have been a few publications in recent years that propose new tools for task-based analysis.
This thesis should provide an overview of old and new performance analysis tools that support an (OpenMP) task-specific analysis, based on an extensive literature review. The tools should be compared with regard to their functionality, usability, and overhead. If applicable, a categorization of the tools can also be included.
Kind of topic: overview
Supervisor: Ben Thärigen
Determining Parallelization Potential in Parallel Programs
In a world where applications grow larger and more complex by the day and computer systems are dominated by multi-core architectures, effective parallelization is crucial for achieving good performance. However, this can be quite a challenge when millions of lines of code and hundreds of code regions have to be considered. A lot of work has been done to (partially) automate the process of identifying code regions that are not only parallelizable but also promise adequate performance improvements.
For this paper, the student should perform an extensive literature review to discover different strategies for determining parallelization potential of code regions in modern programs. In the paper, the different approaches should be presented and compared to each other by evaluating their strengths and weaknesses.
Kind of topic: overview
Supervisor: Ben Thärigen
GPU Stream Semantics for MPI
Modern HPC systems make extensive use of compute accelerators. Recent communication libraries, including the collective communication libraries NCCL and RCCL, have been developed to define stream-based semantics to enhance support for GPU-accelerated applications. Although MPI is the de facto standard for distributed-memory communication in HPC, as of MPI 5.0, the MPI standard still does not define GPU support, e.g., in the form of stream semantics.
In this thesis, the student should conduct a literature review on approaches to address GPU support in MPI programs. Optionally, the student may evaluate the status of proposed prototypes on CLAIX-2023.
Kind of topic: overview
Supervisor: Felix Tomski
Collective Contracts for Message-Passing Parallel Programs
Extensive research exists on correctness checking of MPI programs, mainly focusing on dynamic approaches. One static approach to program verification is procedure contracts, which are widely used outside the HPC world, e.g., for serial C or Java programs.
In this thesis, the student should present the proposed contract theory for collective message-passing procedures and explain how it can be employed to verify the correctness of MPI programs. Optionally, the student may conduct their own experiments by evaluating the proposed approach on test cases from commonly used benchmark suites for correctness checking.
Kind of topic: dive-in
Supervisor: Felix Tomski
Evaluation Methodologies of HPC Application Benchmarks for HPC Procurements
Benchmarking is an essential part of the procurement of HPC systems: it ensures that the requested features of the new cluster are delivered by the vendor as promised. In particular, before the request for proposals (RFP) of an HPC procurement is published, rigorous benchmarking on current HPC clusters and new testbed hardware must be carried out in order to predict the performance and energy consumption of the benchmarks on hardware that vendors may deliver in the future. To this end, the measured results must be evaluated and conclusions drawn for the HPC tender document. A further consideration is which benchmarks to include in the HPC tender: HPC application benchmarks are meant to represent the workload on the cluster and are the focus of this work.
This seminar thesis shall investigate and compare different evaluation methodologies for HPC application benchmarks with respect to their use in HPC procurements and acceptance tests. "Methodologies" here means the various approaches to comparing and predicting benchmark evaluation results. The figures of merit for evaluating benchmarks shall be performance (runtime, bandwidth, efficiency, ...), energy consumption, and categories that represent the cluster workload, e.g., so-called motifs or simply the dominant hardware behavior (compute, memory, I/O, ...). HPC benchmark suites that are used in procurements and shall (at least) be investigated in this seminar thesis include the JUPITER Benchmark Suite, the PRACE UEABS, the NERSC-10 Benchmark Suite, and the CORAL-2 benchmarks.
Kind of topic: overview
Supervisor: Sandra Wienke
Supervisors & Organization
Tim Cramer
Joachim Jenke
Jan Kraus
Ben Thärigen
Felix Tomski
Sandra Wienke