Topics
Benchmarking All-Flash Storage for HPC
Modern HPC systems use complex storage setups, often combining a multitude of technologies. This includes different file systems for different use cases as well as differing hardware architectures. The shift to all-flash or hybrid storage systems is in full swing and has strong implications for performance expectations. Benchmarking I/O performance is an important part of predicting HPC application performance and also plays a crucial role in new hardware and software procurements.
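To make the notion of an I/O benchmark concrete, here is a minimal sketch of a sequential write-bandwidth measurement in C. File name, block size, and iteration count are made-up placeholders; real studies rely on established tools such as IOR:

```c
/* Minimal sketch of a write-bandwidth micro-benchmark (illustration only). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const size_t block = 16UL * 1024 * 1024;   /* 16 MiB per write */
    const int count = 64;                      /* 1 GiB in total   */
    char *buf = malloc(block);
    memset(buf, 0xAB, block);

    FILE *f = fopen("bench.dat", "wb");
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < count; i++)
        fwrite(buf, 1, block, f);
    fflush(f);
    fsync(fileno(f));    /* force data out of the page cache before stopping the clock */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    fclose(f);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f MiB/s\n", block * count / sec / (1024 * 1024));
    free(buf);
    return 0;
}
```

Production benchmarks additionally control caching effects (e.g., via O_DIRECT), vary access patterns, and run across many nodes; understanding which of these knobs a paper's setup turned is part of contextualizing its results.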
For this topic, the student is expected to understand and contextualize benchmark results from a given paper. This includes a discussion of the presented results and the presented experimental setup. In order to contextualize the work, research on the nature of HPC applications and file system setups will be necessary. The source is rich in specialized terms and technologies, but not all require an in-depth exploration.
Kind of topic: dive-in
Supervisor: Philipp Martin
Improving HPC I/O Performance for Specific Applications
Some HPC applications require a significant amount of file I/O, to the point where this aspect may matter more for the overall runtime of an application than compute or CPU performance. Tuning these applications to run faster requires approaches that target the file system layer. Several technologies exist that may yield performance improvements.
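One representative technique at that layer is collective MPI-IO, sketched below in C. This is a generic illustration, not taken from the case study; the file name and sizes are invented:

```c
/* Sketch: writing a distributed array with collective MPI-IO. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                       /* local elements per rank */
    double *data = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) data[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "field.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Each rank writes a disjoint block; the collective call lets the MPI
     * library aggregate requests before they hit the file system. */
    MPI_Offset off = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, off, data, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(data);
    MPI_Finalize();
    return 0;
}
```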
The student will receive a case study of performance improvements for a specific use case, namely that of geophysical models. These are typical HPC applications that were tuned by introducing a new file format and a new file system. The underlying technologies will have to be understood and the benchmark results will have to be evaluated. A critical examination of whether or not these approaches may yield similar results for other types of applications is desirable.
Kind of topic: dive-in
Supervisor: Philipp Martin
Checkpointing for I/O-using HPC Applications
Checkpointing is a mechanism that writes out computed data at certain intervals during an application's execution. This is done as fault mitigation and therefore trades overall performance for safety. It is beneficial to know the extent of this checkpointing in order to estimate its impact on performance.
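In its simplest form, the mechanism looks like the following C sketch; the interval, file naming, and state size are made-up placeholders:

```c
/* Minimal sketch of interval-based checkpointing. */
#include <stdio.h>

#define STEPS      10000
#define CKPT_EVERY 500   /* checkpoint interval: trades I/O cost for safety */

static void checkpoint(const double *state, int n, int step) {
    char name[64];
    snprintf(name, sizeof name, "ckpt_%06d.bin", step);
    FILE *f = fopen(name, "wb");
    fwrite(state, sizeof(double), n, f);
    fclose(f);
}

int main(void) {
    double state[1024] = {0};
    for (int step = 1; step <= STEPS; step++) {
        for (int i = 0; i < 1024; i++)      /* stand-in for real computation */
            state[i] += 1e-3;
        if (step % CKPT_EVERY == 0)
            checkpoint(state, 1024, step);  /* periodic fault-mitigation I/O */
    }
    return 0;
}
```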
The starting paper proposes a model for checkpointing in HPC applications that also perform I/O on their own. Since this is a journal paper, it is longer than the conference papers for the other topics. While this means more initial reading effort for the student, that is offset by less additional research, since the paper provides more in-depth explanations of the underlying technologies and related work. The student will have to explain the proposed model and evaluate its usefulness. In order to achieve this, they will need to understand the underlying technologies and how checkpointing works in general.
Kind of topic: dive-in
Supervisor: Philipp Martin
Distributed memory execution of nested OpenMP-style tasks
State-of-the-art programming approaches generally have a strict division between intra-node shared-memory parallelism and inter-node MPI communication. Tasking with dependencies offers a clean, dependable abstraction for a wide range of hardware and situations within a node, but research on task offloading between nodes is still relatively immature. Asynchronous task parallelism avoids the synchronization that is often required in hybrid MPI+OpenMP programs.
This seminar thesis will compare a task offloading extension of the OmpSs-2 programming model with CHAMELEON (reactive load balancing for hybrid MPI+OpenMP task-parallel applications) and MPI continuations + OpenMP detached tasks.
The extension of the OmpSs-2 programming model enables overlapping the construction of the distributed dependency graph, the enforcement of dependencies, data transfers, and task execution. CHAMELEON dynamically offloads the execution of OpenMP tasks to other MPI processes for dynamic load balancing. The combination of MPI continuations and OpenMP detached tasks makes it possible to create task dependency graphs across MPI processes.
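As a rough illustration of the detached-task building block, the following C sketch lets an OpenMP task dependency wait for a non-blocking receive. The busy-wait polling stands in for what an MPI-continuations runtime would do with callbacks; the program and its message values are invented:

```c
/* Run with two ranks: rank 0 sends, rank 1 receives via a detached task. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* MPI_THREAD_MULTIPLE is a conservative choice for this sketch. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double msg = 42.0;
        MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double buf = 0.0;
        MPI_Request req;
        MPI_Irecv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

        #pragma omp parallel num_threads(2)
        #pragma omp single
        {
            omp_event_handle_t ev;
            /* This task stays "incomplete" until ev is fulfilled, so its
             * depend(out) holds back the consumer without blocking a thread. */
            #pragma omp task detach(ev) depend(out: buf)
            { /* completion is signalled externally via omp_fulfill_event */ }

            #pragma omp task depend(in: buf)   /* runs once buf is released */
            printf("received %g\n", buf);

            int done = 0;                      /* poll; a continuations-style */
            while (!done)                      /* runtime would use callbacks */
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
            omp_fulfill_event(ev);
        }
    }
    MPI_Finalize();
    return 0;
}
```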
Kind of topic: dive-in (also possible as overview topic)
Supervisor: Joachim Jenke
Using OpenMP tasks to balance the load in numeric simulations
Numerical simulations on adaptive meshes often suffer from load imbalances. With motion in the simulation domain, the adaptive mesh needs to refine dynamically, which leads to load imbalances that move over time. Splitting the work into tasks can help balance the load, but can at the same time cause runtime overhead depending on the granularity of the generated tasks.
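The granularity trade-off can be made concrete with OpenMP's taskloop construct. The sketch below is a generic illustration; the cell count, grainsize, and per-cell update are invented:

```c
#include <stddef.h>

#define NCELLS 100000

/* Dummy per-cell update standing in for real mesh computation. */
static void update_cell(double *c) { *c *= 0.99; }

void sweep(double *cells) {
    /* grainsize steers the trade-off: large chunks mean few tasks and low
     * overhead but poor balance when refined cells cost more; small chunks
     * mean better balance at higher scheduling cost. */
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(512)
    for (size_t i = 0; i < NCELLS; i++)
        update_cell(&cells[i]);
}
```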
This seminar thesis will compile best practices and lessons learned from porting shared-memory applications to tasking and from optimizing the performance of tasking applications.
Kind of topic: overview
Supervisor: Joachim Jenke
Using the tasking paradigm in heterogeneous architectures
Task-based systems have gained popularity as they promise to exploit the computational power of complex heterogeneous systems. A common programming model is the so-called Sequential Task Flow (STF) model, which, unfortunately, has the intrinsic limitation of supporting only static task graphs. This leads to potential submission overhead and to a task graph that is not necessarily well adapted to execution on heterogeneous systems. A standard approach is to find a trade-off between the granularity needed by accelerator devices and the granularity required by CPU cores to achieve performance.
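The STF idea itself is small enough to sketch. The OpenMP-based example below mimics the sequential-submission-plus-inferred-dependencies style; StarPU expresses the same pattern via starpu_task_insert, which is not shown here:

```c
/* Sequential Task Flow in miniature: tasks are submitted in program order
 * and the runtime infers the graph from declared data accesses. */
#include <stdio.h>

int main(void) {
    double a = 0, b = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)      /* T1: produce a              */
        a = 1.0;
        #pragma omp task depend(out: b)      /* T2: produce b, parallel to T1 */
        b = 2.0;
        #pragma omp task depend(in: a, b)    /* T3: waits for T1 and T2    */
        printf("%g\n", a + b);
    }
    return 0;
}
```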
This seminar thesis will compare a recent extension of the STF model in StarPU enabling subgraphs at runtime with features provided by TaskFlow or OpenMP tasking.
Kind of topic: dive-in
Supervisor: Joachim Jenke
Representing applications using dynamic parallel patterns
Parallel patterns help application programmers focus mostly on their algorithm without concerning themselves with the general hurdles of parallel programming, e.g., data races, data movements, deadlocks, etc. While static parallel patterns, i.e., parallel patterns whose workload is known at compile time, can be optimized at compile time, dynamic parallel patterns may need additional optimization at runtime. To better understand the possibilities for compile-time optimization of dynamic parallel patterns, an overview of different dynamic parallel patterns is necessary.
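A minimal C sketch of the distinction, with invented function names: the first pattern's shape is fixed by the loop bounds, while the second unfolds depending on runtime data:

```c
#include <stddef.h>

/* Static pattern: a map whose shape is fixed once the bounds are known. */
void static_map(double *x, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        x[i] *= 2.0;
}

/* Dynamic pattern: divide-and-conquer whose task graph unfolds at runtime
 * depending on the data size; call from within a parallel/single region. */
void dyn_divide(double *x, size_t n) {
    if (n < 1024) {                          /* runtime cutoff */
        for (size_t i = 0; i < n; i++)
            x[i] *= 2.0;
        return;
    }
    #pragma omp task                         /* graph grows as data dictates */
    dyn_divide(x, n / 2);
    dyn_divide(x + n / 2, n - n / 2);
    #pragma omp taskwait
}
```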
The student will look at different approaches to parallel patterns and structured parallel programming and compile a set of dynamic parallel patterns.
Kind of topic: overview
Supervisor: Adrian Schmitz
High-Level code representation with MLIR
Various domain-specific languages (DSLs) are becoming more and more important for highly specialized tasks. For building such DSLs, tools like ANTLR or MontiCore are helpful, but in an HPC context it is desirable to reuse existing compiler optimizations to achieve the required performance. To this end, MLIR can be used to automatically lower the high-level representation to LLVM IR for low-level optimizations, alongside the high-level optimizations performed in MLIR itself.
The student will introduce MLIR and some of its use cases and compare it to the LLVM IR for high-level compiler optimizations.
Kind of topic: dive-in
Supervisor: Adrian Schmitz
Dataflow optimizations in C using SDFGs
The dataflow of an application allows for many different automated optimizations. Stateful DataFlow multiGraphs (SDFGs) enable such dataflow optimizations for heterogeneous architectures by representing the application in a graph structure. By lifting existing low-level C code to a high-level representation like SDFGs, existing code can be optimized without touching the original code.
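As a rough illustration (the kernel and the mapping comments are our own, not taken from a specific SDFG tool), this is the kind of C loop nest whose dataflow such lifting makes explicit:

```c
/* Every array access below is an explicit dataflow edge, so the loop nest
 * maps onto an SDFG with a map scope over i and memlets for A, x, and y. */
void matvec(int n, double A[n][n], const double *x, double *y) {
    for (int i = 0; i < n; i++) {        /* becomes a parallel map node  */
        double acc = 0.0;                /* transient scalar container   */
        for (int j = 0; j < n; j++)
            acc += A[i][j] * x[j];       /* read memlets: A[i][j], x[j]  */
        y[i] = acc;                      /* write memlet: y[i]           */
    }
}
```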
The student will look at SDFG optimizations of C code and outline the workflow and possible optimizations using SDFGs.
Kind of topic: dive-in
Supervisor: Adrian Schmitz
Regression Testing Frameworks for HPC Systems
Whether to check the operational status of an HPC system after a hardware or software update or to monitor its performance over time, regression tests are important for maintaining an HPC system. In the past, a number of frameworks for regression tests have emerged to ease the implementation of new tests, improve the maintainability of existing tests, and increase the portability of tests. The currently available frameworks vary in their approach, the types of tests they focus on, and the richness of their included test suites.
In this thesis, the student should first present an overview of different frameworks for regression tests. Additionally, the presented frameworks shall be compared regarding the supported types of tests (diagnostic vs. benchmarking, etc.), the tests they already include, and their extensibility. A subsequent discussion should make suggestions on which framework to use for a specific purpose (checking the operational state after an update versus monitoring performance over time).
Kind of topic: overview
Supervisor: Felix Tomski
Using Symbolic Execution for MPI Correctness Checking
Tools for program verification can be divided into static and dynamic methods. Static tools generally suffer from false positives due to their inherent abstraction, while dynamic tools are often overstrained by the non-determinism in distributed programs, since they work on concrete program input and execute one specific execution path. Symbolic execution, on the other hand, executes the program with symbolic inputs and thus explores different execution paths of an MPI program.
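A toy example of the principle, using KLEE's interface: the dispatch function and tag semantics are invented, and the MPI side is deliberately left out, which hints at why applying such engines to real MPI programs is non-trivial:

```c
/* The tag is made symbolic, so the engine explores all branches of the
 * dispatch instead of only the path taken by one concrete input. */
#include <klee/klee.h>
#include <assert.h>

int dispatch(int tag) {
    if (tag == 0)
        return 1;        /* path A: e.g., data message     */
    if (tag == 1)
        return 2;        /* path B: e.g., shutdown message */
    return -1;           /* path C: unexpected tag         */
}

int main(void) {
    int tag;
    klee_make_symbolic(&tag, sizeof tag, "tag");
    int r = dispatch(tag);
    assert(r != -1);     /* KLEE reports an input (tag >= 2) violating this */
    return 0;
}
```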
In this thesis, the student should conduct a literature review on the use cases of symbolic execution for testing MPI programs. Thereby, the following questions should be discussed: Where lie the potential use cases of symbolic execution in the field of MPI correctness checking? Are there any types of errors in MPI programs for which symbolic execution seems better suited than static or dynamic methods? Where are the limitations? What approaches already exist to test MPI programs with the help of symbolic execution?
Kind of topic: dive-in
Supervisor: Felix Tomski
Offloading Non-Blocking MPI Collectives to the Network
Since the introduction of non-blocking collectives in MPI-3, there have been many proposed approaches to increase the overlap of communication and computation. Besides software solutions, such as offloading the communication task to an additional thread, recent hardware also allows for hardware-based solutions that decrease the workload on the host system. Theoretically, the additional hardware appears to have the potential to further disburden the host system and achieve a higher overlap of communication and computation.
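The common software baseline these approaches are measured against looks roughly like this C sketch (function and variable names are invented); how much communication actually overlaps depends on the MPI library's progress engine or on offload hardware:

```c
#include <mpi.h>

void step(double *local, double *global, int n,
          double *indep, int m, MPI_Comm comm) {
    MPI_Request req;
    /* Start the reduction without blocking... */
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

    for (int i = 0; i < m; i++)          /* ...do work not needing 'global' */
        indep[i] *= 1.01;

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* results needed from here on */
}
```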
The student should first discuss (recently) proposed methods to achieve overlap of communication and computation for non-blocking MPI collectives. In a second step, the different approaches should be compared, e.g., regarding: Is the approach generally applicable to collective operations? For which collectives does one approach perform better than the others? Does it require special hardware, or are changes to the software sufficient? What amount of overlap was achieved?
Kind of topic: overview
Supervisor: Felix Tomski
Supervisors & Organization
Philipp Martin
Joachim Jenke
Adrian Schmitz
Felix Tomski
Sandra Wienke