Seminar Current Topics in High-Performance Computing

Content

High-performance computing is applied to speed up long-running scientific applications, for instance the simulation of computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different facets: ranging from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel programming with, e.g., MPI, OpenMP or CUDA must be applied.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the classification of the topic in the overall context. This includes the appropriate preparation of the concepts, approaches and results of the given topic (also with respect to formalities and time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to work independently and to look beyond the immediate scope of their own topic.

Schedule

This seminar belongs to the area of applied computer science. Topics are assigned during the introductory event. The students then work on their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or during the exam period. Attendance is compulsory for the introductory event and the presentation block.
Furthermore, we will introduce the students to "Scientific Writing" and "Scientific Presenting" in computer science. Attendance at these two events is also compulsory.

The introductory event (kickoff) is scheduled for October 12th, 10am - 12pm.
The next compulsory meetings are October 14th, 10:30am - 11:45am, and October 17th, 10:30am - 12pm.

We plan to hold the seminar as an in-person event (if regulations and coronavirus case numbers allow it). That means that you need to be personally present for all compulsory parts of the seminar.

Registration / Application

Seats for this seminar are allocated exclusively through the central registration process of the computer science department. We appreciate it if you state your interest in HPC, as well as your prior knowledge of HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section during the registration process.

Requirements

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of the presentation and its authors, keeping track of the speaking time, and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Attendance of the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English. However, German is also possible.

Types of Topics

We provide two flavors of seminar topics, depending on the particular topic: (a) overview topics, and (b) dive-in topics. These work as the names suggest. Nevertheless, this categorization does not necessarily imply a strict "either-or" but rather provides a guideline for addressing the topic. In general, both types of topics are equally difficult to work on, but they pose different challenges. In the topic list below, you can find the corresponding categorization for each seminar topic.

Topics

Benchmarking All-Flash Storage for HPC

Modern HPC systems use complex storage setups, often consisting of a multitude of technologies. This includes different file systems for different use cases, as well as differing hardware architectures. The shift to all-flash or hybrid storage systems is in full swing and has strong implications for performance expectations. Benchmarking I/O performance is an important part of predicting HPC application performance and also plays a crucial role in the procurement of new hardware and software.

For this topic, the student is expected to understand and contextualize benchmark results from a given paper. This includes a discussion of the presented results and the presented experimental setup. In order to contextualize the work, research on the nature of HPC applications and file system setups will be necessary. The source is rich in specialized terms and technologies, but not all require an in-depth exploration.

Kind of topic: dive-in
Supervisor: Philipp Martin

Improving HPC I/O Performance for Specific Applications

Some HPC applications require a significant amount of file I/O, to the point where that aspect may become more important for the overall runtime of an application than the compute or CPU performance. Tuning these applications to run faster requires approaches that target the file system layer. Several technologies exist that may yield performance improvements.
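
One common technique at the file-system layer that such tuning efforts build on is collective MPI-IO, where all processes write a shared file in a coordinated way. The following is a minimal sketch of this idea; the file name and data layout are purely illustrative and are not taken from the case study:

```c
/* Minimal sketch of collective MPI-IO: each rank writes its block of a
 * shared file in one coordinated call. File name and data layout are
 * purely illustrative. Build with an MPI compiler wrapper, e.g., mpicc. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                      /* local element count */
    double *buf = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i)
        buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: the MPI-IO layer can aggregate and align accesses,
     * which is often where the actual performance gain comes from. */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```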

The student will receive a case study of performance improvements for a specific use case, namely that of geophysical models. These are typical HPC applications that were tuned by introducing a new file format and a new file system. The underlying technologies will have to be understood and the benchmark results will have to be evaluated. A critical examination of whether or not these approaches may yield similar results for other types of applications is desirable.

Kind of topic: dive-in
Supervisor: Philipp Martin

Checkpointing for HPC Applications that Perform I/O

Checkpointing is a mechanism that writes out computed data at certain intervals during an application's execution. This is done as fault mitigation and therefore trades overall performance for safety. It is beneficial to know the extent of this checkpointing in order to estimate its impact on performance.

The starting paper proposes a model for checkpointing in HPC applications that also perform I/O of their own. Since this is a journal paper, it is longer than the conference papers for the other topics. While this means more initial reading effort for the student, it is offset by less additional research being required, since the paper provides more in-depth explanations of the underlying technologies and related work. The student will have to explain the proposed model and evaluate its usefulness. In order to achieve this, they will need to understand the underlying technologies and how checkpointing works in general.
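
For illustration, a very simplified periodic checkpoint loop might look like the sketch below. This toy example is not the model from the paper; the solver step and the checkpoint writer are placeholders, and the comment on Young's rule of thumb is only meant to hint at why such models are useful.

```c
/* Minimal sketch of periodic application-level checkpointing. This is an
 * illustrative toy, not the model from the starting paper; the solver step
 * and the checkpoint writer are placeholders. */
#include <stdio.h>

#define N 1000000

static double state[N];

/* placeholder: dump the current state to a file */
static void write_checkpoint(int step) {
    char name[64];
    snprintf(name, sizeof(name), "checkpoint_%06d.bin", step);
    FILE *f = fopen(name, "wb");
    if (f) { fwrite(state, sizeof(double), N, f); fclose(f); }
}

int main(void) {
    const int n_steps = 10000;
    /* Checkpoint interval: a classic rule of thumb (Young's approximation)
     * chooses it close to sqrt(2 * checkpoint_cost * MTBF); fixed here. */
    const int checkpoint_interval = 500;

    for (int step = 1; step <= n_steps; ++step) {
        state[step % N] += 1.0;          /* ... one solver time step ... */

        if (step % checkpoint_interval == 0)
            write_checkpoint(step);      /* fault-mitigation cost paid here */
    }
    return 0;
}
```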

Kind of topic: dive-in
Supervisor: Philipp Martin

Distributed memory execution of nested OpenMP-style tasks

State-of-the-art programming approaches generally have a strict division between intra-node shared-memory parallelism and inter-node MPI communication. Tasking with dependencies offers a clean, dependable abstraction for a wide range of hardware and situations within a node, but research on task offloading between nodes is still relatively immature. Asynchronous task parallelism avoids the synchronization that is often required when combining MPI with OpenMP tasks.

This seminar thesis will compare a task offloading extension of the OmpSs-2 programming model with CHAMELEON (reactive load balancing for hybrid MPI+OpenMP task-parallel applications) and MPI continuations + OpenMP detached tasks.

The extension of the OmpSs-2 programming model enables overlapping the construction of the distributed dependency graph, the enforcement of dependencies, data transfers, and task execution. CHAMELEON dynamically offloads the execution of OpenMP tasks to other MPI processes for dynamic load balancing. The combination of MPI continuations and OpenMP detached tasks makes it possible to create task dependency graphs across MPI processes.
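
To give a flavor of one of these building blocks, the following sketch shows an OpenMP 5.0 detached task whose completion is tied to a non-blocking MPI receive. The hand-written progress loop that calls MPI_Test and omp_fulfill_event is a deliberate simplification of what MPI continuations would automate; buffer size and message tag are made up for illustration.

```c
/* Minimal sketch: an OpenMP 5.0 detached task tied to a non-blocking MPI
 * receive. The consumer task depending on the buffer only runs after the
 * message has arrived. The explicit progress loop is a simplification of
 * what the MPI continuations proposal would automate. Requires >= 2 ranks. */
#include <mpi.h>
#include <omp.h>

#define N 1024

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[N];
    MPI_Request req = MPI_REQUEST_NULL;
    omp_event_handle_t ev;

    #pragma omp parallel
    #pragma omp single
    {
        if (rank == 1) {
            MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

            /* Placeholder task representing the pending receive: it only
             * completes once omp_fulfill_event(ev) has been called. */
            #pragma omp task detach(ev) depend(out: buf)
            { /* intentionally empty */ }

            /* Consumer: scheduled only after the detached task completed. */
            #pragma omp task depend(in: buf)
            { /* ... compute on buf ... */ }

            /* Simplified progress loop fulfilling the detach event. */
            int done = 0;
            while (!done)
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
            omp_fulfill_event(ev);
        } else if (rank == 0) {
            double out[N] = {0};
            MPI_Send(out, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```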

Kind of topic: dive-in (also possible as overview topic)
Supervisor: Joachim Jenke

Using OpenMP tasks to balance the load in numerical simulations

Numerical simulations on adaptive meshes often suffer from load imbalances. With motion in the simulation domain, the adaptive mesh needs to refine dynamically, which leads to load imbalances that move over time. Splitting the work into tasks can help balance the load, but, depending on the granularity of the generated tasks, can also introduce runtime overhead.
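
As a toy sketch of this granularity trade-off (an assumed example, not taken from the referenced work): creating one OpenMP task per group of mesh blocks, while merging very small blocks, bounds the number of tasks and thus the tasking overhead, while idle threads can still pick up work.

```c
/* Toy sketch of controlling task granularity: one OpenMP task per group of
 * mesh blocks, merging very small blocks so that tasking overhead stays
 * bounded. Block sizes and the update are illustrative placeholders. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { size_t n_cells; double *cells; } block_t;

static void update(block_t *b) {               /* placeholder cell update */
    for (size_t i = 0; i < b->n_cells; ++i)
        b->cells[i] *= 0.5;
}

int main(void) {
    enum { NB = 256 };
    block_t blocks[NB];
    for (size_t k = 0; k < NB; ++k) {           /* irregular block sizes */
        blocks[k].n_cells = 16 + (k * 131) % 2048;
        blocks[k].cells = calloc(blocks[k].n_cells, sizeof(double));
    }

    #pragma omp parallel
    #pragma omp single
    {
        size_t i = 0;
        while (i < NB) {
            /* Merge blocks until a minimum amount of work is reached. */
            size_t first = i, cells = 0;
            while (i < NB && cells < 4096)
                cells += blocks[i++].n_cells;

            #pragma omp task firstprivate(first, i)
            for (size_t b = first; b < i; ++b)
                update(&blocks[b]);
        }
    }   /* implicit barrier: all block updates are done here */

    printf("done\n");
    return 0;
}
```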

This seminar thesis will compile best practices and lessons learned from porting shared-memory applications to task-based parallelism and from optimizing the performance of tasking applications.

Kind of topic: overview
Supervisor: Joachim Jenke

Using the tasking paradigm in heterogeneous architectures

Task-based systems have gained popularity as they promise to exploit the computational power of complex heterogeneous systems. A common programming model is the so-called Sequential Task Flow (STF) model, which, unfortunately, has the intrinsic limitation of supporting only static task graphs. This leads to potential submission overhead and to a static task graph that is not necessarily well adapted for execution on heterogeneous systems. A standard approach is to find a trade-off between the granularity needed by accelerator devices and the granularity required by CPU cores to achieve performance.
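
To illustrate the sequential-task-flow idea, the sketch below expresses it with OpenMP task dependencies instead of StarPU's own API (purely as an analogy): tasks are submitted in sequential program order, and the runtime derives the task graph from the declared data accesses.

```c
/* Sketch of the sequential task flow (STF) idea, expressed with OpenMP
 * depend clauses as a stand-in for a runtime such as StarPU: tasks are
 * submitted in sequential order and the dependency graph is inferred from
 * the declared data accesses. */
#include <stdio.h>

int main(void) {
    double a = 0.0, b = 0.0, c = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)                   /* T1: produces a */
        a = 1.0;

        #pragma omp task depend(out: b)                   /* T2: produces b */
        b = 2.0;

        #pragma omp task depend(in: a, b) depend(out: c)  /* T3: a, b -> c  */
        c = a + b;

        #pragma omp task depend(in: c)                    /* T4: consumes c */
        printf("c = %f\n", c);
    }   /* T1 and T2 may run in parallel; T3 waits for both; T4 for T3 */
    return 0;
}
```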

This seminar thesis will compare a recent extension of the STF model in StarPU, which enables the creation of task subgraphs at runtime, with the features provided by TaskFlow or OpenMP tasking.

Kind of topic: dive-in
Supervisor: Joachim Jenke

Representing applications using dynamic parallel patterns

Parallel patterns help application programmers to focus mostly on their algorithm instead of concerning themselves with the general hurdles of parallel programming, e.g., data races, data movement, deadlocks, etc. While static parallel patterns, i.e., parallel patterns whose workload is known at compile time, can be optimized at compile time, dynamic parallel patterns may need additional optimization at runtime. To better understand the possibilities of compile-time optimization of dynamic parallel patterns, an overview of different dynamic parallel patterns is necessary.
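
As a small, made-up illustration: a "map" pattern whose per-element cost is only known at runtime cannot be partitioned well at compile time; switching to a dynamic schedule is one possible runtime remedy.

```c
/* Toy illustration of a dynamic parallel pattern: a "map" whose per-item
 * cost depends on runtime data, so a fixed compile-time partitioning is
 * suboptimal and runtime (dynamic) scheduling helps. */
#include <stdio.h>

#define N 1024

/* placeholder work whose cost varies per item and is only known at runtime */
static double process(int item, int cost) {
    double x = 0.0;
    for (int i = 0; i < cost * 10000; ++i)
        x += 1e-9 * i;
    return x + item;
}

int main(void) {
    int cost[N];
    double out[N];
    for (int i = 0; i < N; ++i)
        cost[i] = (i * 37) % 19;            /* irregular, runtime-dependent */

    /* With schedule(static) the uneven costs cause load imbalance;
     * schedule(dynamic) lets the runtime rebalance during execution. */
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < N; ++i)
        out[i] = process(i, cost[i]);

    printf("%f\n", out[N - 1]);
    return 0;
}
```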

The student will look at different approaches to parallel patterns and structured parallel programming and compile a set of dynamic parallel patterns.

Kind of topic: overview
Supervisor: Adrian Schmitz

High-Level code representation with MLIR

Domain-specific languages (DSLs) are becoming more and more important for highly specialized tasks. Tools like ANTLR or Monticore are helpful for building such DSLs, but in an HPC context it is desirable to reuse existing compiler optimizations to achieve competitive performance. To this end, MLIR can be used to automatically lower the high-level representation to LLVM IR for low-level optimizations, alongside the high-level optimizations performed in MLIR.

The student will introduce MLIR and some of its use cases and compare it to LLVM IR with respect to high-level compiler optimizations.

Kind of topic: dive-in
Supervisor: Adrian Schmitz

Dataflow optimizations in C using SDFGs

The dataflow of an application allows for many different automated optimizations. Stateful DataFlow multiGraphs (SDFGs) enable such dataflow optimizations for heterogeneous architectures by representing the application in a graph structure. By lifting existing low-level C code to a high-level representation like SDFGs, existing code can be optimized without touching the original code.

The student will look at SDFG optimizations of C code and outline the workflow and possible optimizations using SDFGs.

Kind of topic: dive-in
Supervisor: Adrian Schmitz

Regression Testing Frameworks for HPC Systems

Whether to check the operational status of an HPC system after a hardware or software update or to monitor its performance over time, regression tests are important for maintaining an HPC system. In the past, a number of frameworks for regression tests have emerged to ease the implementation of new tests, improve the maintainability of existing tests, and increase the portability of tests. The currently available frameworks vary in their approach, their focus regarding the type of tests, and the richness of their included test suites.

In this thesis, the student should first present an overview of different frameworks for regression tests. Additionally, the presented frameworks shall be compared regarding the supported types of tests (diagnostic vs. benchmarking, etc.), the tests they already include, and their extensibility. A subsequent discussion should make suggestions on which framework to use for a specific purpose (checking the operational state after an update versus monitoring the performance over time).

Kind of topic: overview
Supervisor: Felix Tomski

Using Symbolic Execution for MPI Correctness Checking

Tools for program verification can be divided into static and dynamic methods. However, static tools generally suffer from false positives due to their inherent abstraction, while dynamic tools are often overwhelmed by the non-determinism in distributed programs, since they work on concrete program inputs and execute a specific execution path. Symbolic execution, on the other hand, executes the program with symbolic inputs and thus explores different execution paths of an MPI program.
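
As a minimal illustration of the principle (assuming a KLEE-style engine that provides klee_make_symbolic; in practice, symbolically executing MPI code additionally requires a model of the MPI library): marking an input as symbolic lets the engine explore both communication paths in the sketch below, including the one where rank 1 blocks forever.

```c
/* Toy sketch of symbolic execution for an MPI program, assuming a KLEE-style
 * engine that provides klee_make_symbolic(). With a concrete input only one
 * branch is tested; with a symbolic input the engine explores both paths,
 * including the mismatched one where rank 1 waits forever. */
#include <mpi.h>
#include <klee/klee.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, x, buf = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    klee_make_symbolic(&x, sizeof(x), "x");   /* treat x as unknown input */

    if (rank == 0) {
        if (x > 0)                            /* path 1: matching send */
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        /* path 2 (x <= 0): no send is posted -> rank 1 blocks forever */
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```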

In this thesis, the student should conduct a literature review on the use cases of symbolic execution for testing MPI programs. Thereby, the following questions should be discussed: Where lie the potential use cases of symbolic execution in the field of MPI correctness checking? Are there types of errors in MPI programs for which symbolic execution seems to be better suited than static or dynamic methods? Where are the limitations? Which approaches already exist to test MPI programs with the help of symbolic execution?

Kind of topic: dive-in
Supervisor: Felix Tomski

Offloading Non-Blocking MPI Collectives to the Network

Since the introduction of non-blocking collectives in MPI-3, many approaches have been proposed to increase the overlap of communication and computation. Besides software solutions, such as offloading the communication to an additional thread, recent hardware also allows for hardware-based solutions which decrease the workload on the host system. Theoretically, the additional hardware has the potential to further disburden the host system and achieve a higher overlap of communication and computation.
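
The software baseline that these approaches try to improve on roughly follows the sketch below: post the non-blocking collective, perform independent computation, then wait. How much overlap is actually achieved depends on how the MPI library makes progress, which is exactly where progress threads or network offload come in.

```c
/* Minimal sketch of overlapping a non-blocking MPI collective with
 * computation: post the collective, compute on independent data, then wait.
 * How much overlap is actually achieved depends on how the MPI library
 * progresses the operation (progress thread, network offload, ...). */
#include <mpi.h>

#define N (1 << 20)

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    static double local[N], global[N], other[N];
    MPI_Request req;

    /* Post the non-blocking collective ... */
    MPI_Iallreduce(local, global, N, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... overlap with computation that does not depend on the result ... */
    for (int i = 0; i < N; ++i)
        other[i] = other[i] * 0.5 + 1.0;

    /* ... and only wait for completion when the result is needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```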

The student should first discuss (recently) proposed methods to achieve overlap of communication and computation for non-blocking MPI collectives. In a second step, the different approaches should be compared, e.g., regarding the following questions: Is the approach generally applicable to collective operations? For which collectives does one approach perform better than the others? Does it require special hardware, or are changes to the software sufficient? What amount of overlap is achieved?

Kind of topic: overview
Supervisor: Felix Tomski

Supervisors & Organization

Philipp Martin
Joachim Protze
Adrian Schmitz
Felix Tomski
Sandra Wienke

Contact

Sandra Wienke