Seminar Current Topics in High-Performance Computing

 

Content

High-performance computing is applied to speed up long-running scientific applications, for instance simulations in computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but they also come in different flavors: from clusters through (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel programming with, e.g., MPI, OpenMP, or CUDA must be applied.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the classification of the topic in the overall context. This includes the appropriate preparation of concepts, approaches, and results of the given topic (also with respect to formalities and time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to demonstrate independent work by looking beyond the immediate scope of their own topic.

Schedule

This seminar belongs to the area of applied computer science. The topics are assigned during the introductory event. The students then work on their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or in the exam period. Attendance is compulsory for the introductory event and the presentation block.
Furthermore, we will introduce the students to "Scientific Writing" and "Scientific Presenting" in computer science. Attendance at these two events is also compulsory.

The introductory event (kickoff) is scheduled for April 6th, 2022, 2pm - 4pm.
The next compulsory meeting is April 8th, 2022, 2pm - 4pm.

Furthermore, we plan to hold the seminar as an in-person event (if regulations and COVID-19 case numbers allow it). This means that you need to be present in person for all compulsory parts of the seminar.

Registration/Application

Seats for this seminar are distributed exclusively through the global registration process of the computer science department. We would appreciate it if you stated your interest in HPC, as well as your prior knowledge in HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section during the registration process.

Requirements

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of the presentation and its authors, keeping track of the speaker time and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Attending the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English, but German is also possible.

Types of Topics

We provide two flavors of seminar topics depending on the particular topic: (a) overview topics, and (b) dive-in topics. It works as the names suggest. Nevertheless, this categorization does not necessarily imply a strict "either-or" but rather provides a guideline for addressing the topic. In general, both types of topics are equally difficult to work on. However, they have different challenges. In the topic list below, you can also find the corresponding categorizations for the seminar topic types.

 

Topics

An Overview of OpenMP Target Device Offloading to FPGAs

OpenMP is the de facto standard for parallel shared-memory programming. With the introduction of target device offloading, it also supports any kind of accelerator device. Convenience, acceptance, (performance) portability, and usability require the broad availability of corresponding implementations for accelerator devices such as GPUs, FPGAs, or vector engines. Because hardware synthesis for FPGAs is complex and time-consuming, the design of OpenMP-to-FPGA compilers is challenging. Although several groups have developed different prototypes, this field is still in a proof-of-concept state.

In this thesis, the student should give an overview of existing approaches for OpenMP Target Device Offloading to FPGAs. This includes an in-depth analysis of the different techniques and a discussion of the opportunities and limitations of the different approaches. This discussion should include a perspective on a generic device offloading framework.

Kind of topic: overview
Supervisor: Tim Cramer

Burst Buffer: Use Cases, Evaluation, and Optimization (English only)

A burst buffer is an intermediate storage layer between the parallel file system and the compute nodes of modern HPC systems. It has gained popularity in recent years as a solution for data-heavy modern applications.

In this seminar, the student will explore use cases for burst buffers, characterize their performance patterns, and evaluate optimization strategies for using burst buffers.

Kind of topic: overview
Supervisor: Radita Liem

Managing and Optimizing Metadata for Data-heavy Applications in HPC (English only)

Metadata performance is crucial not only for interactive users but also for data-heavy applications.

In this seminar, we want to understand which factors affect metadata performance, how metadata performance contributes to overall application performance, and which strategies exist for optimizing it.

Kind of topic: overview
Supervisor: Radita Liem

An Overview of Performance Models for Coupled HPC + ML Applications

Driven by the recent advances in artificial intelligence, especially machine learning and deep learning, new use cases that couple highly parallel simulations with machine learning models are currently emerging. Such a coupled HPC + ML application requires a heterogeneous architecture combining pure CPU compute nodes with nodes equipped with one or more GPU devices. Performance analysis for these new kinds of HPC applications is challenging and a subject of current research because multiple different aspects have to be considered. For the communication of data between CPU and GPU nodes, a network model (e.g., LogP) is needed. Moving the data from host to device memory (and vice versa) requires a model of the data transfer over the PCIe bus. Finally, the performance of the actual neural network inference depends on the performance of the individual operators the neural network consists of.

In this thesis, the student should give an overview of existing performance models that cover the different aspects that need to be modeled for a coupled HPC + ML application. For each model, the student should elaborate on its usability in the context of coupled HPC + ML applications by discussing the model's limitations.

Kind of topic: overview
Supervisor: Fabian Orland

Coupling Machine Learning with Large Eddy Simulations of Combustion

In the field of energy conversion, an active topic of research is the investigation of alternative fuels (e.g., hydrogen) to be used in combustion processes. From a physical perspective, a combustion process constitutes a turbulent reactive flow problem. Accurate numerical solutions can be obtained by performing a Direct Numerical Simulation (DNS) to solve the governing equations. In practice, this is not feasible because very high spatial and temporal resolutions are required, and thus a Large Eddy Simulation (LES) is employed. To still capture the effects of the smallest length scales, subgrid-scale (SGS) models need to be employed.

In recent works, subgrid-scale models have been successfully learned by artificial and convolutional neural networks to predict, for example, reaction rates and the variance of the progress variable. In this thesis, the student should present an overview of the different ideas for how machine learning models can replace SGS models in Large Eddy Simulations.

Kind of topic: dive-in
Supervisor: Fabian Orland

Novel Hardware Architectures for Accelerating Machine Learning Applications

Training and inference of machine learning models involve highly parallel matrix operations that are well suited for execution on accelerator devices such as GPGPUs. Even though GPUs are designed as general-purpose devices, vendors such as Nvidia have implemented hardware features like tensor cores that specifically target the acceleration of machine learning operations. The Japanese vendor NEC offers an accelerator device called SX-Aurora TSUBASA, which provides a huge memory bandwidth paired with large vector cores. Google has also developed its own Tensor Processing Units (TPUs), specifically designed to support its machine learning framework TensorFlow. Other highly specialized architectures are the Intelligence Processing Unit (IPU) developed by Graphcore and the Wafer Scale Engine developed by Cerebras.

In this thesis, the student should first give an overview of existing processor architectures used to accelerate machine learning operations. Furthermore, the student should compare these architectures and discuss their performance, cost, energy consumption, or other aspects.

Kind of topic: overview
Supervisor: Fabian Orland

Fault-Tolerant MPI

Ever-increasing node counts in HPC systems lead to a mean time between failures that is shorter than the typical job duration. Besides checkpoint/restart strategies, the HPC community also considers fault tolerance in MPI jobs as a possible mitigation.

The seminar thesis might look into applications using such strategies, or provide an overview of MPI extensions targeting fault tolerance support.

Kind of topic: overview
Supervisor: Joachim Protze

Static Analysis of MPI Communication Correctness

Static analysis tools can help to detect issues in MPI communication patterns. In some cases, static analysis tools can detect issues that dynamic analysis cannot detect. On the other hand, false positive reports from static analysis tools can be overwhelming.

The seminar thesis will look into the static analysis capabilities of PARCOACH as a well-known representative of static analysis tools for MPI.

Kind of topic: dive-in
Supervisor: Joachim Protze

Frameworks for MPI Verification Tools Evaluation

Dynamic and static analysis tools are used to evaluate the correctness of parallel applications. But how reliable are those tools? For example, which parts of the MPI standard does a tool cover? Recently, several benchmark suites were published to assess the coverage of MPI correctness analysis tools.

The seminar thesis will compare the benchmarks regarding coverage of the MPI standard and the expressiveness of their evaluation results.

Kind of topic: dive-in
Supervisor: Joachim Protze

Improving the Efficiency of MPI One-Sided Communication

The Message Passing Interface (MPI) enables nodes in a cluster to communicate with each other. The classical kind of communication in MPI is two-sided point-to-point communication (P2P): The sending node ("sender") sends the message to a receiving node ("receiver"), which actively waits for the message to arrive. In MPI one-sided communication (MPI RMA), on the other hand, the sending node (called "origin") can directly modify the memory of a target node (called "target") without the target node being involved in the communication. This has the advantage that the target node does not have to wait for the message to arrive, but can instead continue with its computation. In modern MPI implementations, one-sided communication is achieved via Remote Direct Memory Access (RDMA).

The current interface specification of MPI RMA has several drawbacks for certain usage patterns and leads to inefficient executions: First, MPI RMA operates at process scope, i.e., communication and synchronization operations always address the whole process, even if a process consists of multiple threads that communicate independently. This may limit scalability for hybrid programs if multiple threads perform concurrent one-sided communication operations, but waiting for completion can only be done at process scope and not at thread scope. Second, there is currently no way for a target node to be actively notified when its memory is accessed from a remote process, which penalizes producer-consumer pattern implementations.

The seminar thesis should provide an overview of the current MPI RMA specification and discuss its shortcomings for certain usage patterns as presented in the literature. Further, potential extensions for MPI RMA proposed by different researchers should be presented and compared.

Kind of topic: dive-in
Supervisor: Simon Schwitanski

On-the-Fly Data Race Detection in MPI RMA Programs Using Binary Search Trees

The Message Passing Interface (MPI) enables nodes in a cluster to communicate with each other. The classical kind of communication in MPI is two-sided point-to-point communication (P2P): The sending node ("sender") sends the message to a receiving node ("receiver"), which actively waits for the message to arrive. In MPI one-sided communication (MPI RMA), on the other hand, the sending node (called "origin") can directly modify the memory of a target node (called "target") without the target node being involved in the communication. This has the advantage that the target node does not have to wait for the message to arrive, but can instead continue with its computation. In modern MPI implementations, one-sided communication is achieved via Remote Direct Memory Access (RDMA).

Using MPI one-sided communication introduces a new class of errors: Since the origin node can access remote memory directly, a concurrent access to the same memory location from the target node itself (or from another origin node) can lead to data races or memory inconsistencies if no proper synchronization is enforced. RMA-Analyzer is a correctness checking tool that tries to detect such data races on-the-fly. It uses binary search trees to represent the accessed memory regions and to check for memory inconsistencies.
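To illustrate the general idea of tracking accessed memory regions in a binary search tree, the following is a minimal sketch. It is not RMA-Analyzer's actual implementation: the names `Node`, `insert`, and `find_conflict` are hypothetical, the tree is unbalanced, and synchronization (which would normally clear or partition the recorded accesses) is ignored. Accesses are half-open address intervals `[lo, hi)`, and two accesses are treated as conflicting if they overlap and at least one of them is a write.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lo: int                      # start address of the access (inclusive)
    hi: int                      # end address of the access (exclusive)
    is_write: bool               # True for writes (e.g., a put), False for reads
    max_hi: int                  # largest end address in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], lo: int, hi: int, is_write: bool) -> Node:
    """Insert an access into the tree, ordered by start address."""
    if root is None:
        return Node(lo, hi, is_write, hi)
    if lo < root.lo:
        root.left = insert(root.left, lo, hi, is_write)
    else:
        root.right = insert(root.right, lo, hi, is_write)
    root.max_hi = max(root.max_hi, hi)
    return root


def find_conflict(root: Optional[Node], lo: int, hi: int,
                  is_write: bool) -> Optional[Node]:
    """Return a recorded access that overlaps [lo, hi) and conflicts, or None."""
    # No interval in this subtree reaches beyond lo, so nothing can overlap.
    if root is None or root.max_hi <= lo:
        return None
    hit = find_conflict(root.left, lo, hi, is_write)
    if hit is not None:
        return hit
    overlaps = root.lo < hi and lo < root.hi
    if overlaps and (is_write or root.is_write):
        return root
    # All starts in the right subtree are >= root.lo; skip it if root.lo >= hi.
    if root.lo < hi:
        return find_conflict(root.right, lo, hi, is_write)
    return None


# Example: one recorded read of [0, 8) and one recorded write of [16, 24).
tree = None
tree = insert(tree, 0, 8, False)
tree = insert(tree, 16, 24, True)
```

In this example, a new read of `[4, 12)` overlaps only the recorded read and is therefore not reported, whereas a new write to the same range conflicts with it.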

The seminar thesis should present in detail the concepts of RMA-Analyzer and discuss which kinds of data races in MPI RMA it can detect. Further, related approaches in the literature that also try to detect memory inconsistencies in MPI RMA should be identified and presented.

Kind of topic: dive-in
Supervisor: Simon Schwitanski

Computation of Non-Deterministic Message Matchings in MPI

The Message Passing Interface (MPI) is the de facto standard for exchanging data between compute nodes in a cluster using message passing. Typically, the sending node specifies the receiver to which the message should be sent. Similarly, the receiving node explicitly specifies the sender from which it expects a message. On the receiving side, however, it is also possible to wait for a message from any sending node, which is called a "wildcard receive". Wildcard receives and other concepts introduce non-determinism into the MPI message exchange: If two sending nodes simultaneously send a message to the same node, which is performing a wildcard receive, then the messages might be received in any order and might lead to different execution paths in the program. Therefore, certain bugs, in particular deadlocks, might only manifest in rare cases and depend on the message matchings. This is why a testing tool has to consider all possible message matchings to find all feasible execution paths in an MPI program.
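The non-determinism described above can be made concrete with a small sketch that enumerates every order in which a sequence of wildcard receives on one node could match the pending messages. This is only an illustration under simplifying assumptions, not the algorithm of any particular tool: the function name `wildcard_matchings` is hypothetical, all messages are assumed to be addressed to the same receiver with matching tags, and the only constraint modeled is that messages from the same sender are matched in the order in which they were sent.

```python
from typing import Dict, List, Tuple

Match = Tuple[int, str]  # (sender rank, message label)


def wildcard_matchings(sends: Dict[int, List[str]]) -> List[Tuple[Match, ...]]:
    """Enumerate all orders in which wildcard receives can match the messages.

    `sends` maps each sender rank to its messages in send order. Messages
    from the same sender keep their relative order; messages from different
    senders may interleave arbitrarily.
    """
    results: List[Tuple[Match, ...]] = []

    def extend(prefix: List[Match], cursors: Dict[int, int]) -> None:
        exhausted = True
        for rank in sorted(sends):
            i = cursors[rank]
            if i < len(sends[rank]):
                exhausted = False
                advanced = dict(cursors)
                advanced[rank] = i + 1
                extend(prefix + [(rank, sends[rank][i])], advanced)
        if exhausted:
            # Every message has been matched: record this complete ordering.
            results.append(tuple(prefix))

    extend([], {rank: 0 for rank in sends})
    return results


# Two senders, one message each: the receiver may see either order.
two_senders = wildcard_matchings({1: ["a"], 2: ["b"]})

# Sender 1 sends two messages, sender 2 sends one: three interleavings.
three_messages = wildcard_matchings({1: ["a1", "a2"], 2: ["b"]})
```

The number of matchings grows quickly with the number of senders and messages, which is why exhaustively exploring them becomes expensive for larger programs.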

DAMPI is a dynamic verification tool that uses Lamport and vector clocks to record all possible message matchings during the execution of an MPI program. Based on that knowledge, DAMPI can replay the execution and enforce other message matchings (and therefore execution paths) by actively delaying the sending of messages. In addition, it provides heuristics to tackle the path explosion problem. By covering all possible message matchings, DAMPI can detect deadlocks and resource leaks in applications.

The seminar thesis should present the approach of DAMPI to detect and replay possible message matchings in MPI programs. Further, related approaches should be discussed in a literature review and compared with DAMPI regarding scalability and coverage.

Kind of topic: dive-in
Supervisor: Simon Schwitanski

Supervisors & Organization

Tim Cramer
Radita Liem
Fabian Orland
Joachim Protze
Simon Schwitanski
Sandra Wienke

Contact

Dr. rer. nat. Sandra Wienke