Topics
An Overview of OpenMP Target Device Offloading to FPGAs
OpenMP is the de-facto standard for shared-memory parallel programming. However, with the introduction of target device offloading, it also supports arbitrary accelerator devices. Convenience, acceptance, (performance) portability, and usability require the broad availability of corresponding implementations for all kinds of accelerator devices such as GPUs, FPGAs, or vector engines. Because hardware synthesis for FPGAs is complex and time-consuming, the design of OpenMP-to-FPGA compilers is challenging. Although several groups have developed different prototypes, this field is still in a proof-of-concept state.
In this thesis, the student should give an overview of existing approaches for OpenMP Target Device Offloading to FPGAs. This includes an in-depth analysis of the different techniques and a discussion of the opportunities and limitations of the different approaches. This discussion should include a perspective on a generic device offloading framework.
Kind of topic: overview
Supervisor: Tim Cramer
Burst Buffer: Use Cases, Evaluation, and Optimization (English only)
A burst buffer is an intermediate storage layer between the parallel file system and the compute nodes of modern HPC systems. It has gained popularity in recent years as a solution for data-heavy modern applications.
In this seminar, the student will explore use cases for burst buffers, characterize their performance patterns, and evaluate optimization strategies for using them.
Kind of topic: overview
Supervisor: Radita Liem
Managing and Optimizing Metadata for Data-heavy Applications in HPC (English only)
Metadata performance is crucial not only for interactive users but also for data-heavy applications.
In this seminar, we want to understand which factors affect metadata performance and how it contributes to overall application performance, along with strategies for optimization.
Kind of topic: overview
Supervisor: Radita Liem
An Overview of Performance Models for Coupled HPC + ML Applications
Driven by the recent advances in artificial intelligence, especially machine learning and deep learning, new use cases coupling highly parallel simulations with machine learning models are currently emerging. Such a coupled HPC + ML application requires a heterogeneous architecture combining pure CPU compute nodes with nodes equipped with one or more GPU devices. Performance analysis for these new kinds of HPC applications is challenging and a topic of current research because multiple different aspects have to be considered. For the communication of data between CPU and GPU nodes, a network model (e.g., LogP) is needed. Moving the data from host to device memory (and vice versa) requires modeling the data transfer over the PCIe bus. Finally, the performance of the actual neural network inference depends on the performance of the individual operators the neural network consists of.
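The LogP model mentioned above can be made concrete with a few lines of Python. This is a minimal sketch of the model's basic cost terms; the parameter values below are illustrative, not measurements of any real system:

```python
# Conceptual sketch of the LogP model: the cost of sending small messages is
# expressed with four machine parameters.
#   L: network latency
#   o: per-message CPU overhead (paid on both send and receive)
#   g: gap, i.e., minimum spacing between message injections (inverse bandwidth)
# (The fourth parameter, P, the number of processors, does not appear in these
# two cost terms.)

def logp_point_to_point(L, o):
    """Time for one small message: send overhead + latency + receive overhead."""
    return o + L + o

def logp_n_messages(n, L, o, g):
    """Time until the n-th message arrives when one sender injects n messages
    back to back: successive injections are spaced by max(o, g)."""
    return (n - 1) * max(o, g) + logp_point_to_point(L, o)

# Illustrative values (e.g., microseconds):
L, o, g = 5.0, 1.0, 2.0
print(logp_point_to_point(L, o))    # 7.0
print(logp_n_messages(4, L, o, g))  # 3 * 2.0 + 7.0 = 13.0
```

A model for a coupled HPC + ML application would combine such a network term with PCIe transfer and operator-level inference terms as described above.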
In this thesis, the student should give an overview of existing performance models that cover the different aspects that need to be modeled for a coupled HPC + ML application. For each model, the student should elaborate on its usability in the context of coupled HPC + ML applications by discussing the model's limitations.
Kind of topic: overview
Supervisor: Fabian Orland
Coupling Machine Learning with Large Eddy Simulations of Combustion
In the field of energy conversion, an active topic of research is the investigation of alternative fuels (e.g., hydrogen) to be used in combustion processes. From a physical perspective, a combustion process constitutes a turbulent reactive flow problem. Accurate numerical solutions can be obtained by performing Direct Numerical Simulation (DNS) to solve the governing equations. In practice this is not feasible, as very high spatial and temporal resolutions are required; thus, a Large Eddy Simulation (LES) is employed. To still capture numerical effects of the smallest length scales, subgrid-scale (SGS) models are needed.
In recent works, subgrid-scale models have been successfully learned by artificial and convolutional neural networks to predict, for example, reaction rates and the variance of the progress variable. In this thesis, the student should present an overview of the different ideas of how machine learning models can replace SGS models in Large Eddy Simulations.
Kind of topic: dive-in
Supervisor: Fabian Orland
Novel Hardware Architectures for Accelerating Machine Learning Applications
Training and inference of machine learning models involve highly parallel matrix operations that are well suited for execution on accelerator devices such as GPGPUs. Even though GPUs are designed as general-purpose devices, vendors such as Nvidia have implemented hardware features like tensor cores specifically targeting the acceleration of machine learning operations. The Japanese vendor NEC offers an accelerator device called SX-Aurora TSUBASA, providing a huge memory bandwidth paired with large vector cores. Google has also developed its own Tensor Processing Units (TPUs), specifically designed to support its machine learning framework TensorFlow. Other highly specialized architectures are the Intelligence Processing Unit (IPU) developed by Graphcore and the Wafer Scale Engine developed by Cerebras.
In this thesis, the student should first give an overview of existing processor architectures used to accelerate machine learning operations. Furthermore, the student should compare these architectures and discuss their performance, cost, energy consumption, and other aspects.
Kind of topic: overview
Supervisor: Fabian Orland
Fault-Tolerant MPI
Ever-increasing node counts in HPC systems lead to a mean time between failures shorter than typical job durations. Besides checkpoint/restart strategies, the HPC community also considers fault tolerance in MPI jobs as a possible mitigation.
The seminar thesis might look into applications using such strategies, or provide an overview of MPI extensions targeting fault tolerance support.
Kind of topic: overview
Supervisor: Joachim Protze
Static Analysis of MPI Communication Correctness
Static analysis tools can help to detect issues in MPI communication patterns. In some cases, static analysis tools can detect issues that dynamic analysis cannot. On the other hand, false-positive reports from static analysis tools can be overwhelming.
The seminar thesis will look into the static analysis capabilities of PARCOACH as a well-known representative of static analysis tools for MPI.
Kind of topic: dive-in
Supervisor: Joachim Protze
Frameworks for Evaluating MPI Verification Tools
Dynamic and static analysis tools are used to evaluate the correctness of parallel applications. But how reliable are those tools? For example, which parts of the MPI standard does a tool cover? Just recently, several benchmark suites were published to assess the coverage of MPI correctness analysis tools.
The seminar thesis will compare the benchmarks regarding coverage of the MPI standard and the expressiveness of their evaluation results.
Kind of topic: dive-in
Supervisor: Joachim Protze
Improving the Efficiency of MPI One-Sided Communication
The Message Passing Interface (MPI) enables nodes in a cluster to communicate with each other. The classical kind of communication in MPI is two-sided point-to-point communication (P2P): the sending node ("sender") sends the message to a receiving node ("receiver"), which actively waits for the message to arrive. In MPI one-sided communication (MPI RMA), on the other hand, the sending node (called "origin") can directly modify the memory of a target node (called "target") without the target node being involved in the communication. This has the advantage that the target node does not have to wait for the message to arrive but can instead continue with its computation. In modern MPI implementations, one-sided communication is achieved via Remote Direct Memory Access (RDMA).
The current interface specification of MPI RMA has several drawbacks for certain usage patterns and leads to inefficient executions: First, MPI RMA operates at process scope, i.e., communication and synchronization operations always address the whole process, even if a process consists of multiple threads that communicate independently. This may limit scalability for hybrid programs if multiple threads perform concurrent one-sided communication operations, but waiting for completion can only be done at process scope and not at thread scope. Second, there is currently no way for a target node to be actively notified when its memory is accessed from a remote process, which penalizes producer-consumer pattern implementations.
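The producer-consumer drawback can be illustrated with a conceptual sketch in plain Python threads. This is an analogy, not actual MPI: the "origin" thread writes directly into the "target's" buffer, and because there is no active target-side notification, the target must busy-poll a completion flag:

```python
# Toy analogy for the missing-notification problem in MPI RMA (plain Python
# threads, not MPI): the origin writes directly into the target's exposed
# buffer, and the target can only learn of the arrival by polling.
import threading
import time

window = [0] * 4   # stands in for the target's exposed RMA window
flag = [False]     # completion flag the target has to poll

def origin():
    for i in range(4):
        window[i] = i + 1  # analogous to an MPI_Put: direct remote write
    flag[0] = True         # origin signals completion via another flag write

def target():
    while not flag[0]:     # busy polling wastes target-side cycles --
        time.sleep(0.001)  # an active notification would avoid this loop
    return sum(window)

t = threading.Thread(target=origin)
t.start()
result = target()
t.join()
print(result)  # 10
```

With a notified-access extension, as proposed in the literature, the polling loop would be replaced by a blocking wait on a notification, freeing the target until the data has actually arrived.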
The seminar thesis should provide an overview of the current MPI RMA specification and discuss its shortcomings for certain usage patterns as presented in the literature. Further, potential extensions for MPI RMA proposed by different researchers should be presented and compared.
Kind of topic: dive-in
Supervisor: Simon Schwitanski
On-the-Fly Data Race Detection in MPI RMA Programs Using Binary Search Trees
As introduced in the previous topic, MPI one-sided communication (MPI RMA) allows an origin node to directly modify the memory of a target node without the target node being involved in the communication; in modern MPI implementations, this is achieved via Remote Direct Memory Access (RDMA).
Using MPI one-sided communication introduces a new error class: since the sending node can access remote memory directly, a concurrent access to the same memory location from the target node itself (or another origin node) can lead to data races or memory inconsistencies if no proper synchronization is enforced. RMA-Analyzer is a correctness checking tool that tries to detect such data races on the fly. It uses binary search trees to represent the accessed memory regions and to check for memory inconsistencies.
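The core idea can be sketched in a few lines of Python. This is a simplified illustration of interval-based race checking, not RMA-Analyzer's actual implementation: accesses are stored as half-open address intervals, and two accesses conflict if their intervals overlap and at least one is a write. A real tool keeps the intervals in a balanced binary search tree for fast lookup; a sorted list stands in here:

```python
# Simplified sketch of interval-based data race checking (illustrative only,
# not RMA-Analyzer's real data structure): each access is a [start, end)
# address interval plus a read/write flag. A newly recorded access races with
# an earlier one if the intervals overlap and at least one access is a write
# (synchronization epochs are ignored in this toy version).
import bisect

class AccessLog:
    def __init__(self):
        self.accesses = []  # kept sorted by start: (start, end, is_write)

    def record(self, start, end, is_write):
        """Log the access; return True if it conflicts with a logged access."""
        race = any(start < e and s < end and (is_write or w)
                   for s, e, w in self.accesses)
        bisect.insort(self.accesses, (start, end, is_write))
        return race

log = AccessLog()
print(log.record(0, 8, True))     # False: first access, nothing to conflict with
print(log.record(16, 24, False))  # False: disjoint address range
print(log.record(4, 12, False))   # True: read overlaps the write to [0, 8)
```

RMA-Analyzer additionally has to track synchronization (e.g., RMA epochs) so that properly ordered accesses to the same region are not flagged; the sketch above omits that dimension.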
The seminar thesis should present the concepts of RMA-Analyzer in detail and discuss which kinds of data races in MPI RMA it can detect. Further, related approaches in the literature that also try to detect memory inconsistencies in MPI RMA should be identified and presented.
Kind of topic: dive-in
Supervisor: Simon Schwitanski
Computation of Non-Deterministic Message Matchings in MPI
The Message Passing Interface (MPI) is the de-facto standard for exchanging data between compute nodes in a cluster using message passing. Typically, the sending node specifies the receiver to which the message should be sent. Similarly, the receiving node explicitly specifies from which sender it expects a message. However, on the receiving side it is also possible to wait for a message from any sending node, which is called a "wildcard receive". These and other concepts introduce non-determinism into the MPI message exchange: if two sending nodes simultaneously send a message to the same node performing a wildcard receive, the messages might be received in any order and may lead to different execution paths in the program. Therefore, certain bugs, in particular deadlocks, might only manifest in rare cases, depending on the message matchings. This is why a testing tool has to consider all possible message matchings to find all feasible execution paths in an MPI program.
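A toy illustration (plain Python, not MPI) of why wildcard receives force a tool to explore multiple paths: two senders race toward one wildcard receiver, and enumerating the possible delivery orders shows that the same program takes different branches depending on which message matches first:

```python
# Toy model of wildcard-receive non-determinism: the receiver branches on
# whichever of two racing messages is matched first, so a verification tool
# must enumerate every delivery order to cover all execution paths.
from itertools import permutations

messages = [("rank1", "A"), ("rank2", "B")]

def run(order):
    # The receiver's behavior depends on the first matched sender.
    first_sender, _payload = order[0]
    return "path-1" if first_sender == "rank1" else "path-2"

paths = {run(order) for order in permutations(messages)}
print(sorted(paths))  # ['path-1', 'path-2']: both orders are feasible
```

With n racing senders there are up to n! matchings, which is the path explosion problem the heuristics mentioned below address.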
DAMPI is a dynamic verification tool that uses Lamport clocks and vector clocks to record all possible message matchings during the execution of an MPI program. Based on that knowledge, DAMPI can replay the execution and enforce other message matchings (and therefore other execution paths) by actively delaying the sending of messages. In addition, it provides heuristics to tackle the path explosion problem. By covering all possible message matchings, DAMPI can detect deadlocks and resource leaks in applications.
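The vector-clock bookkeeping such tools build on can be sketched briefly. This is a simplified, hypothetical illustration (not DAMPI's implementation): each process keeps a clock vector, ticks its own entry on an event, and merges clocks on receipt; two events whose clocks are incomparable are concurrent, and concurrent sends to the same wildcard receive are exactly the matchings that could resolve differently:

```python
# Simplified vector-clock sketch (illustrative, not DAMPI's actual code):
# clocks are lists indexed by rank; incomparable clocks mean concurrency.

def increment(clock, rank):
    clock = list(clock)
    clock[rank] += 1  # tick own component on a local event (e.g., a send)
    return clock

def merge(local, received, rank):
    """On receive: component-wise max with the piggybacked clock, then tick."""
    merged = [max(a, b) for a, b in zip(local, received)]
    return increment(merged, rank)

def happens_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    return not happens_before(a, b) and not happens_before(b, a)

# Ranks 1 and 2 each send to rank 0 without synchronizing with each other:
send1 = increment([0, 0, 0], 1)  # [0, 1, 0]
send2 = increment([0, 0, 0], 2)  # [0, 0, 1]
print(concurrent(send1, send2))  # True: either message may match first
```

Because the two send events are concurrent, a replay engine must also explore the run in which the other message matches the wildcard receive first.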
The seminar thesis should present the approach of DAMPI to detect and replay possible message matchings in MPI programs. Further, related approaches should be discussed in a literature review and compared with DAMPI regarding scalability and coverage.
Kind of topic: dive-in
Supervisor: Simon Schwitanski
Supervisors & Organization
Tim Cramer
Radita Liem
Fabian Orland
Joachim Protze
Simon Schwitanski
Sandra Wienke