Seminar Current Topics in High-Performance Computing

 

Content

High-performance computing is applied to speed up long-running scientific applications, for instance the simulation of computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different facets: from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel programming with, e.g., MPI, OpenMP, or CUDA must be applied.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling, or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the placement of that topic in its overall context. This includes the appropriate preparation of the concepts, approaches, and results of the given topic (also with respect to formalities and the time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to demonstrate independent work by looking beyond the immediate scope of their topic.

Schedule

This seminar belongs to the area of applied computer science. The topics are assigned during the introductory event. The students then work out their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or during the exam period. Attendance is compulsory for the introductory event and the presentation block.

More information is available in RWTHmoodle.

Registration / Application

Seats for this seminar are distributed only via the global registration process of the computer science department. We appreciate it if you state your interest in HPC, as well as your prior knowledge in HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section of the registration process.

Requirements

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of each presentation and its authors, keeping track of the speaker's time, and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Attendance of the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English, but German is also possible.

Topics

A Massively Parallel Infrastructure for Adaptive Multiscale Simulations

When computationally modelling biological (or other soft-matter) systems and processes, one often faces a particular challenge: the phenomena under investigation depend on microscopic details but evolve over much larger, macroscopic length and time scales. Multiscale modelling has become increasingly important to bridge this gap. Quite another challenge is to execute such models on current petascale computers with their high levels of parallelism and heterogeneous architectures. A recent answer to these challenges is the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), which couples a macroscale model spanning micrometer length and millisecond time scales with a microscale model employing molecular dynamics simulations. MuMMI is a transferable infrastructure designed for scalability and efficient execution on heterogeneous architectures: a central workflow manager simultaneously allocates GPUs and CPUs while robustly handling failures in compute nodes, communication networks, and filesystems.

The thesis should present the MuMMI infrastructure, elucidate its machinery, and compare it with other approaches.

Supervisor
Uliana Alekseeva

Data-centric Programming Models for High-Performance Computing

The demand for solving computationally intensive problems, especially problems involving huge datasets, is rising rapidly. To satisfy this demand, computing resources are becoming more specialized and heterogeneous. As a consequence, application developers must invest considerable effort to map their applications efficiently to such hardware architectures. To reduce these efforts and accelerate development, several data-centric programming models have been proposed. These include software abstractions and domain-specific languages that allow for high-level specifications of applications. Furthermore, runtime systems and resource managers have been proposed to ease the mapping to heterogeneous systems.

The goal of this seminar thesis is to provide an overview of the landscape of the data-centric programming models for HPC workloads. This includes a comparison of the models and an investigation into their use cases. If possible, experiments with different models can be carried out.

Supervisor
Julian Miller

Auto-tuning of High-Performance Computing Applications

Auto-tuning is a technique to optimize the performance of applications empirically. It leverages data collected during the execution of an application, which can extend the static analyses of compilers. Furthermore, it allows for machine-dependent optimization and thus supports the performance portability of applications. An auto-tuner typically consists of an application-specific search space that includes the tuning parameters, a cost function to optimize for, and an automatic search algorithm that minimizes the cost function. Designing such an auto-tuner can require extensive knowledge from the domain developer, which is why auto-tuning has typically been used for optimized libraries or kernels in the domain of HPC.
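As a minimal illustration of these three components, the following sketch tunes a single parameter of a hypothetical kernel; the kernel, the parameter values, and all function names are illustrative assumptions, and measured runtime serves as the cost function:

```python
import time

# Hypothetical kernel: a blocked sum whose block size is the tuning parameter.
def blocked_sum(data, block_size):
    total = 0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total

def autotune(kernel, data, search_space):
    """Exhaustive search: evaluate the cost function (measured runtime)
    for every point in the search space and keep the best parameter."""
    best_param, best_cost = None, float("inf")
    for param in search_space:
        t0 = time.perf_counter()
        kernel(data, param)
        cost = time.perf_counter() - t0
        if cost < best_cost:
            best_param, best_cost = param, cost
    return best_param

best = autotune(blocked_sum, list(range(100_000)), [64, 256, 1024, 4096])
```

Real auto-tuning frameworks replace the exhaustive loop with smarter search algorithms (e.g., model-based or evolutionary search) and average over several runs to reduce measurement noise.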

This seminar thesis shall analyze state-of-the-art auto-tuning frameworks for HPC applications and investigate the use of general auto-tuning frameworks. If possible, experiments with the investigated auto-tuning frameworks can be carried out.

Supervisor
Julian Miller

Pattern-based Languages for High-Performance Computing

Parallel programming is a challenging and time-consuming task. To improve the development process, best practices in the form of parallel design patterns are leveraged. They provide template solutions for commonly occurring problems such as map and reduce operations. Several approaches for integrating these design patterns into the development process have been proposed, including language extensions, parallel programming models, intermediate representations, and development processes to find and explore parallelism.
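To make the map and reduce patterns concrete, here is a minimal sketch in Python (chosen purely for illustration; pattern-based HPC languages express the same structure with their own constructs):

```python
from functools import reduce
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

data = range(8)
# Map pattern: apply the same function independently to every element.
with ThreadPoolExecutor(max_workers=4) as pool:
    mapped = list(pool.map(square, data))
# Reduce pattern: combine all partial results with an associative operation.
result = reduce(lambda a, b: a + b, mapped)
```

Because the map step has no data dependencies and the reduction operator is associative, a pattern-aware compiler or runtime can parallelize both steps without further analysis.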

The goal of this seminar thesis is to investigate the different pattern-based languages for HPC applications. This includes a comparison of the models and an investigation into their use cases. If possible, experiments with different models can be carried out.

Supervisor
Julian Miller

Evolution of Parallel Matrix-Matrix Multiplication

One of the most fundamental operations in scientific computations involving linear algebra is the matrix-matrix multiplication. Since its computational complexity scales cubically with the problem size, it likely accounts for a significant portion of the runtime of an HPC application. Hence, optimizing this operation is of crucial interest to speed up the broad variety of applications that build upon it. Traditional algorithms decompose the matrix data and map it onto a grid of processors. Recently, a new algorithm called COSMA was developed that optimizes the matrix-matrix multiplication by minimizing data transfers.
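The decomposition idea can be sketched in a few lines: the serial blocked multiplication below operates on square tiles of the matrices, and parallel algorithms distribute exactly such tiles across a processor grid (the function name and block size are illustrative choices):

```python
def matmul_blocked(A, B, n, bs):
    """Multiply two n x n matrices (lists of lists) tile by tile."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):               # row tiles of C
        for jj in range(0, n, bs):           # column tiles of C
            for kk in range(0, n, bs):       # accumulate tile products
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```

In a distributed setting, each process owns a subset of these tiles; the communication volume then depends on how the three loop dimensions are split across processes, which is precisely the quantity that communication-minimizing algorithms such as COSMA optimize.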

The thesis should present the evolution of parallel matrix-matrix multiplication algorithms by giving an overview of different algorithms, concluding with the COSMA algorithm.

Supervisor
Fabian Orland

AI in HPC applications: Learning to simulate complex physical models

Data-driven approaches such as machine learning are becoming increasingly popular across different disciplines. In computer vision, for example, neural networks have been successfully trained to recognize humans in an image or video stream, to be used in self-driving cars. Moreover, neural networks are able to translate text from one language to another.

In HPC-related software these techniques are not yet widely used. HPC applications simulating complex physics need to produce results with high accuracy, while machine learning techniques naturally introduce a certain error when learning a model. These methods will probably not fully replace scientific simulation approaches that have matured over many decades, but they can complement and support them. They are able to produce a preview of a realistic physics simulation in a small fraction of the time of a full-fledged simulation run, which can be used for prototyping or for exploring different parameters of physical models. Machine learning can also be used to investigate cases where no solution strategy, equation, or model is known yet.

Recently, convolutional neural networks were successfully applied to accelerate an Eulerian fluid simulation by replacing the very expensive "pressure projection" step with a trained network. Furthermore, another work proposed a "physics-informed neural network" that is able to accurately learn partial differential equations describing complex physics.

In this thesis the student should present an overview of the current state of the art in using machine learning techniques in HPC applications to learn complex physical models. To this end, the student should start by investigating "physics-informed neural networks" and then explore other approaches. The student should present current approaches from different fields such as computational fluid dynamics (CFD), molecular dynamics (MD), density functional theory (DFT), or others. Finally, the student should also discuss problems and limitations of each presented approach.

Supervisor
Fabian Orland

Detecting Memory Consistency Errors in MPI One-Sided Applications

The Message Passing Interface (MPI) enables nodes in a cluster to communicate with each other. The classical kind of communication in MPI is two-sided point-to-point communication: the sending node sends a message to a receiving node, which actively waits for the message to arrive. In MPI one-sided communication, on the other hand, the sending node can directly modify the memory of a target node without the target node being involved in the communication. This has the advantage that the target node does not have to wait for a message to arrive; instead, it can continue with its computation. In modern MPI implementations, one-sided communication is achieved via Remote Direct Memory Access (RDMA).

Using MPI one-sided communication introduces a new error class: since the sending node can access remote memory directly, concurrent access to the same memory location from the target node itself (or from another sending node) can lead to memory inconsistencies if no proper synchronization is enforced. MC-CChecker is a correctness checking tool that tries to detect these kinds of memory inconsistencies using vector clocks to track causality between memory accesses.

The goal of the seminar thesis is to give a short overview of the different kinds of memory consistency errors that can occur in MPI one-sided communication. Then, the main concepts of the correctness checking tool MC-CChecker should be presented. Further, a literature review of approaches related to MC-CChecker should be included.

Supervisor
Simon Schwitanski

A Survey of Vector Clock Compression Techniques

Vector clocks belong to the class of so-called logical clocks used to track causality in distributed systems. In a system with n processes, each process manages an array of n integers that represents its vector clock. Whenever an event (memory access, synchronization) occurs at process i, the process increments the i-th entry of its locally stored vector clock. If the event is a synchronization event, the vector clocks are exchanged and merged between the participating processes, depending on the kind of synchronization. Based on the vector clock information, we can determine for any pair (a, b) of events in the system whether "a happens before b", "b happens before a", or "a is concurrent to b". This information is particularly useful for data race detectors and concurrency bug detection tools in general.
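The mechanics described above can be captured in a few lines of Python. This is an illustrative single-process sketch in which events are triggered manually, not a distributed implementation; all names are assumptions for this example:

```python
n = 3                                   # number of processes
clocks = [[0] * n for _ in range(n)]    # one n-entry vector clock per process

def local_event(i):
    clocks[i][i] += 1                   # increment own entry on every event

def send(i):
    local_event(i)
    return list(clocks[i])              # timestamp piggybacked on the message

def receive(j, timestamp):
    # Synchronization: merge via component-wise maximum, then count the
    # receive itself as an event.
    clocks[j] = [max(a, b) for a, b in zip(clocks[j], timestamp)]
    local_event(j)

def happens_before(a, b):
    # a -> b iff a <= b component-wise and a != b; if neither a -> b
    # nor b -> a holds, the two events are concurrent.
    return all(x <= y for x, y in zip(a, b)) and a != b
```

For example, after `receive(1, send(0))`, the timestamp of the send happens before the resulting clock on process 1, while an independent event on process 2 is concurrent to both.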

A significant drawback of the vector clock approach is that its size grows linearly with the number of processes in the system. In a system with a large number of processes, vector clocks thus lead to high communication and storage costs. To avoid this overhead, different compression techniques have been proposed. On the one hand, there are approaches that exploit knowledge about the topology of the distributed system or the semantics of the communication protocol to send only parts of the vector clock over the network without losing accuracy. On the other hand, there are approaches proposing alternative encodings of vector clocks, e.g., using prime numbers or a probabilistic data structure such as a Bloom filter.
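As one concrete instance of the first family, the sketch below transmits only the vector clock entries that changed since the sender's last message to the same destination, a simplification of the differential technique by Singhal and Kshemkalyani (function names are illustrative):

```python
def compress(clock, last_sent):
    """Return only the (index, value) pairs that changed since the last
    message to this destination."""
    return {i: v for i, (v, old) in enumerate(zip(clock, last_sent)) if v != old}

def decompress(receiver_clock, delta):
    """Merge the received entries via component-wise maximum."""
    merged = list(receiver_clock)
    for i, v in delta.items():
        merged[i] = max(merged[i], v)
    return merged
```

This preserves full accuracy as long as channels are FIFO, at the price of the sender storing one `last_sent` copy per destination.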

The thesis should provide an overview of the different kinds of vector clock compression techniques. This includes a literature review of the most important techniques that have been proposed in the past. Further, the different techniques should be compared, in particular in terms of accuracy as well as storage and communication overhead.

Supervisor
Simon Schwitanski

CIVL: Formal Verification of Parallel Programs

Verification of parallel programs via static analysis is a challenging task: besides the state explosion due to the large number of different execution paths and schedules, another problem is the variety of "dialects" used to write parallel programs: MPI for distributed memory, OpenMP for shared memory, CUDA for GPUs, etc. This requires adapting verification algorithms to the syntax and semantics of each parallel programming model, and it gets even more complex if a combination of parallel programming models ("hybrid programming") is used in a single program. CIVL-C ("Concurrency Intermediate Verification Language") tackles these problems: it is a generic C language enriched with generic concurrency constructs. Programs written in any concurrency dialect (MPI, OpenMP, CUDA) can be translated to CIVL-C, so verification algorithms and tools based on CIVL-C can handle any of these dialects by running on the translated program. In other words, verifying a new concurrency dialect only requires writing a translator to the CIVL-C language.

The thesis should give an overview of the CIVL-C approach and highlight its strengths and weaknesses. Optionally, own experiments regarding precision of the approach can be performed.

Supervisor
Simon Schwitanski

Instructors

Uliana Alekseeva
Julian Miller
Fabian Orland
Daniel Schürhoff
Simon Schwitanski