Seminar Current Topics in High-Performance Computing

Content

High-performance computing is applied to speed up long-running scientific applications, for instance the simulation of computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different facets: from clusters over (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel programming with, e.g., MPI, OpenMP, or CUDA must be applied.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis and correctness checking of parallel programs, performance modeling, or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the placement of that topic in its overall context. This includes the appropriate preparation of the concepts, approaches, and results of the given topic (also with respect to formalities and the time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to demonstrate independent work by looking beyond the immediate scope of their own topic.

Schedule

This seminar belongs to the area of applied computer science. The topics are assigned during the introductory event. The students then work out their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or during the exam period. Attendance is compulsory for the introductory event and the presentation block.

More information is available in RWTHmoodle.

Registration / Application

Seats for this seminar are distributed exclusively via the global registration process of the computer science department. We appreciate it if you state your interest in HPC, as well as your prior knowledge in HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section of the registration process.

Requirements

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to writing the seminar thesis and presenting it, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of each presentation and its authors, keeping track of the speaking time, and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Attendance of the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English; however, German is also possible.

Topics

A Massively Parallel Infrastructure for Adaptive Multiscale Simulations

When computationally modelling biological (or other soft-matter) systems and processes, one often faces a particular challenge: the phenomena under investigation depend on microscopic details but evolve over much larger, macroscopic length- and time-scales. Multiscale modelling has become increasingly important to bridge this gap. Quite another challenge is to execute such models on current petascale computers with their high levels of parallelism and heterogeneous architectures. A recent answer to these challenges is the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), which couples a macro-scale model spanning micrometer length- and millisecond time-scales with a micro-scale model employing molecular dynamics simulations. MuMMI is a transferable infrastructure designed for scalability and efficient execution on heterogeneous architectures: a central workflow manager simultaneously allocates GPUs and CPUs while robustly handling failures in compute nodes, communication networks, and filesystems.

The thesis should present the MuMMI infrastructure, elucidate its machinery, and compare it with other approaches.

Supervisor
Uliana Alekseeva

Detecting Memory Consistency Errors in MPI One-Sided Applications

The Message Passing Interface (MPI) enables the nodes in a cluster to communicate with each other. The classical kind of communication in MPI is two-sided point-to-point communication: the sending node sends a message to a receiving node, which actively waits for the message to arrive. In MPI one-sided communication, on the other hand, the sending node can directly modify the memory of a target node without the target node being involved in the communication. This has the advantage that the target node does not have to wait for the message to arrive but can instead continue with its computation. In modern MPI implementations, one-sided communication is realized via Remote Direct Memory Access (RDMA).
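
To make this concrete, the following minimal sketch (standard MPI in C, constructed for this description rather than taken from any paper) lets rank 0 write an integer directly into the memory of rank 1; the two fences open and close the access epoch that synchronizes the transfer:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int buf = 0;                      /* memory exposed for remote access */
        MPI_Win win;
        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);            /* open the access epoch */
        if (rank == 0) {
            int value = 42;
            /* write directly into rank 1's window; no receive call needed */
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);            /* close the epoch: the Put is now visible */

        if (rank == 1) printf("received %d\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }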

Using MPI one-sided communication introduces a new class of errors: since the sending node can access remote memory directly, a concurrent access to the same memory location by the target node itself (or by another sending node) can lead to memory inconsistencies if no proper synchronization is enforced. MC-CChecker is a correctness checking tool that tries to detect these kinds of memory inconsistencies, using vector clocks to track causality between memory accesses.
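
As an illustration (again constructed for this description, not taken from the MC-CChecker paper), the following fragment, which reuses the window setup from the sketch above, contains exactly the kind of error such tools target:

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    if (rank == 1) {
        /* ERROR: this local read of 'buf' happens inside the same access
           epoch as rank 0's MPI_Put to the same location, so the two
           accesses race and 'buf' may hold either value. Moving the read
           after the closing fence removes the inconsistency. */
        printf("buf = %d\n", buf);
    }
    MPI_Win_fence(0, win);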

The goal of the seminar thesis is to give a short overview of the different kinds of memory consistency errors that can occur in MPI one-sided communication. Then, the main concepts of the correctness checking tool MC-CChecker should be presented. Further, a literature review of approaches related to MC-CChecker should be given.

Supervisor
Simon Schwitanski

A Survey of Vector Clock Compression Techniques

Vector clocks belong to the class of so-called logical clocks, which are used to track causality in distributed systems. For a system with n processes, each process manages an array of n integers that represents its vector clock. Whenever an event (memory access, synchronization) occurs at process i, the process increments the i-th entry of its locally stored vector clock. If the event is a synchronization event, the vector clocks are exchanged and merged between the participating processes, depending on the kind of synchronization. Based on the vector clock information, we can decide, for any pair (a, b) of events in the system, whether "a happens before b", "b happens before a", or "a is concurrent to b". This information is particularly useful for data race detectors and concurrency bug detection tools in general.
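
The basic machinery fits in a few lines. The following minimal C sketch (illustrative, not taken from a specific paper; a fixed process count N is assumed) shows the tick, merge, and happens-before operations:

    #define N 4                     /* number of processes, fixed for the sketch */

    typedef struct { long c[N]; } vclock;

    /* Local event at process i: tick the own component. */
    void vc_tick(vclock *v, int i) { v->c[i]++; }

    /* On synchronization, the receiver merges the sender's clock:
       component-wise maximum, then a tick for the receive event itself. */
    void vc_merge(vclock *mine, const vclock *other, int i) {
        for (int k = 0; k < N; k++)
            if (other->c[k] > mine->c[k]) mine->c[k] = other->c[k];
        vc_tick(mine, i);
    }

    /* Event a happened before event b iff a's clock is component-wise
       less than or equal to b's, and strictly less in at least one entry. */
    int vc_happens_before(const vclock *a, const vclock *b) {
        int strictly_less = 0;
        for (int k = 0; k < N; k++) {
            if (a->c[k] > b->c[k]) return 0;
            if (a->c[k] < b->c[k]) strictly_less = 1;
        }
        return strictly_less;
    }
    /* If neither vc_happens_before(a, b) nor vc_happens_before(b, a)
       holds, the two events are concurrent. */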

A significant drawback of the vector clock approach is that its size grows linearly with the number of processes in the system. For a system with a large number of processes, vector clocks therefore incur high communication and storage costs. To avoid this overhead, different compression techniques have been proposed in the past. On the one hand, there are approaches that exploit knowledge about the topology of the distributed system or the semantics of the communication protocol to send only parts of the vector clock over the network without losing accuracy. On the other hand, there are approaches proposing alternative ways of encoding the vector clocks, e.g., using prime numbers or a probabilistic data structure such as a Bloom filter.
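
The first family can be sketched compactly: in the differential style of compression (in the spirit of the Singhal-Kshemkalyani technique), a process remembers, per destination, the clock it last sent there and ships only the entries that changed since then. The following sender-side sketch is illustrative; it reuses the vclock type and N from the sketch above, and the bookkeeping names are hypothetical:

    typedef struct { int idx; long val; } vc_entry;

    /* 'last_sent' is this process's per-destination bookkeeping: the clock
       as it was when we last sent to that destination. Returns the number
       of entries written to 'out' (at most N); the receiver applies each
       entry as mine.c[idx] = max(mine.c[idx], val). */
    int vc_diff(const vclock *now, vclock *last_sent, vc_entry *out) {
        int n = 0;
        for (int k = 0; k < N; k++)
            if (now->c[k] != last_sent->c[k]) {
                out[n].idx = k;
                out[n].val = now->c[k];
                n++;
                last_sent->c[k] = now->c[k];   /* update the bookkeeping */
            }
        return n;
    }

If only a few entries change between two messages to the same destination, the transmitted size drops from n integers to a handful of (index, value) pairs.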

The thesis should provide an overview of the different kinds of vector clock compression techniques. This includes a literature review of the most important techniques that have been proposed in the past. Further, the different techniques should be compared, in particular in terms of accuracy as well as storage and communication overhead.

Supervisor
Simon Schwitanski

CIVL: Formal Verification of Parallel Programs

Verification of parallel programs via static analysis is a challenging task: besides the state explosion due to the large number of different execution paths/schedules, another problem is the variety of "dialects" used to write parallel programs: MPI for distributed memory, OpenMP for shared memory, CUDA for GPUs, etc. This requires adapting verification algorithms to the syntax and semantics of each parallel programming model, and it gets even more complex if a combination of parallel programming models ("hybrid programming") is used in a single program. CIVL ("Concurrency Intermediate Verification Language") tackles these problems with CIVL-C, a generic C-based language enriched with generic concurrency constructs. Programs written in any supported concurrency dialect (MPI, OpenMP, CUDA) can be translated to CIVL-C. Verification algorithms and tools based on CIVL-C can then handle a program written in any of these dialects simply by running on its CIVL-C translation. In other words, verifying a new concurrency dialect only requires writing a translator to the CIVL-C language.
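
To give a flavor of the intermediate language, here is a small hedged sketch in the style of CIVL-C; the $spawn/$wait/$assert primitives follow the CIVL documentation as remembered, and the exact syntax may deviate in detail. Two spawned workers update a shared counter without synchronization, and the verifier explores all interleavings, reporting that the assertion can fail:

    int counter = 0;

    void worker() {
        counter = counter + 1;        /* unsynchronized read-modify-write */
    }

    int main(void) {
        $proc p0 = $spawn worker();   /* create two concurrent processes */
        $proc p1 = $spawn worker();
        $wait(p0);
        $wait(p1);
        $assert(counter == 2);        /* fails on interleavings where the
                                         two updates overlap (counter == 1) */
        return 0;
    }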

The thesis should give an overview of the CIVL-C approach and highlight its strengths and weaknesses. Optionally, own experiments regarding the precision of the approach can be performed.

Supervisor
Simon Schwitanski

Instructors

Uliana Alekseeva
Julian Miller
Fabian Orland
Daniel Schürhoff
Simon Schwitanski