Seminar Current Topics in High-Performance Computing

 

Content

High-performance computing is applied to speed up long-running scientific applications, for instance simulations in computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but they come in different forms, ranging from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel programming with, e.g., MPI, OpenMP, or CUDA must be applied.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the classification of the topic in the overall context. This includes the appropriate preparation of concepts, approaches, and results of the given topic (also with respect to formalities and time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to demonstrate independent work by looking beyond the immediate scope of their assigned topic.

Schedule

This seminar belongs to the area of applied computer science. The topics are assigned during the introductory event. The students then work out their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or in the exam period. Attendance is compulsory for the introductory event and the presentation block.

More information is available in RWTHmoodle.

Registration/Application

Seats for this seminar are distributed by the global registration process of the computer science department only. We appreciate it if you state your interest in HPC, as well as your prior knowledge in HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section during the registration process.

Requisites

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of the presentation and its authors, keeping track of the speaker time and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Attendance of the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English, but German is also possible.

Topics

Detection of Dependencies between Non-Sibling OpenMP Tasks

The advent of the multicore era led to the duplication of functional units through an increasing number of cores. To exploit those processors, a shared-memory parallel programming model is one possible direction. OpenMP is a good candidate because it enables different paradigms: data parallelism (including loop-based directives) and control parallelism, through the notion of tasks with dependencies. However, it is the programmer's responsibility to ensure that the data dependencies are complete so that no data races can occur. In the context of nested tasks, it can be difficult to guarantee that all dependencies have been expressed correctly and that no issue will occur. This paper proposes an algorithm to detect data dependencies that might be missing from the OpenMP task clauses between tasks that have been generated by different parents. The approach is implemented in a tool that relies on the OMPT interface.

The focus of this thesis is a comparison with other relevant approaches, reproducing the measurements, and applying the tool to other benchmarks.

Supervisor
Joachim Protze

Exposition, clarification, and expansion of MPI semantic terms and conventions: is a nonblocking MPI function permitted to block?

To get the last bit of performance from MPI communication, a clear understanding of the meaning of the key terminology has proven essential, and, surprisingly, important concepts remain underspecified, ambiguous and, in some cases, inconsistent and/or conflicting despite 26 years of standardization. The starting paper addresses these concerns comprehensively and informs MPI developers, implementors, those teaching and learning MPI, and power users alike about key aspects of existing conventions, syntax, and semantics.

The focus of this thesis is a deep dive into the semantics of nonblocking MPI communication.

Supervisor
Joachim Protze

Multi-threaded MPI communication

The Message Passing Interface (MPI) is the prevalent interface in high-performance computing (HPC) for distributed-memory programming. While MPI can be used even on a shared-memory platform, combining MPI with a shared-memory programming paradigm (MPI+X) is becoming more popular. However, MPI is largely thread agnostic and does not offer enough possibilities for threads to take part in communication effectively as first-class citizens. Therefore, several interfaces have been proposed in recent years to remedy the situation, with partitioned communication, proposed by Grant et al., being the latest. This seminar thesis shall explore current and past proposals to have threads actively and efficiently engage in MPI communication.

Supervisor
Marc-André Hermanns

Planning for performance in MPI

Collective communication functions in the Message Passing Interface (MPI) provide an abstract interface to specific data exchange patterns needed in scientific simulation. Because the interface describes only the resulting data distribution and does not mandate a specific implementation, multiple exchange algorithms can be available in the same MPI implementation, and the most efficient one can be chosen for a particular communication scenario. For a long time, only an interface with blocking semantics was available. Non-blocking semantics were added with MPI 3.0, and the upcoming MPI 4.0 (scheduled for release end of 2020) will include persistent non-blocking interfaces, so that the cost of complex algorithm setups can be amortized over time.

This seminar thesis shall explore the possibilities shown in the literature to leverage these interfaces to allow for hardware offloading, communication overlap and other techniques to provide more efficient implementations.

Supervisor
Marc-André Hermanns

Evaluation of Machine Learning Methods for Scientific Software’s I/O Performance

Machine learning is a state-of-the-art method for modeling complex problems that are affected by a large number of variables. I/O performance is one such problem: it is difficult to model and predict due to various factors, including the measurement method, filesystem choice, cluster workload, and the I/O pattern of the application itself. A good understanding of how machine learning can model I/O performance is a first step toward providing an I/O optimization solution.

In this thesis, you need to examine different machine learning methods and provide insights into which method is suitable for which I/O pattern.

Supervisor
Radita Liem

Evaluation of I/O-Related Metrics in Performance Measurement Tools

Assessing I/O performance in a large-scale system is a non-trivial task, and several tools exist for this purpose. Each I/O performance tool has its own assessment method. A seemingly similar metric can yield different results due to various factors, including the measurement method, filesystem choice, cluster workload, and the I/O pattern of the application itself. A good overview of these tools is valuable for application developers and analysts who work on I/O optimization.

In this thesis, you need to examine different tools that can give insights into I/O performance in large-scale systems, and identify the tools' characteristics and the metrics they provide that are meaningful for users who want to optimize I/O performance.

Supervisor
Radita Liem

Performance Tuning for Power-Capped Processors

Supercomputing centers as well as data centers can be overprovisioned in terms of power consumption. A large amount of hardware is acquired in order to complete tasks in boost mode. In this case, the peak power draw of the center can exceed the capacity of the power supply. Therefore, computing units, especially the processors, need to be power capped to avoid physical damage to their environment.

To meet a power cap, the processors need to run at lower clock frequencies, which reduces computational performance. However, the system's default power-capping strategy is suboptimal because it reduces performance more than necessary. A more advanced optimization strategy is to be investigated in this seminar thesis.

Supervisor
Bo Wang

Employing the Roofline Model for Energy Optimization

A large-scale HPC cluster has a high power draw, and energy is among its most important cost factors. To reduce this cost as well as CO2 emissions, an efficient energy strategy is sought that strikes a compromise between energy consumption and computational performance.

The roofline model is an easy-to-understand, visual model mainly employed for performance tuning. At the same time, it exposes facts that are vital for energy optimization. In this work, you are expected to study the classical roofline model and extend it for energy optimization.
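As a starting point, the classical roofline bound can be sketched in a few lines: attainable performance is the minimum of the peak compute rate and the product of arithmetic intensity and memory bandwidth. The function names and the machine numbers below are assumptions chosen for illustration only; they do not describe a real system, and the sketch does not include the energy extension that is the subject of this topic.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Classical roofline bound.

    intensity     -- arithmetic intensity of the kernel in flop/byte
    peak_gflops   -- peak compute rate of the machine in GFLOP/s
    bandwidth_gbs -- peak memory bandwidth of the machine in GB/s
    """
    return min(peak_gflops, intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Intensity (flop/byte) at which a kernel turns from memory- to compute-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine parameters, for illustration only.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
    for intensity in (0.5, 10.0, 50.0):
        bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
        regime = "memory-bound" if intensity < ridge else "compute-bound"
        print(f"intensity {intensity:5.1f} flop/byte -> {bound:7.1f} GFLOP/s ({regime})")
```

Kernels to the left of the ridge point are limited by memory bandwidth, kernels to the right by the peak compute rate.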

Supervisor
Bo Wang

The Future of Vector Computing: Performance Evaluation of New Approaches

In High Performance Computing (HPC), vector architectures have quite a long tradition. Although the emergence of x86 commodity clusters interrupted this trend for a while, we see a renaissance of SIMD capabilities in all modern architectures. Not only the success of general GPU computation during the last decade, but also the trend to longer SIMD registers in common CPUs contributes to this development. New kinds of accelerators like the vector engine SX-Aurora TSUBASA underline the importance of vector computing for modern processor designs.

This seminar article and talk are expected to give a detailed overview of the principles of vector computing and how it is implemented in the SX-Aurora TSUBASA, including the hardware and the execution model. Furthermore, they are expected to discuss existing performance evaluations and compare the results with other architectures with vector capabilities (e.g., Intel Xeon or Nvidia GPUs). With respect to metrics like the bytes/flop ratio of the given codes, the seminar candidate should give an overview of which code characteristics are promising for such a new architecture.

Supervisor
Tim Cramer

Data-Based Anomaly and Failure Detection in Complex HPC Systems

The increasing number of different components in HPC systems leads to an increasing number of failures, which threatens the continuous operation of these systems. Consequently, detecting failures as early as possible is essential in order to avoid breakdowns. For the detection of anomalies and failures, different data-based approaches (e.g., statistical vicinities, data analytics, temporal/spatial correlations) exist. This seminar article and talk are expected to present different approaches for failure and anomaly detection. Furthermore, a discussion and comparison of these different approaches with respect to their applicability to other clusters is expected.

Supervisor
Tim Cramer

Evolution of Parallel Matrix-Matrix-Multiplication

One of the most fundamental operations used in scientific computations involving linear algebra algorithms is the matrix-matrix-multiplication. Since the computational complexity scales cubically with the problem size, it likely accounts for a significant portion of the runtime of an HPC application. Hence, optimizing this operation is of crucial interest to speed up a broad variety of applications that build upon it. Traditional algorithms decompose the matrix data and map it onto a grid of processors. Most recently, a new algorithm called COSMA has been developed, which optimizes the matrix-matrix-multiplication by minimizing data transfers.
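To make the idea of decomposing matrix data onto a processor grid concrete, here is a minimal sequential sketch: the result matrix is split into grid x grid blocks, and each loop iteration over (pi, pj) stands in for one processor that owns one block of the result. The names `matmul_blocked` and `flop_count` are hypothetical, the code only simulates the decomposition in a single process, and it is not the COSMA algorithm.

```python
def flop_count(n):
    """Floating-point operations of a dense n x n product (one multiply and one add per term)."""
    return 2 * n ** 3


def matmul_blocked(a, b, grid):
    """Multiply square matrices a and b using a grid x grid block decomposition.

    Each (pi, pj) pair plays the role of one processor that owns block
    C[pi][pj] and accumulates the products of blocks A[pi][pk] and B[pk][pj].
    """
    n = len(a)
    if n % grid != 0:
        raise ValueError("matrix size must be divisible by the grid size")
    bs = n // grid  # block size owned by each simulated processor
    c = [[0.0] * n for _ in range(n)]
    for pi in range(grid):
        for pj in range(grid):
            for pk in range(grid):
                # In a distributed setting, blocks A[pi][pk] and B[pk][pj]
                # would have to be communicated to processor (pi, pj) here.
                for i in range(pi * bs, (pi + 1) * bs):
                    for j in range(pj * bs, (pj + 1) * bs):
                        partial = 0.0
                        for k in range(pk * bs, (pk + 1) * bs):
                            partial += a[i][k] * b[k][j]
                        c[i][j] += partial
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_blocked(a, b, grid=2))
    print(flop_count(2), "floating-point operations")
```

The comment inside the `pk` loop marks where data would move between processors; the volume of exactly this traffic is what communication-optimizing algorithms aim to reduce.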

The thesis should present the evolution of parallel matrix-matrix-multiplication algorithms by giving an overview of different algorithms, concluding with the COSMA algorithm.

Supervisor
Fabian Orland

Computational Challenges in the Simulation of Red Blood Cell Flow

Simulating the fluid mechanics of blood flow at the cell scale is of great importance for understanding phenomena like blood clotting. Moreover, it also supports the design of micrometer-scale devices that are able to identify and remove malaria-infected cells. From a computational perspective, these kinds of simulations are very complex to perform. Millions of red blood cells can be found in one microliter of blood. Each red blood cell is highly deformable. The vessels through which red blood cells flow are represented by complex geometries. Finally, the simulation also needs to handle collisions between individual red blood cells. In order to make these simulations feasible, large-scale HPC systems need to be utilized.

The thesis should present the process of simulating the flow of red blood cells through vascular networks while highlighting computational challenges and their parallel solutions.

Supervisor
Fabian Orland

A Parallel Eigensolver for Sequences of Eigenvalue Problems

Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in many areas of computational science. Since the computational effort scales cubically with the system size, the development of efficient parallel algorithms is highly desirable. Until recently, iterative solvers could not compete if the matrix was dense and a considerable part of the spectrum needed to be found. The ChASE library (Chebyshev Accelerated Subspace iteration Eigensolver) can be applied to large dense matrices, and it performs exceptionally well for sequences of Hermitian eigenproblems. It greatly benefits from the sequence’s spectral properties and outperforms direct solvers in many scenarios.

The thesis should present the algorithm of the ChASE library and explain what distinguishes it from other solvers.

Supervisor
Uliana Alekseeva

Instructors

Joachim Protze
Marc-André Hermanns
Radita Liem
Fabian Orland
Tim Cramer
Uliana Alekseeva
Bo Wang
Sandra Wienke