Seminar Current Topics in High-Performance Computing

Content

High-performance computing is applied to speed up long-running scientific applications, for instance the simulation of computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different facets: ranging from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel programming with, e.g., MPI, OpenMP, or CUDA must be applied.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the classification of that topic in the overall context. This includes the appropriate preparation of the concepts, approaches, and results of the given topic (also with respect to formalities and the time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to demonstrate independent work by looking beyond the scope of their assigned topic.

Schedule

This seminar belongs to the area of applied computer science. The topics are assigned during the introductory event. The students then work out their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or during the exam period. Attendance is compulsory for the introductory event and the presentation block.
Furthermore, we will introduce the students to "Scientific Writing" and "Scientific Presenting" in computer science. Attendance at these two events is also compulsory.

The compulsory introductory event (kickoff) is scheduled for March 27th, 2023, 10am - 12pm. Note: This meeting takes place one week before the semester starts! We are sorry for that, but we could not work out another date.
The next compulsory meetings are planned for April 20th, 2:30pm - 4pm, April 25th, 2:30pm - 4pm, and June 20th, 2:30pm - 4pm.

Furthermore, we plan to do the seminar as an in-person event (if regulations and Corona case numbers allow it). That means that you need to be personally present for all compulsory parts of the seminar.

Registration / Application

Seats for this seminar are distributed by the global registration process of the computer science department only. We appreciate it if you state your interest in HPC, as well as your prior knowledge of HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section during the registration process.

Requirements

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes announcing the title of each presentation and its author, keeping track of the speaking time, and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Having attended the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English. However, German is also possible.

Types of Topics

We provide two flavors of seminar topics: (a) overview topics and (b) dive-in topics. They work as the names suggest. Nevertheless, this categorization does not necessarily imply a strict "either-or" but rather provides a guideline for addressing the topic. In general, both types of topics are equally demanding, but they pose different challenges. In the topic list below, you can find the corresponding categorization for each seminar topic.

Topics

Integrating Quantum Computing into HPC

To reach ever more performant systems for HPC use cases, new hardware components and accelerators are steadily introduced to the HPC landscape. They allow more energy-efficient processing of specific workloads compared to CPUs. Therefore, accelerated systems can reach higher performance with less power consumption than CPU-only systems if utilized correctly. While GPUs are already common in new HPC systems, there has recently been a wide variety of research on accelerators such as FPGAs, vector engines, and quantum computers. Quantum computers promise to compute certain workloads orders of magnitude faster than conventional computing systems. Although (theoretical) quantum computing has been an active research field for decades, recent advances in quantum hardware have motivated thinking about how to use quantum computers in HPC.

The seminar thesis should elaborate on the integration of quantum computing into conventional HPC, both system-wise and programming-model-wise. This includes discussing current challenges along with possible solutions. It should also examine methods of integrating quantum computers other than as accelerators. If possible, concrete performance advantages for HPC use cases should be demonstrated.

Kind of topic: overview
Supervisor: Semih Burak

Static Power Consumption Prediction based on Machine Learning Models

HPC has reached a point where significantly higher performance can only be achieved by more energy-efficient computing. This endeavor has led to a trend of introducing a variety of accelerators, e.g., GPUs or FPGAs, that can compute certain workloads not only faster but also with less energy than classical CPUs. While these increasingly heterogeneous HPC systems can be more performant within a given power envelope, this potential is only exploited if parallel workloads are mapped to the hardware units correctly. Correctly means that both performance and power consumption need to be considered, so that an execution with an optimal trade-off between performance and energy is obtained. Such allocation decisions require a power model that can predict power consumption relatively accurately for a given workload and hardware system. Such a model then allows, for example, choosing suitable frequencies or run levels for the hardware units.

This seminar thesis should elaborate on static models that can predict energy consumption without having run the given application even once. Specifically, machine learning models that take certain information from the application code and hardware characteristics into account should be analyzed, and their advantages and disadvantages compared to other static (or dynamic) approaches. If possible, the seminar thesis should sketch how purely static concepts can be used to choose the energy-minimizing run levels for the parallel hardware units.

Kind of topic: overview
Supervisor: Semih Burak

NVIDIA Grace Hopper Systems: Assessing CPU+GPU on a Single Board by Performance and Energy

HPC has reached a point where significantly higher performance can only be achieved by more energy-efficient computing. This endeavor has led to a trend of introducing a variety of accelerators, e.g., GPUs or FPGAs, that can compute certain workloads not only faster but also with less energy than classical CPUs. In current HPC systems, configurations where the CPU and the GPU each have their own memory are becoming popular. The slow link between CPU and GPU memory can often be a performance bottleneck. With the aim of reducing the overhead induced by the necessary data transfers between GPU and CPU memory, new hardware architectures are being introduced. One of them is the new NVIDIA Grace Hopper system, which links CPU and GPU with NVIDIA's fast NVLink on a single board to deliver a coherent CPU+GPU memory model.

This seminar thesis should elaborate on the performance and energy-efficiency advantages and disadvantages of this new system, after analyzing it thoroughly. In particular, a comparison should be made with the conventional CPU+GPU model.

Kind of topic: dive-in
Supervisor: Semih Burak

Partitioned Collective Communication

Partitioned point-to-point communication was introduced with MPI 4.0. It enables applications to indicate and query partial completion of specially initiated communications. It can be used to reduce the number of messages to be matched or to mitigate slight imbalances when individual threads or tasks prepare individual parts of a larger message. After the initial introduction for point-to-point communication in MPI 4.0, work was started to extend the concept to collective communication as well, both for symmetry's sake and to enable new application use cases and optimization potential.
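
For illustration, the following minimal sketch uses the point-to-point flavor already standardized in MPI 4.0; the partition count and sizes are illustrative assumptions. Each partition of the send buffer is marked ready independently, e.g., as soon as a worker thread has filled it:

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int partitions = 4;          // illustrative partition count
    const MPI_Count part_count = 1024; // elements per partition
    std::vector<double> buf(partitions * part_count);

    MPI_Request req = MPI_REQUEST_NULL;
    if (rank == 0) {
        // One matching message, transferred in four partitions.
        MPI_Psend_init(buf.data(), partitions, part_count, MPI_DOUBLE,
                       1, 0, MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        for (int p = 0; p < partitions; ++p) {
            // ... fill partition p of buf (possibly from a worker thread) ...
            MPI_Pready(p, req); // partition p may be transferred now
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Precv_init(buf.data(), partitions, part_count, MPI_DOUBLE,
                       0, 0, MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE); // or poll with MPI_Parrived
    }
    if (req != MPI_REQUEST_NULL)
        MPI_Request_free(&req);
    MPI_Finalize();
}
```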

This seminar thesis should take a close look at the proposed upcoming partitioned collective communication interface and its specific semantics, and discuss application scenarios where the proposed interface can be beneficial.

Kind of topic: dive-in
Supervisor: Marc-André Hermanns

Modern C++ Language Binding for MPI

As the official C++ bindings for the Message Passing Interface (MPI) have been deprecated and removed from the MPI standard, C++ application developers have to fall back on the C bindings to enable MPI communication in their applications. Since modern C++ capabilities and practices in particular have come a long way from the common root with C, application developers have strived to create abstraction libraries that enable the use of MPI functionality within a modern C++ context.
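
As a minimal illustration of the status quo, the following sketch calls the C bindings directly from modern C++; raw pointer, element count, and MPI datatype must all be spelled out by hand, which is exactly the boilerplate such abstraction libraries try to hide:

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> data(100, static_cast<double>(rank));
    if (rank == 0) {
        // The C binding deduces nothing from std::vector's type:
        // pointer, count, and datatype are the caller's responsibility.
        MPI_Send(data.data(), static_cast<int>(data.size()), MPI_DOUBLE,
                 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data.data(), static_cast<int>(data.size()), MPI_DOUBLE,
                 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
}
```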

This seminar thesis should look into the different approaches of these libraries, their similarities and differences, how they enable modern C++ application development, and how far their support for the full MPI functionality has grown so far.

Kind of topic: overview
Supervisor: Marc-André Hermanns

Evaluation of Remote Device Offloading

Most current HPC systems consist of heterogeneous compute nodes equipped with two or more multi-core CPUs as well as several accelerators (e.g., GPUs) that can be used to speed up computations. Especially codes with compute-bound hotspots/kernels that are embarrassingly parallel, or close to it, can profit from offloading the workload to, e.g., a GPU. In recent years, OpenMP, the de-facto standard for shared-memory parallelization, has been extended with device offloading functionality. However, as the workload, complexity, and resolution of scientific and industrial applications are continuously growing, a single compute node in an HPC cluster and its accelerators might not suffice. Recent efforts investigated remote offloading to also distribute work to accelerators on different machines and reduce the time to solution.
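
For reference, node-local OpenMP device offloading of an embarrassingly parallel kernel looks as follows; the remote offloading extensions under study aim to execute such target regions on accelerators of other nodes as well. A minimal sketch:

```cpp
// Offload a vector addition to the default device (e.g., a local GPU).
// Requires a compiler built with OpenMP offloading support; the exact
// flags (e.g., -fopenmp -fopenmp-targets=...) depend on the toolchain.
void vadd(const double *a, const double *b, double *c, int n) {
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```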

In this seminar thesis, the student should first investigate and compare recent approaches for remote offloading. Further, the goal is to evaluate the performance and capabilities of the currently proposed OpenMP remote offloading extensions. If possible and desired, experiments with remote offloading can be carried out on the RWTH cluster.

Kind of topic: overview
Supervisor: Jannis Klinkenberg

Investigating Task-based Programming Frameworks for Distributed Memory Systems

In the past, scientific and industrial applications mainly consisted of well-balanced and regular workloads. Those workloads allowed a straightforward domain decomposition to evenly distribute work across compute nodes, and the use of simple work-sharing techniques to evenly distribute the work across cores/threads within a compute node. Over the last decades, the complexity of and variety in applications has significantly increased. Applications employ recursive and irregular workloads where the work to perform might differ between processing units or can evolve over time, so that classical domain decomposition and work-sharing no longer suffice to efficiently parallelize the application and ensure a proper load balance. Consequently, task-based programming paradigms were introduced - first for shared-memory systems - that allow a more flexible specification of work packages and even dependencies between them. Several emerging frameworks aim to bring the task-based programming paradigm to applications running on distributed-memory systems as well, applying different methodologies and concepts. Examples include Charm++, Legion, Chameleon, and TaskTorrent.

In this seminar thesis, the student should investigate the capabilities and fundamental conceptual differences of various task-based programming solutions for distributed memory. Further, the strengths and weaknesses of the approaches with respect to characteristics like scheduling, load balancing, and dependencies between tasks should be discussed. Finally, some performance measurements for each approach should be presented.

Kind of topic: overview
Supervisor: Jannis Klinkenberg

Efficient Data Mapping in Systems with Heterogeneous Memory

Over the last decades, architectural enhancements in high-performance computing resulted in a constant increase in the computational power of such machines, achieved by increasing core frequencies, improving the manufacturing process, and introducing multi- and many-core CPUs. In contrast, enhancements on the memory side - most systems use classical DRAM - did not keep up with that development, resulting in a growing gap between computational power on the one hand and memory latency and bandwidth on the other. Consequently, several emerging memory technologies intend to close that gap by tackling the shortcomings of DRAM. Examples are High Bandwidth Memory (HBM and HBM2), which provides higher bandwidth but smaller capacity, and Non-Volatile Memory (NVM), which has a larger capacity than DRAM but higher latency and lower bandwidth. Compute nodes of the next generation of supercomputers will most likely be equipped with a heterogeneous memory system combining classical DRAM with one or more of the aforementioned technologies. This poses new questions, e.g., where to place data and when to move data between the different memory types in order to use resources efficiently and reduce power consumption and execution time.
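
One established way to express such placement decisions explicitly is the memkind library. The following sketch, assuming a node that actually offers HBM, places a bandwidth-critical array in high-bandwidth memory and a rarely touched one in DRAM:

```cpp
#include <memkind.h>
#include <cstdlib>

int main() {
    const size_t n = 1 << 20;

    // Bandwidth-critical data: prefer HBM, fall back to DRAM if unavailable.
    double *hot = static_cast<double *>(
        memkind_malloc(MEMKIND_HBW_PREFERRED, n * sizeof(double)));

    // Rarely accessed data: keep it in regular DRAM.
    double *cold = static_cast<double *>(
        memkind_malloc(MEMKIND_DEFAULT, n * sizeof(double)));

    if (!hot || !cold)
        return EXIT_FAILURE;

    // ... compute on hot, occasionally touch cold ...

    memkind_free(MEMKIND_HBW_PREFERRED, hot);
    memkind_free(MEMKIND_DEFAULT, cold);
    return EXIT_SUCCESS;
}
```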

In this seminar thesis, the student should present an overview of approaches for managing the placement and movement of data between different kinds of memory. Further, the seminar thesis should discuss the design choices behind these approaches with respect to the aforementioned questions and illustrate performance results for scientific applications.

Kind of topic: overview
Supervisor: Jannis Klinkenberg

Distributed Network Topologies for Large-Scale ML/DL Workloads (English only)

Large-scale Machine Learning (ML) and Deep Learning (DL) workloads are becoming ubiquitous due to recent advances in artificial intelligence, the availability of large data sets, and improvements in hardware architectures. Nowadays, HPC clusters are fully subscribed with ML/DL jobs. To parallelize such jobs, a number of multi-dimensional communication and distribution strategies have been developed. Understanding why different communication patterns favor certain topologies is essential for efficient computing and optimal utilization of computing resources.
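
As one concrete instance of such a pattern, data-parallel training averages gradients across all workers in every step, typically via an allreduce; how efficiently this maps onto ring, tree, or torus topologies is one of the questions to explore. A minimal MPI sketch (the gradient buffer is an illustrative stand-in):

```cpp
#include <mpi.h>
#include <vector>

// Average local gradients across all workers (one data-parallel step).
void allreduce_gradients(std::vector<float> &grad, MPI_Comm comm) {
    int nworkers;
    MPI_Comm_size(comm, &nworkers);
    MPI_Allreduce(MPI_IN_PLACE, grad.data(), static_cast<int>(grad.size()),
                  MPI_FLOAT, MPI_SUM, comm);
    for (float &g : grad)
        g /= static_cast<float>(nworkers); // sum -> average
}
```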

In this seminar thesis, a student will review distributed network topologies for ML/DL applications, provide an analysis of their performance and discuss the pros and cons of each approach.

Kind of topic: overview
Supervisor: Anara Kozhokanova

Automatic Transformation of Blocking into Non-Blocking MPI Communication

The Message Passing Interface (MPI) is the de-facto standard for message passing and communication between compute nodes. It defines the concepts of blocking and non-blocking communication. In blocking calls, the calling process has to wait until the communication finishes. In contrast, a non-blocking communication consists of two parts, an initiation and a completion call, which define a time window in which the communication has to occur. Between these two calls, the calling process can perform other operations, for example, calculations on buffers unrelated to the communication. As introducing non-blocking communication in existing or new codes comes with new challenges such as race conditions, the usage of non-blocking communication in legacy applications is not widespread. To simplify its usage and support the user, different works have researched the possibility of automatically transforming blocking into non-blocking communication.
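
As a hedged illustration of the transformation such tools aim to perform, the blocking send below is rewritten into its non-blocking counterpart so that independent computation can overlap the communication window (unrelated_work is a hypothetical placeholder):

```cpp
#include <mpi.h>

void unrelated_work(); // hypothetical computation not touching buf

// Before the transformation: blocking send, no overlap.
void send_blocking(const double *buf, int n, int dest, MPI_Comm comm) {
    MPI_Send(buf, n, MPI_DOUBLE, dest, /*tag=*/0, comm);
    unrelated_work();
}

// After the transformation: initiation and completion calls open a window
// in which the unrelated computation overlaps the communication.
void send_nonblocking(const double *buf, int n, int dest, MPI_Comm comm) {
    MPI_Request req;
    MPI_Isend(buf, n, MPI_DOUBLE, dest, /*tag=*/0, comm, &req); // initiation
    unrelated_work(); // buf must not be modified inside this window
    MPI_Wait(&req, MPI_STATUS_IGNORE);                          // completion
}
```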

This seminar thesis should present the most recent approaches for automatic transformation of blocking into non-blocking MPI communication and compare them against previous solutions.

Kind of topic: dive-in
Supervisor: Isa Thärigen

OpenMP Taskloop Dependences

OpenMP is a well-known standard for shared-memory parallel computing. One of the easiest ways to parallelize an application with OpenMP is to use worksharing constructs like omp parallel for, which parallelizes a for-loop for the user. However, for applications with irregular or recursive parallelism, these worksharing constructs are often not applicable. As a result, the OpenMP standard introduced the concept of tasks, which allows the user to specify units of work as tasks that are then scheduled for execution on different threads. OpenMP also adds the possibility to define dependencies between sibling tasks. For example, if task 3 depends on tasks 1 and 2, it cannot be scheduled until both task 1 and task 2 have completed. While tasks are a good way to deal with recursive and irregular parallelism, using them in an application is often more complex and needs more lines of code than a simple worksharing construct would take. Therefore, the OpenMP standard introduced the taskloop directive, which automatically partitions a for-loop into tasks. So far, the taskloop construct does not support task dependencies, but recent works provide first ideas of how this could be added.
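
A minimal sketch of the two existing building blocks this topic starts from: dependencies between sibling tasks via the depend clause, and the (so far dependency-free) taskloop construct:

```cpp
#include <cstdio>

int main() {
    int x = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        // Sibling-task dependencies: task 3 may only run after tasks 1 and 2.
        #pragma omp task depend(out: x)    // task 1
        x = 1;
        #pragma omp task depend(out: y)    // task 2
        y = 2;
        #pragma omp task depend(in: x, y)  // task 3
        std::printf("x + y = %d\n", x + y);

        // taskloop partitions the loop into tasks automatically,
        // but currently offers no way to express dependencies.
        #pragma omp taskloop
        for (int i = 0; i < 1000; ++i) {
            // ... independent loop body ...
        }
    }
    return 0;
}
```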

This seminar thesis should give an overview of the existing variants of task dependencies before discussing the possibilities to extend the taskloop directive to support task dependencies as well.

Kind of topic: dive-in
Supervisor: Isa Thärigen

Visualization of Data Movements and Accesses

The performance analysis of parallel applications grows more challenging every year. The data movements and access patterns of an application are an important part of such analyses, as they can heavily impact the performance. Visualization tools can help to identify such performance issues by providing a visual representation of the data accesses of an application.

This seminar thesis should compare two or more data access visualization approaches against each other and discuss their advantages and disadvantages.

Kind of topic: overview
Supervisor: Isa Thärigen

Supervisors & Organization

Semih Burak
Marc-André Hermanns
Jannis Klinkenberg
Anara Kozhokanova
Isa Thärigen

Contact

Isa Thärigen