Seminar Current Topics in High-Performance Computing

Content

High-performance computing is applied to speed up long-running scientific applications, for instance the simulation of computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different facets: from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel programming with, e.g., MPI, OpenMP or CUDA must be applied.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the classification of the topic in the overall context. This includes the appropriate preparation of concepts, approaches and results of the given topic (also with respect to formalities and time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to demonstrate independent work by looking beyond the immediate scope of their own topic.

Schedule

This seminar belongs to the area of applied computer science. The topics are assigned during the introductory event. The students then work out their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or in the exam period. Attendance is compulsory for the introductory event and the presentation block.
Furthermore, we will introduce the students to "Scientific Writing" and "Scientific Presenting" in computer science. Attendance at these two events is also compulsory.

The compulsory introductory event (kickoff) is scheduled for October 16th, 2023, 9am-11pm. The next compulsory meetings are planned for October 20th, 1:00pm-2:30pm; October 30th, 2:30pm-4:00pm; and December 15th, 2:30pm-4:00pm.

Furthermore, we plan to hold the seminar as an in-person event. This means that you need to be present in person for all compulsory parts of the seminar.

Registration / Application

Seats for this seminar are distributed by the global registration process of the computer science department only. We appreciate it if you state your interest in HPC, as well as your prior knowledge of HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section during the registration process.

Requirements

The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of each presentation and its authors, keeping track of the speaking time, and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

The attendance of the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English; however, German is also possible.

Types of Topics

We provide two flavors of seminar topics depending on the particular topic: (a) overview topics and (b) dive-in topics. The names indicate the intended focus; however, this categorization does not necessarily imply a strict "either-or" but rather provides a guideline for addressing the topic. In general, both types of topics are equally difficult to work on, but they pose different challenges. In the topic list below, you can also find the corresponding categorization for each seminar topic.

Topics

Performance Evaluation of Next-Generation Vector Supercomputers

In High-Performance Computing (HPC), vector architectures have quite a long tradition. Although the emergence of x86 commodity clusters interrupted this trend for a while, we now see a renaissance of SIMD capabilities in all modern architectures. Both the success of general-purpose GPU computing during the last decade and the trend toward longer SIMD registers in common CPUs contribute to this development. Vendors of next-generation vector supercomputers like NEC promise to be competitive with Nvidia or modern ARM architectures in terms of performance and energy efficiency.
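
As a purely illustrative sketch (not part of the seminar material; the function name daxpy is chosen for this example), the following C function shows the kind of data-parallel loop that both the SIMD units of common CPUs and long-vector machines such as the SX-Aurora exploit; the OpenMP simd directive merely asks the compiler to vectorize it.

    /* Illustrative sketch: a vectorizable loop. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        /* One vector instruction processes many elements at once;
         * on long-vector architectures, up to hundreds of elements. */
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }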

This seminar article and talk are expected to give an overview of the principles of vector computing and of how next-generation designs like the SX-Aurora TSUBASA VE30 improve the performance for basic and relevant industry benchmarks. This overview should include a discussion of the improvements in the VE30 architecture (e.g., private 3rd-level cache, new instructions) as well as a comparison to other architectures like Fujitsu A64FX, Nvidia A100 or Intel x86. The seminar candidate has the opportunity to verify the state and the plausibility of the benchmarks on the first generation of SX-Aurora TSUBASA (V10).

Kind of topic: dive-in
Supervisor: Tim Cramer

Analyzing Resource Utilization in Huge HPC Systems

For many domain scientists, computer simulations form (together with theory and experiment) the third pillar of science. Since hardware resources are limited and costly, HPC centers have to ensure the efficient usage of their systems. Furthermore, analyzing and understanding the utilization of the various resources forms the basis for purchasing decisions in future procurements. One challenge here is to combine the multitude of HPC resources and applications into an overall picture.

This seminar article and talk are expected to give an overview of metrics and measurements which help HPC centers to assess the overall utilization of their systems. This includes metrics for CPU, GPU and memory utilization. Furthermore, the thesis has to compare existing case studies from other supercomputers like NERSC's Perlmutter or Oak Ridge's Titan. The seminar candidate may use the performance monitoring system of RWTH's CLAIX supercomputer in order to compare the possibilities with those of other large HPC centers.

Kind of topic: dive-in
Supervisor: Tim Cramer

Strategies for Optimizing OpenMP Target Offloading in Applications

An increasing number of applications adopt OpenMP target offloading to use GPUs for computation. Different optimization techniques may be necessary to work around different performance bottlenecks. These techniques involve the fusion of target kernels as well as the optimization of data allocation and data transfers; a small illustration follows below.
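
As a minimal, purely illustrative sketch (the function name scale_and_add is hypothetical and not taken from any particular application), the following C code shows one such data-transfer optimization: an enclosing target data region keeps the arrays resident on the device across repeated kernel launches instead of transferring them for every target region.

    /* Hypothetical sketch: keep the arrays resident on the GPU across kernels. */
    void scale_and_add(double *a, const double *b, int n, int iters)
    {
        /* Map the data once for the whole loop instead of once per target region. */
        #pragma omp target data map(tofrom: a[0:n]) map(to: b[0:n])
        for (int it = 0; it < iters; ++it) {
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < n; ++i)
                a[i] = 2.0 * a[i] + b[i];
        }
    }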

This seminar thesis will provide an overview of different optimization strategies used for OpenMP target offloading.

Kind of topic: overview
Supervisor: Joachim Jenke

Compiler Techniques for Improving OpenMP Target Offloading

OpenMP target offloading often does not reach the performance of optimized CUDA kernels. The LLVM/Clang compiler only recently introduced techniques like just-in-time compilation and link-time optimization for device code with the goal of improving the performance of OpenMP target offloading.

This seminar thesis will motivate and present the different compiler techniques used to optimize OpenMP target offloading. Furthermore, the thesis should evaluate the impact of these techniques using different OpenMP offloading applications.

Kind of topic: dive-in
Supervisor: Joachim Jenke

Energy Consumption Characterization for HPC Applications (English only)

Energy efficiency is one of the important current topics in HPC due to rising climate concerns. When devising an energy optimization strategy for HPC applications, the strategy is often tailored to specific applications, since HPC applications can have different energy consumption patterns depending on many aspects, such as the computation pattern (compute-bound, memory-bound, or I/O-bound), dependency components, and deployment architecture. Energy characterization is therefore a crucial step in energy optimization work. It requires many considerations, and multiple tools and state-of-the-art practices can be explored to produce a holistic understanding of an application's energy consumption.

In this seminar, students will conduct a literature study to evaluate current methodologies used for energy characterization depending on the type of HPC application and the corresponding infrastructure. Optionally, this seminar also offers a hands-on project to characterize the energy consumption pattern of several applications in the NHR4CES benchmarks. The student can run an energy analysis experiment based on their literature study.

Kind of topic: dive-in/overview
Supervisor: Radita Liem

Benchmarking for HDF5: What to Expect? (English only)

HDF5 is one of the most popular high-level parallel I/O libraries and is commonly used to handle I/O operations in scientific applications due to its portability and flexibility. However, the benefits offered by HDF5 do not always translate into performance portability of the applications, because differences in system and infrastructure configuration mean that further tuning is still needed. A benchmark is usually needed to get an idea of the baseline performance that can be achieved by a certain system or setup; a small usage sketch follows below.
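
For orientation only (this sketch is not part of the seminar material; the file and dataset names are made up), the following C program shows the kind of parallel HDF5 write pattern that such benchmarks typically measure: every MPI rank writes its own hyperslab of one shared dataset through the MPI-IO driver.

    /* Illustrative sketch: collective parallel write with HDF5 over MPI-IO. */
    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const hsize_t local_n = 1024;              /* elements per rank */
        hsize_t global_n = local_n * size;
        double data[1024];
        for (hsize_t i = 0; i < local_n; ++i) data[i] = rank;

        /* Open the file collectively through the MPI-IO virtual file driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("demo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* One global dataset; each rank selects its own hyperslab. */
        hid_t filespace = H5Screate_simple(1, &global_n, NULL);
        hid_t dset = H5Dcreate2(file, "values", H5T_NATIVE_DOUBLE, filespace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hsize_t offset = local_n * rank;
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &local_n, NULL);
        hid_t memspace = H5Screate_simple(1, &local_n, NULL);

        /* Collective write: usually the interesting case for I/O benchmarks. */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);

        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
        H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
        MPI_Finalize();
        return 0;
    }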

In this seminar thesis, the student needs to dive deep into the existing HDF5 benchmarks and what these benchmarks offer to provide an understanding of HDF5 performance. Literature studies that look into other I/O benchmarking practices and methodologies are also needed to discuss what is currently lacking in the existing HDF5 benchmarks and why an HDF5-specific benchmark is needed. Optionally, this seminar also offers hands-on experience in reproducing the reference paper to get a more in-depth understanding of the topic.

Kind of topic: dive-in/overview
Supervisor: Radita Liem

The Future of Scientific Code Coupling

Traditional multi-scale (e.g., reactive flow simulation) or multi-physics (e.g., fluid-structure interaction) problems often require the coupling of multiple different scientific codes, which leads to the challenge of handling the communication between the codes efficiently at large scale. Multiple libraries have been developed to ease this problem, e.g., the Multiscale Universal Interface, preCICE or MUSCLE. Today, novel technologies from artificial intelligence are applied and investigated to solve complex problems across many different domain sciences. While many traditional simulation codes are mostly optimized to run efficiently on CPU architectures, many AI models can be trained and used for inference very efficiently on hardware accelerators such as GPUs or specialized devices like Tensor Processing Units (TPUs) or Intelligence Processing Units (IPUs). Exploiting this heterogeneous hardware landscape is an HPC challenge. As a result, new coupling libraries specifically designed for deploying AI models in HPC simulation codes are being developed, for example NNPred. At the chair for high-performance computing, we are also developing a library for this purpose.

In this thesis, the student should give an overview of traditional coupling libraries as well as novel libraries for deploying AI models in HPC simulation codes and point out the HPC challenges addressed by these libraries. The student should further compare the traditional and novel approaches, discussing similarities, differences and limitations.

Kind of topic: overview
Supervisor: Fabian Orland

Deep Learning-Based Adaptive Mesh Refinement Techniques

Applications of deep learning (DL) models to HPC problems are becoming increasingly popular. In the field of simulating turbulent (reactive) flows, direct numerical simulations on highly resolved meshes are computationally infeasible for complex scenarios. Large eddy simulations on coarser meshes are computationally feasible, but they introduce the problem of unclosed terms in the governing filtered Navier-Stokes equations, requiring subgrid-scale models to close the equations. Recent studies investigated DL super-resolution with generative adversarial networks for subgrid-scale modeling. However, uniform super-resolution of the simulation mesh might lead to unnecessary computational overhead in regions where physical quantities do not change much. Hence, recent studies also investigate DL super-resolution (e.g., NUNet) or graph convolutional neural networks (e.g., GMR-Net) to produce high-quality non-uniform meshes.

In this thesis, the student should first present the state of the art of conventional mesh refinement methods, also highlighting the limitations of these techniques. Second, the student should present and discuss novel ideas for using DL techniques for adaptive mesh refinement.

Kind of topic: dive-in
Supervisor: Fabian Orland

Efficient Intra-Kernel Communication on GPUs with OpenSHMEM

Using GPUs to solve highly data-parallel problems such as matrix operations is becoming increasingly popular in the HPC area. Scaling such calculations to run on clusters with thousands of GPUs is a common use case. For computation on the GPUs, CUDA or ROCm is used, whereas for the communication of data between the GPUs, the Message Passing Interface (MPI) is utilized. However, MPI communication requires the involvement of the CPU, which adds unnecessary latency. Further, it is not possible to directly move data from one GPU to another GPU within a GPU kernel. Recent approaches such as NVSHMEM by NVIDIA and ROC_SHMEM by AMD implement the PGAS programming model OpenSHMEM on GPUs. Both allow GPU-to-GPU communication within GPU kernels using one-sided communication calls and thereby completely bypass the CPU; a small sketch is shown below. This can significantly reduce the amount of CPU-GPU synchronization and improve the computation-communication overlap in certain application codes.
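
As a rough, illustrative sketch only (the kernel name exchange and the communication pattern are made up for this example), the following CUDA code issues NVSHMEM one-sided puts directly from inside a kernel, so data moves to the neighboring PE without returning control to the CPU.

    /* Illustrative sketch: intra-kernel one-sided communication with NVSHMEM. */
    #include <cuda_runtime.h>
    #include <nvshmem.h>

    __global__ void exchange(double *sym_buf, int n)
    {
        int me   = nvshmem_my_pe();
        int npes = nvshmem_n_pes();
        int peer = (me + 1) % npes;
        int i    = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            /* One-sided put: write element i into the peer's symmetric buffer. */
            nvshmem_double_p(&sym_buf[i], (double)me, peer);
    }

    int main(void)
    {
        nvshmem_init();
        int n = 1 << 20;
        /* Symmetric allocation: the same buffer exists on every PE. */
        double *sym_buf = (double *)nvshmem_malloc(n * sizeof(double));
        exchange<<<(n + 255) / 256, 256>>>(sym_buf, n);
        cudaDeviceSynchronize();
        nvshmem_barrier_all();   /* complete all puts before reusing the buffer */
        nvshmem_free(sym_buf);
        nvshmem_finalize();
        return 0;
    }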

In this seminar thesis, the student should discuss how the utilization of OpenSHMEM on GPUs can improve the scalability of large-scale GPU compute kernels compared to the traditional MPI communication model. Further, the student should present the two approaches NVSHMEM and ROC_SHMEM and discuss their architectural differences and similarities. Optionally, some experiments with simple GPU kernels may be performed on the RWTH CLAIX cluster.

Kind of topic: dive-in
Supervisor: Simon Schwitanski

An Evaluation of Library-Based Partitioned Global Address Space Models

Partitioned Global Address Space (PGAS) programming models aim to increase developer productivity and the performance of parallel applications: They provide an abstraction for distributed-memory systems that allows developers to access distributed data in a single address space. In traditional message passing models (such as MPI), data has to be exchanged explicitly in messages, whereas in PGAS models, the transfer is realized using remote memory access primitives or is even completely transparent to the user. There are different approaches to implementing such a PGAS model: PGAS languages (UPC, Coarray Fortran), directive-based approaches (XMP), and libraries (UPC++, OpenSHMEM, GASPI, MPI RMA) on different abstraction levels targeting different purposes; a small library-based example follows below.
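
Purely for illustration (this example is not taken from the seminar literature), the following C program uses MPI RMA, one of the library-based models listed above, to read a neighbor's data with a one-sided operation instead of matched send/receive calls.

    /* Illustrative sketch: one-sided remote memory access with MPI RMA. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Every rank exposes one integer in a globally accessible window. */
        int *local;
        MPI_Win win;
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &local, &win);
        *local = rank;

        /* Read the value of the right neighbor without any matching receive. */
        int remote = -1, peer = (rank + 1) % size;
        MPI_Win_fence(0, win);
        MPI_Get(&remote, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);

        printf("rank %d read %d from rank %d\n", rank, remote, peer);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }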

In this thesis, the student should do a systematic literature review of library-based PGAS programming models and their use cases. This includes a short presentation of a selected number of approaches as well as a comparison regarding semantics, usability, performance, and productivity. Optionally, the thesis might include experiments on the RWTH CLAIX cluster.

Kind of topic: overview
Supervisor: Simon Schwitanski

Regression Testing Frameworks for HPC Systems

Whether it is to check the operational status of an HPC system after a hardware or software update or to monitor its performance over time, regression tests are important for maintaining an HPC system. In the past, a number of frameworks for regression tests have emerged to ease the implementation of new tests, improve the maintainability of existing tests and increase the portability of tests. The currently available frameworks vary in their approach, their focus on certain types of tests and the richness of their included test suites.

In this thesis, the student should first present an overview of different frameworks for regression tests. Additionally, the presented frameworks shall be compared regarding the supported types of tests (diagnostic vs. benchmarking etc.) as well as the already included tests and their extensibility. A subsequent discussion should make suggestions on which framework to use for a specific purpose (checking the operational state after an update versus monitoring the performance over time).

Kind of topic: overview
Supervisor: Felix Tomski

Towards Dynamic Resource Management with MPI Sessions and PMIx

One important factor limiting the throughput of today's high-performance compute clusters is the static nature of resource allocations in job schedulers. Dynamic resource management for MPI applications has been studied extensively in the past to overcome these limitations. MPI Sessions, a new MPI feature introduced with MPI 4.0, gave rise to new approaches in the field of dynamic resource management; a small sketch of the session interface is given below.
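
As a minimal sketch for orientation (the string tag is arbitrary and chosen only for this example), the following C code shows the basic MPI Sessions pattern: a session is initialized without MPI_Init, a group is derived from a process set, and a communicator is built from that group.

    /* Illustrative sketch: building a communicator from an MPI Session (MPI 4.0). */
    #include <mpi.h>

    int main(void)
    {
        MPI_Session session;
        MPI_Group group;
        MPI_Comm comm;

        /* Initialize a session instead of calling MPI_Init. */
        MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

        /* Derive a group from a process set and build a communicator from it. */
        MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
        MPI_Comm_create_from_group(group, "seminar.example.tag",
                                   MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);

        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_Group_free(&group);
        MPI_Comm_free(&comm);
        MPI_Session_finalize(&session);
        return 0;
    }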

The thesis should discuss the approach to dynamic resource management using MPI Sessions as presented in the starting paper. Additionally, other approaches, either also utilizing MPI Sessions or relying on other mechanisms, may be covered (on a high level) and compared with the one of the starting paper.

Kind of topic: dive-in
Supervisor: Felix Tomski

Supervisors & Organization

Tim Cramer
Joachim Jenke
Radita Liem
Fabian Orland
Simon Schwitanski
Isa Thärigen
Felix Tomski

Contact

Isa Thärigen