Topics
Performance Evaluation of Next-Generation Vector Supercomputers
Vector architectures have a long tradition in High Performance Computing (HPC). Although the emergence of x86 commodity clusters interrupted this trend for a while, we are seeing a renaissance of SIMD capabilities in all modern architectures. Not only the success of general-purpose GPU computation during the last decade, but also the trend towards longer SIMD registers in common CPUs contributes to this development. Vendors of next-generation vector supercomputers like NEC promise to be competitive with Nvidia or modern ARM architectures in terms of performance and energy efficiency.
This seminar article and talk are expected to give an overview of the principles of vector computing and of how next-generation designs like the SX-Aurora TSUBASA VE30 improve performance on basic and relevant industry benchmarks. This overview should include a discussion of the improvements in the VE30 architecture (e.g., a private third-level cache, new instructions) as well as a comparison to other architectures like the Fujitsu A64FX, Nvidia A100 or Intel x86. The seminar candidate has the opportunity to verify the state and the plausibility of the benchmarks on the first generation of the SX-Aurora TSUBASA (V10).
Kind of topic: dive-in
Supervisor: Tim Cramer
Analyzing Resource Utilization in Large HPC Systems
For many domain scientists, computer simulation forms (together with theory and experiment) the third pillar of science. Since hardware resources are limited and costly, HPC centers have to ensure the efficient usage of their systems. Furthermore, analyzing and understanding the utilization of the various resources forms the basis for purchasing decisions in future procurements. One challenge here is to combine the multitude of HPC resources and applications into an overall picture.
This seminar article and talk are expected to give an overview of metrics and measurements that help HPC centers assess the overall utilization of their systems. This includes metrics for CPU, GPU and memory utilization. Furthermore, the thesis has to compare existing case studies from other supercomputers like NERSC's Perlmutter or Oak Ridge's Titan. The seminar candidate may use the performance monitoring system of RWTH's CLAIX supercomputer in order to compare its possibilities with those of other large HPC centers.
Kind of topic: dive-in
Supervisor: Tim Cramer
Strategies for Optimizing OpenMP Target Offloading in Applications
An increasing number of applications adopt OpenMP target offloading to use GPUs for computation. Different optimization techniques may be necessary to work around different performance bottlenecks; these techniques include the fusion of target kernels as well as the optimization of data allocation and data transfers.
This seminar thesis will provide an overview of different optimization strategies used for OpenMP target offloading.
Kind of topic: overview
Supervisor: Joachim Jenke
Compiler Techniques for Improving OpenMP Target Offloading
OpenMP target offloading often does not reach the performance of optimized CUDA kernels. The LLVM/Clang compiler only recently introduced techniques such as just-in-time compilation and link-time optimization for device code with the goal of improving the performance of OpenMP target offloading.
This seminar thesis will motivate and present the different compiler techniques used to optimize OpenMP target offloading. Furthermore, the thesis should evaluate the impact of these techniques using different OpenMP offloading applications.
Kind of topic: dive-in
Supervisor: Joachim Jenke
Energy Consumption Characterization for HPC Applications (English only)
Energy efficiency is one of the important current topics in HPC due to rising climate concerns. When devising an energy optimization strategy for HPC applications, the strategy is often tailored to specific applications, since HPC applications can have different energy consumption patterns depending on many aspects, such as the computation pattern (compute-bound, memory-bound, or I/O-bound), dependency components, and the deployment architecture. Because of that, energy characterization is a crucial step in energy optimization work. Many considerations must be taken into account during energy characterization, and multiple tools and state-of-the-art practices can be explored to produce a holistic understanding of an application's energy consumption.
In this seminar, the student will conduct a literature study to evaluate current methodologies used for energy characterization depending on the type of HPC application and the corresponding infrastructure. Optionally, this seminar project also offers a hands-on project to characterize the energy consumption patterns of several applications from the NHR4CES benchmarks. The student can run an energy analysis experiment based on their literature study.
Kind of topic: dive-in/overview
Supervisor: Radita Liem
Benchmarking HDF5: What to Expect? (English only)
HDF5 is one of the most popular high-level parallel I/O libraries, commonly used to handle I/O operations in scientific applications due to its portability and flexibility. However, the benefits offered by HDF5 do not always translate into performance portability for applications: differences in system and infrastructure configuration mean that further tuning is still needed. A benchmark is usually required to get an idea of the baseline performance that can be achieved on a certain system or setup.
In this seminar thesis, the student needs to dive deep into the existing HDF5 benchmarks and what these benchmarks offer to provide an understanding of HDF5 performance. Literature studies that look into other I/O benchmarking practices and methodologies are also needed to discuss what is currently lacking in the existing HDF5 benchmarks and why an HDF5-specific benchmark is needed. Optionally, this seminar also offers hands-on experience in reproducing the reference paper to gain a more in-depth understanding of the topic.
Kind of topic: dive-in/overview
Supervisor: Radita Liem
The Future of Scientific Code Coupling
Traditional multi-scale (e.g. reactive flow simulation) or multi-physics (e.g. fluid-structure interaction) problems often require the coupling of multiple different scientific codes, which leads to the challenge of handling communication between the codes efficiently at large scale. Multiple libraries have been developed to ease this problem, e.g. the Multiscale Universal Interface, preCICE or MUSCLE. Today, novel technologies from artificial intelligence are applied and investigated to solve complex problems across many different domain sciences. While many traditional simulation codes are mostly optimized to run efficiently on CPU architectures, many AI models can be trained and used for inference very efficiently on hardware accelerators such as GPUs or specialized devices like Tensor Processing Units (TPUs) or Intelligence Processing Units (IPUs). Exploiting this heterogeneous hardware landscape is an HPC challenge. As a result, new coupling libraries specifically designed for deploying AI models in HPC simulation codes are being developed, for example NNPred. At the Chair for High-Performance Computing we are also developing a library for this purpose.
In this thesis, the student should give an overview of traditional coupling libraries as well as novel libraries to deploy AI models into HPC simulation codes and point out the HPC challenges addressed by these libraries. The student should further compare the traditional and novel approaches, discussing similarities, differences and limitations.
Kind of topic: overview
Supervisor: Fabian Orland
Deep Learning-Based Adaptive Mesh Refinement Techniques
Applications of deep learning (DL) models to HPC problems are becoming increasingly popular. In the field of simulating turbulent (reactive) flows, direct numerical simulations on highly resolved meshes are computationally infeasible for complex scenarios. Instead, large eddy simulations on coarser meshes are computationally feasible but introduce the problem of unclosed terms in the governing filtered Navier-Stokes equations, requiring subgrid-scale models to close the equations. Recent studies investigated DL super-resolution with generative adversarial networks for subgrid-scale modeling. However, uniform super-resolution of the simulation mesh might lead to unnecessary computational overhead in regions where physical quantities do not change much. Hence, recent studies also investigate DL super-resolution (e.g. NUNet) or graph convolutional neural networks (e.g. GMR-Net) to produce high-quality non-uniform meshes.
In this thesis, the student should first present the state of the art in conventional mesh refinement methods, also highlighting the limitations of these techniques. Second, the student should present and discuss novel ideas for using DL techniques for adaptive mesh refinement.
Kind of topic: dive-in
Supervisor: Fabian Orland
Efficient Intra-Kernel Communication on GPUs with OpenSHMEM
Using GPUs to solve highly data-parallel problems such as matrix operations is becoming increasingly popular in the HPC area. Scaling such calculations to run on clusters with thousands of GPUs is a common use case. For computation on GPUs, CUDA or ROCm is used, whereas for the communication of data between GPUs, the Message Passing Interface (MPI) is utilized. However, MPI communication requires the involvement of the CPU, which adds unnecessary latency. Further, it is not possible to move data directly from one GPU to another within a GPU kernel. Recent approaches such as NVSHMEM by NVIDIA and ROC_SHMEM by AMD implement the PGAS programming model OpenSHMEM on GPUs. Both allow GPU-to-GPU communication within GPU kernels using one-sided communication calls and thereby completely bypass the CPU. This can significantly reduce the amount of CPU-GPU synchronization and improve the computation-communication overlap in certain application codes.
In this seminar thesis, the student should discuss how the utilization of OpenSHMEM on GPUs can improve the scalability of large-scale GPU compute kernels compared to the traditional MPI communication model. Further, the student should present the two different approaches NVSHMEM and ROC_SHMEM and discuss architectural differences and similarities. Optionally, some experiments with simple GPU kernels might be performed on the RWTH CLAIX cluster.
Kind of topic: dive-in
Supervisor: Simon Schwitanski
An Evaluation of Library-Based Partitioned Global Address Space Models
Partitioned Global Address Space (PGAS) programming models aim to increase developer productivity and the performance of parallel applications: They provide an abstraction for distributed-memory systems that allows developers to access distributed data in a single address space. In traditional message-passing models (such as MPI), data has to be exchanged explicitly in messages, whereas in PGAS models, the transfer is realized using remote memory access primitives or is even completely transparent to the user. There are different approaches to implementing such a PGAS model: PGAS languages (UPC, Coarray Fortran), directive-based approaches (XMP), and libraries (UPC++, OpenSHMEM, GASPI, MPI RMA) on different abstraction levels targeting different purposes.
In this thesis, the student should do a systematic literature review of library-based PGAS programming models and their use cases. This includes a short presentation of a selected number of approaches as well as a comparison regarding semantics, usability, performance, and productivity. Optionally, the thesis might include experiments on the RWTH CLAIX cluster.
Kind of topic: overview
Supervisor: Simon Schwitanski
Regression Testing Frameworks for HPC Systems
Whether to check the operational status of an HPC system after a hardware or software update, or to monitor its performance over time, regression tests are important for maintaining an HPC system. In the past, a number of frameworks for regression tests have emerged to ease the implementation of new tests, improve the maintainability of existing tests and increase the portability of tests. The currently available frameworks vary in their approach, in the types of tests they focus on and in the richness of their included test suites.
In this thesis, the student should first present an overview of different frameworks for regression tests. Additionally, the presented frameworks shall be compared regarding the supported types of tests (diagnostic vs. benchmarking etc.) as well as the already included tests and their extensibility. A subsequent discussion should make suggestions on which framework to use for a specific purpose (checking the operational state after an update versus monitoring the performance over time).
Kind of topic: overview
Supervisor: Felix Tomski
Towards Dynamic Resource Management with MPI Sessions and PMIx
One important factor limiting the throughput of today's high-performance compute clusters is the static nature of resource allocations in job schedulers. Dynamic resource management for MPI applications has been studied extensively in the past to overcome these limitations. MPI Sessions, a new MPI feature introduced with MPI 4.0, gave rise to new approaches in the field of dynamic resource management.
The thesis should discuss the approach to dynamic resource management using MPI Sessions as presented in the starting paper. Additionally, other approaches, either also utilizing MPI Sessions or relying on other mechanisms, may be covered at a high level and compared with the approach of the starting paper.
Kind of topic: dive-in
Supervisor: Felix Tomski
Supervisors & Organization
Tim Cramer
Joachim Jenke
Radita Liem
Fabian Orland
Simon Schwitanski
Isa Thärigen
Felix Tomski