Topics
Integrating Quantum Computing into HPC
To reach ever more performant systems for HPC use cases, new hardware components and accelerators are steadily introduced to the HPC landscape. They allow specific workloads to be processed more energy-efficiently than on CPUs. Therefore, accelerator-equipped systems can, if utilized correctly, reach higher performance with less power consumption than CPU-only systems. While GPUs are already common in new HPC systems, there has recently been a wide variety of research around accelerators such as FPGAs, vector engines, or quantum computers. Quantum computers enable the computation of certain workloads many orders of magnitude faster than conventional computing systems. Although (theoretical) quantum computing has been an active research field for decades, recent advances in quantum hardware systems have motivated thinking about how to use quantum computers in HPC.
The seminar thesis should elaborate on the integration of quantum computing into conventional HPC, both system-wise and programming-model-wise. This includes discussing current challenges together with possible solutions. It should also examine methods of integrating quantum computers other than as accelerators. Where possible, concrete performance advantages on HPC use cases should be demonstrated.
Kind of topic: overview
Supervisor: Semih Burak
Static Power Consumption Prediction based on Machine Learning Models
HPC has reached a point where significantly higher performance can only be achieved by more energy-efficient computing. This endeavor led to a trend of introducing a variety of accelerators, e.g., GPUs or FPGAs, that can compute certain workloads not only faster but also with less energy than classical CPUs. While the now increasingly heterogeneous HPC systems have the capability of being more performant for a given power envelope, this capability is only exploited if the parallel workloads are mapped to the hardware units correctly. Correctly means that both performance and power consumption need to be considered so that the resulting execution achieves an optimal trade-off between performance and energy. Such allocation decisions require a power model that can predict the power consumption of a given workload on a given hardware system relatively accurately. Such a model allows, for example, choosing suitable frequencies or running levels for the hardware units.
This seminar thesis should elaborate on static models that can predict the energy consumption without having run the given application even once. Specifically, machine learning models that take information from the application code and hardware characteristics into account should be analyzed, and their advantages and disadvantages compared to other static (or dynamic) approaches. Possibly, the seminar thesis should sketch how purely static concepts can be used to choose the energy-minimizing running levels for the parallel hardware units.
Kind of topic: overview
Supervisor: Semih Burak
NVIDIA Grace Hopper Systems: Assessing CPU+GPU on a Single Board by Performance and Energy
HPC has reached a point where significantly higher performance can only be achieved by more energy-efficient computing. This endeavor led to a trend of introducing a variety of accelerators, e.g., GPUs or FPGAs, that can compute certain workloads not only faster but also with less energy than classical CPUs. In current HPC systems, configurations in which CPU and GPU each have their own memory are becoming popular. The slow link between CPU and GPU memory can often be a performance bottleneck. With the aim of reducing the overhead induced by the necessary data transfers between GPU and CPU memory, new hardware architectures are being introduced. One of them is the new NVIDIA Grace Hopper system, which links CPU and GPU with NVIDIA's fast NVLink on a single board to deliver a coherent CPU+GPU memory model.
This seminar thesis should first analyze this new system thoroughly and then elaborate on its advantages and disadvantages in terms of performance and energy efficiency. In particular, a comparison should be made to the conventional CPU+GPU model.
Kind of topic: dive-in
Supervisor: Semih Burak
Partitioned Collective Communication
Partitioned point-to-point communication was introduced with MPI 4.0. It enables applications to indicate and query partial completion of specially initiated communications. It can be used to reduce the number of messages to be matched or to mitigate slight imbalances when individual threads or tasks prepare individual parts of a larger message. After its initial introduction for point-to-point communication in MPI 4.0, work was started to extend the concept to collective communication as well, both for symmetry's sake and to enable new application use cases and optimization potential.
This seminar thesis should take a close look at the proposed upcoming partitioned collective communication interface and its specific semantic behavior, and discuss application scenarios where the proposed interface can be beneficial.
Kind of topic: dive-in
Supervisor: Marc-André Hermanns
Modern C++ Language Binding for MPI
As the official C++ bindings for the Message Passing Interface (MPI) have been deprecated and removed from the MPI standard, C++ application developers have to fall back on the C bindings to enable MPI communication in their applications. As modern C++ capabilities and practices in particular have come a long way from the common root with C, application developers have strived to create abstraction libraries that enable the use of MPI functionality within a modern C++ context.
This seminar thesis should look into the different approaches taken by those libraries, their similarities and differences, how they enable modern C++ application development, and how far their support for the full MPI functionality has grown so far.
Kind of topic: overview
Supervisor: Marc-André Hermanns
Evaluation of Remote Device Offloading
Most current HPC systems consist of heterogeneous compute nodes equipped with two or more multi-core CPUs as well as several accelerators (e.g., GPUs) that can be used to speed up computations. Especially codes with compute-bound hotspots/kernels that are embarrassingly parallel, or close to it, can profit from offloading the workload to, e.g., a GPU. In recent years, OpenMP, the de-facto standard for shared-memory parallelization, has been extended with device offloading functionalities. However, as the workload, complexity, and resolution of scientific and industrial applications are continuously growing, a single compute node in an HPC cluster and its accelerators might not suffice. Recent efforts investigated remote offloading to distribute work also to accelerators on different machines and reduce the time to solution.
In this seminar thesis, the student should first investigate and compare recent approaches for remote offloading. Further, the goal is to evaluate the performance and capabilities of the currently proposed OpenMP remote offloading extensions. If possible and desired, experiments with remote offloading can be carried out on the RWTH cluster.
Kind of topic: overview
Supervisor: Jannis Klinkenberg
Investigating Task-based Programming Frameworks for Distributed Memory Systems
In the past, scientific and industrial applications mainly consisted of well-balanced and regular workloads. Those workloads allowed a straightforward domain decomposition to evenly distribute work across compute nodes and the use of simple work-sharing techniques to evenly distribute the work across cores/threads within a compute node. Over the last decades, the complexity of and variety in applications have significantly increased. Applications employ recursive and irregular workloads where the work to perform might differ between processing units or can evolve over time, so that classical domain decomposition and work-sharing no longer suffice to efficiently parallelize the application and ensure a proper load balance. Consequently, task-based programming paradigms were introduced - first for shared-memory systems - that allow a more flexible specification of work packages and even of dependencies between them. Several emerging frameworks aim to bring the task-based programming paradigm also to applications running on distributed-memory systems by applying different methodologies and concepts. Examples, among others, are Charm++, Legion, Chameleon, and TaskTorrent.
In this seminar thesis, the student should investigate the capabilities and fundamental conceptual differences of various task-based programming solutions for distributed memory. Further, strengths and weaknesses of the approaches with respect to characteristics like scheduling, load balancing, and dependencies between tasks should be discussed. Finally, some performance measurements for each approach should be presented.
Kind of topic: overview
Supervisor: Jannis Klinkenberg
Efficient Data Mapping in Systems with Heterogeneous Memory
Over the last decades, the architectural enhancements in high performance computing resulted in a constant increase in the computational power of such machines by increasing core frequency, improving the manufacturing process, and introducing multi- and many-core CPUs. In contrast, enhancements on the memory side - most systems use classical DRAM - did not keep up equally with that development, resulting in a growing gap between computational power on the one hand and memory latency and bandwidth on the other. Consequently, several emerging memory technologies have come up that intend to close that gap by tackling the challenges of DRAM. Some examples are High Bandwidth Memory (HBM and HBM2), which provides higher bandwidth but smaller memory capacity, and Non-Volatile Memory (NVM), which has a larger capacity than DRAM but higher latency and lower bandwidth. Compute nodes of the next generation of supercomputers will most likely be equipped with a heterogeneous memory system combining classical DRAM with one or more of the aforementioned technologies. This poses new questions, e.g., where to place data and when to move data between different memory types to use resources efficiently and reduce power consumption and execution time.
In this seminar thesis, the student should present an overview of approaches for managing the placement and movement of data between different kinds of memory. Further, the seminar thesis should discuss the design choices tackling the aforementioned questions and illustrate performance results for scientific applications.
Kind of topic: overview
Supervisor: Jannis Klinkenberg
Distributed Network Topologies for Large-Scale ML/DL Workloads (English only)
Large-scale Machine Learning (ML) and Deep Learning (DL) workloads are becoming ubiquitous due to the recent advances in artificial intelligence, the availability of large data sets, and improvements in hardware architectures. Nowadays, HPC clusters are fully subscribed with ML/DL jobs. To parallelize such jobs, a number of multi-dimensional communication distribution strategies have been developed. Understanding why different communication patterns favor certain topologies is essential for efficient computing and optimal utilization of computing resources.
In this seminar thesis, a student will review distributed network topologies for ML/DL applications, provide an analysis of their performance and discuss the pros and cons of each approach.
Kind of topic: overview
Supervisor: Anara Kozhokanova
Automatic Transformation of Blocking into Non-Blocking MPI Communication
The Message Passing Interface (MPI) is the de-facto standard for message passing and communication between compute nodes. It defines the concepts of blocking and non-blocking communication. In blocking calls, the calling process has to wait until the communication finishes. In contrast, a non-blocking communication consists of two parts, an initialization and a completion call, which define a time window in which the communication has to occur. Between these two calls, the calling process can perform other operations, for example, calculations on buffers unrelated to the communication. As the introduction of non-blocking communication into existing or new codes comes with new challenges like race conditions, the usage of non-blocking communication in legacy applications is not widespread. To simplify its usage and support the user, different works have researched the possibility of automatically transforming blocking into non-blocking communication.
This seminar thesis should present the most recent approaches for automatic transformation of blocking into non-blocking MPI communication and compare them against previous solutions.
Kind of topic: dive-in
Supervisor: Isa Thärigen
OpenMP Taskloop Dependences
OpenMP is a well-known standard for shared-memory parallel computing. One of the easiest ways to parallelize an application with OpenMP is to make use of worksharing constructs like omp parallel for, which parallelizes a for-loop for the user. However, for applications with irregular or recursive parallelism needs, these worksharing constructs are often not applicable. As a result, the OpenMP standard introduced the concept of tasks, which allows the user to specify units of work as tasks that are then scheduled for execution on different threads. OpenMP also adds the possibility to define dependences between sibling tasks. For example, if task 3 depends on tasks 1 and 2, it cannot be scheduled until both task 1 and task 2 have completed. While tasks are a good way to deal with recursive and irregular parallelism, using them in an application is often more complex and needs more lines of code than a simple worksharing construct would take. Therefore, the OpenMP standard introduced the taskloop directive, which automatically partitions a for-loop into tasks. The taskloop construct so far does not support task dependences, but recent works provide first ideas of how this could be added.
This seminar thesis should give an overview of the existing variants of task dependences before discussing the possibilities to extend the taskloop directive to support such dependences as well.
Kind of topic: dive-in
Supervisor: Isa Thärigen
Visualization of Data Movements and Accesses
The performance analysis of parallel applications grows more challenging every year. The data movements and access patterns of an application are an important part of such analyses, as they can heavily impact the performance. Visualization tools can help to identify such performance issues by providing a visual representation of the data accesses of an application.
This seminar thesis should compare two or more data access visualization approaches against each other and discuss their advantages and disadvantages.
Kind of topic: overview
Supervisor: Isa Thärigen
Supervisors & Organization
Semih Burak
Marc-André Hermanns
Jannis Klinkenberg
Anara Kozhokanova
Isa Thärigen