Seminar Current Topics in High-Performance Computing

Content

High-performance computing is applied to speed up long-running scientific applications, for instance the simulation of computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but come in different flavors: from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). Leveraging these systems requires parallel programming with, e.g., MPI, OpenMP or CUDA.

This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.

The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the placement of that topic in its broader context. This includes the appropriate preparation of the concepts, approaches and results of the given topic (also with respect to formalities and the time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to demonstrate independent work by looking beyond the immediate scope of their own topic.

Schedule

This seminar belongs to the area of applied computer science. Topics are assigned during the introductory event, and the students then work on their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or during the exam period. Attendance is compulsory for the introductory event and the presentation block.
Furthermore, we will introduce the students to "Scientific Writing" and "Scientific Presenting" in computer science. Attendance at these two events is also compulsory.

The compulsory introductory event (kickoff) is scheduled for Tuesday, April 8, 10:30 a.m. to 12:30 p.m. The next compulsory meetings are planned for TBA.

Furthermore, the seminar is an in-person event. That means that you need to be personally present for all compulsory parts of the seminar.

Registration / Application

Seats for this seminar are distributed exclusively via the global registration process of the computer science department. We appreciate it if you state your interest in HPC, as well as your prior knowledge of HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section of the registration process.

Requisites

The goals of a seminar series are described in the corresponding Bachelor's and Master's modules. In addition to the seminar thesis and its presentation, Master's students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of each presentation and its author, keeping track of the speaking time, and leading a short discussion after the presentation. Further instructions will be given during the seminar.

Prerequisites

Attendance of the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.

Language

We prefer and encourage students to write the report and give the presentation in English; however, German is also possible.

Types of Topics

We provide two flavors of seminar topics, depending on the particular topic: (a) overview topics and (b) dive-in topics. The names speak for themselves. Nevertheless, this categorization does not necessarily imply a strict "either-or" but rather provides a guideline for addressing the topic. In general, both types of topics are equally demanding, but they pose different challenges. In the topic list below, you can also find the corresponding categorization for each seminar topic.

Topics

Suitable Performance Optimization Approaches for the SPMD IR

In high-performance computing (HPC), compute clusters are getting ever larger and their architecture more and more heterogeneous (multiple compute nodes with CPUs, GPUs, …). This growth aims to satisfy the ever-increasing computational demands (for example, from large simulations or AI models), which have also led to a multitude of parallel programming models. These programming models are used in addition to programming languages to implement software that makes efficient and effective use of the fast, parallel HPC hardware. To ease the development of tools and compiler passes for parallel programming models following the single program, multiple data (SPMD) principle, the SPMD IR (intermediate representation) was introduced. It addresses the problem that tools often support only one model or implement the necessary abstraction internally, which limits extensibility and reusability. The SPMD IR's prototype is implemented in MLIR/LLVM and supports MPI, NCCL, and SHMEM. While its usefulness has been shown for the verification of collective communication, the question remains which performance optimization approaches or compiler passes in modern compiler systems can make use of it and are suitable, and which are not applicable by design.
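For illustration only (this is not code from the SPMD IR itself, whose operations are defined in the original paper), the following minimal MPI sketch in C++ shows the SPMD principle the IR targets: every process executes the same program on its own portion of the data and synchronizes through collectives, and it is exactly such collective calls that a model-spanning IR can expose to compiler analyses and optimizations.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Same program on every process, but each rank works on its own data.
    double local = static_cast<double>(rank + 1);
    double global = 0.0;

    // A collective reduction combines the per-rank results on all ranks.
    // Collectives like this one (MPI here, NCCL/SHMEM analogously) are what
    // an SPMD-level IR can represent uniformly for compiler passes.
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}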

This seminar thesis is supposed to conduct a systematic literature review of approaches that perform performance optimization in the context of any of the supported programming models, or that are part of LLVM or GCC as compiler passes, and that could leverage the additional information provided by the SPMD IR. After analyzing and understanding the SPMD IR, the student is supposed to give an overview of the identified approaches and discuss their applicability to the SPMD IR.

Kind of topic: overview
Supervisor: Semih Burak

Suitable Correctness Verification Approaches for the SPMD IR

In high-performance computing (HPC), compute clusters are getting ever larger and their architecture more and more heterogeneous (multiple compute nodes with CPUs, GPUs, …). This growth aims to satisfy the ever-increasing computational demands (for example, from large simulations or AI models), which have also led to a multitude of parallel programming models. These programming models are used in addition to programming languages to implement software that makes efficient and effective use of the fast, parallel HPC hardware. To ease the development of tools and compiler passes for parallel programming models following the single program, multiple data (SPMD) principle, the SPMD IR (intermediate representation) was introduced. It addresses the problem that tools often support only one model or implement the necessary abstraction internally, which limits extensibility and reusability. The SPMD IR's prototype is implemented in MLIR/LLVM and supports MPI, NCCL, and SHMEM. While its usefulness has been shown for the verification of collective communication, the question remains which correctness verification approaches or compiler passes in modern compiler systems can make use of it and are suitable, and which are not applicable by design.
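As a purely illustrative example (again not taken from the SPMD IR work), the following MPI snippet contains the kind of collective mismatch that verification of collective communication is meant to detect; a static analysis operating on a model-spanning IR could flag such rank-dependent divergence for MPI, NCCL, or SHMEM alike.

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = rank, sum = 0;
    if (rank == 0) {
        // Rank 0 starts a broadcast ...
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } else {
        // ... while all other ranks enter a reduction: the collectives do
        // not match across ranks, so the program may deadlock.
        MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}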

This seminar thesis is supposed to conduct a systematic literature review of approaches that perform correctness verification in the context of any of the supported programming models, or that are part of LLVM or GCC as compiler passes, and that could leverage the additional information provided by the SPMD IR. After analyzing and understanding the SPMD IR, the student is supposed to give an overview of the identified approaches and discuss their applicability to the SPMD IR.

Kind of topic: overview
Supervisor: Semih Burak

Machine and Reinforcement Learning Techniques for Data Placement in Multi-Tier Main Memory Systems

In modern computing, the performance gap between compute and memory continues to widen, especially in multi-core and accelerated systems. To address this, memory subsystems are evolving with new technologies like high-bandwidth memory (HBM) alongside traditional DDR, and byte-addressable non-volatile memory (NVM), creating heterogeneous memory systems. These systems offer trade-offs between capacity and speed, making optimal data allocation across memory tiers a key research challenge.
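To make the placement problem concrete, the following C++ sketch (assuming the memkind library is installed) allocates data explicitly on different memory tiers by hand; the runtime, kernel-level, and ML/RL-based approaches discussed in this topic aim to automate exactly this kind of decision.

#include <memkind.h>
#include <cstdlib>

int main() {
    const size_t n = 1 << 20;

    // Hot, bandwidth-bound array: prefer high-bandwidth memory,
    // fall back to DDR if no HBM is present.
    double* hot = static_cast<double*>(
        memkind_malloc(MEMKIND_HBW_PREFERRED, n * sizeof(double)));

    // Rarely touched array: regular DDR is sufficient.
    double* cold = static_cast<double*>(
        memkind_malloc(MEMKIND_DEFAULT, n * sizeof(double)));

    if (!hot || !cold) return EXIT_FAILURE;

    // ... compute on hot/cold ...

    memkind_free(MEMKIND_HBW_PREFERRED, hot);
    memkind_free(MEMKIND_DEFAULT, cold);
    return 0;
}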

Various approaches tackle this problem, including runtime systems for data placement, kernel-level solutions for automatic memory tiering and, more recently, machine learning (ML) and reinforcement learning (RL) techniques that predict future memory access patterns and guide data placement and movement. This seminar thesis should compare such approaches, focusing on ML- and RL-based methods, explaining how they function, assessing their efficiency, and providing a critical evaluation.

Kind of topic: dive-in
Supervisor: Jannis Klinkenberg

Improved Dynamic Task Scheduling on Heterogeneous CPU-GPU Architectures

Modern HPC setups often feature heterogeneous architectures, combining multi-core CPUs with multiple accelerators such as GPGPUs to accelerate specific simulation or program components. As these systems and entire HPC infrastructures grow increasingly complex, performance varies depending on data location and access patterns across computational units. Efficient use of heterogeneous systems is essential for scientists and industry to run computationally intensive applications. Task-based programming has proven effective for leveraging such systems, but achieving high performance requires careful scheduling that accounts for data locality, affinity, load balancing, and the individual properties and capabilities of the different compute units.
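As a minimal sketch of what task-based programming on such systems looks like (assuming an OpenMP compiler with offloading support), the example below connects a CPU task and a GPU target task via dependencies; which unit executes which task, and when, is left to the runtime scheduler, which is exactly where locality- and affinity-aware scheduling techniques come in.

#include <cstdio>

int main() {
    const int n = 1 << 16;
    static double a[1 << 16], b[1 << 16];

    #pragma omp parallel
    #pragma omp single
    {
        // CPU task: initialize the input array.
        #pragma omp task depend(out: a)
        for (int i = 0; i < n; ++i) a[i] = i;

        // GPU task: runs once the CPU task is done; executes on the device.
        #pragma omp target nowait depend(in: a) depend(out: b) \
            map(to: a) map(from: b)
        {
            #pragma omp teams distribute parallel for
            for (int i = 0; i < n; ++i) b[i] = 2.0 * a[i];
        }

        // Wait for all tasks; a locality/affinity-aware scheduler would
        // additionally decide where each task runs and keep data close.
        #pragma omp taskwait
    }

    std::printf("b[42] = %f\n", b[42]);
    return 0;
}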

This seminar thesis should review existing approaches to these challenges, exploring heuristics and scheduling techniques in detail. It should also present performance results and provide a critical evaluation of the methodologies.

Kind of topic: overview/dive-in
Supervisor: Jannis Klinkenberg

Storage and Memory Tiering for Data-Intensive HPC Applications

In order to accommodate the large data needs of modern machine learning workloads, large amounts of system memory are required. Since fast memory modules are very expensive compared to hard drives or even SSDs, it is not cost-effective to provide the needed capacity as system memory. To tackle this problem, a tiered storage system has been proposed that automatically manages data stored on the different storage levels within an HPC cluster, such as memory, local hard drives and parallel file systems.

The seminar thesis should contain an overview of storage technologies in HPC, including their cost and performance characteristics. The tiering approach is to be evaluated in terms of ease of use and performance, and the provided performance numbers should be critically assessed. There is an opportunity to evaluate the benefits of the new technology on the CLAIX supercomputer by leveraging the different available storage systems.

Kind of topic: dive-in
Supervisor: Philipp Martin

Managing Data Post-Processing on Contested Systems

HPC workloads produce large amounts of data that needs to be post-processed and analysed. It is often not known beforehand how much analysis is needed, since the analysis will change and evolve depending on the data. The post-processing steps are therefore interactive and often performed on shared systems, which implies interference from other users. While HPC systems often contain multiple storage systems of differing speeds, the fastest storage system might not always be the best one to use due to that interference. The approach considered here introduces a model for interference and chooses file systems based on expected interference and performance to increase the overall I/O speed.

The seminar thesis is expected to explain the hierarchical storage model and the interference simulation presented in the source work. Optionally, the approach can be evaluated with regard to its applicability on CLAIX, including interference measurements.

Kind of topic: dive-in
Supervisor: Philipp Martin

Collectives and Communication for Multi-GPU ML Training

The training of machine learning models has also arrived in the HPC sector, and various models are now trained on classical HPC systems. Distributed machine learning across multiple nodes and GPUs imposes special requirements on communication, and its performance depends heavily on efficient implementations. Libraries like NCCL implement collective operations that are optimized for GPUs and influenced by HPC standards like MPI. Ongoing efforts to optimize communication have led to new optimization approaches.
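To make the relation between MPI-style collectives and GPU collectives concrete, the following C++ sketch (assuming a machine with CUDA and NCCL installed, and omitting error handling) runs an all-reduce across all GPUs of a single process with NCCL; it mirrors what MPI_Allreduce does for CPU ranks, while NCCL's topology-aware optimizations happen inside this call.

#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 1) return 0;

    // One communicator, stream, and device buffer pair per GPU.
    std::vector<ncclComm_t> comms(ndev);
    std::vector<cudaStream_t> streams(ndev);
    std::vector<float*> sendbuf(ndev), recvbuf(ndev);
    const size_t count = 1 << 20;

    ncclCommInitAll(comms.data(), ndev, nullptr);  // devices 0..ndev-1

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaMemset(sendbuf[i], 0, count * sizeof(float));
    }

    // Group the per-GPU calls so NCCL launches them as one collective,
    // analogous to MPI_Allreduce(..., MPI_SUM, comm) across CPU ranks.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        ncclCommDestroy(comms[i]);
    }
    std::printf("all-reduce across %d GPU(s) completed\n", ndev);
    return 0;
}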

In this thesis, the student should perform a literature survey of current efforts in optimizing communication operations for GPU-based distributed machine learning. Different approaches should be compared and evaluated regarding their effectiveness and their applicability to different applications, and possible challenges should be discussed.

Kind of topic: overview
Supervisor: Dominik Viehhauser

Investigation of Performance Analysis Tools for Machine Learning Applications

The number of machine learning applications keeps rising, and most are trained or executed on GPU systems. To ensure efficient use of the available hardware, these workloads have to be understood and analyzed. This can be done by employing performance analysis tools such as profilers to record and collect performance metrics about an application's runtime. Available tools range from vendor tools like NVIDIA Nsight to in-framework profilers like PyTorch's internal profiler. Using their metrics to judge the performance of machine learning training and inference is not straightforward and requires a more detailed look at the applicability of certain metrics to these workloads.
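As a small illustration of how such tools obtain application-level context (a sketch only; header path and build flags depend on the CUDA toolkit version), the following C++ example annotates the phases of a training step with NVTX ranges, which profilers such as Nsight Systems display alongside the recorded GPU activity; PyTorch exposes comparable annotations through torch.cuda.nvtx.

#include <nvToolsExt.h>  // NVTX; linking requirements are toolkit-dependent

void forward_pass()  { /* ... launch kernels ... */ }
void backward_pass() { /* ... launch kernels ... */ }

void training_step() {
    nvtxRangePushA("forward");    // opens a named range on the timeline
    forward_pass();
    nvtxRangePop();

    nvtxRangePushA("backward");
    backward_pass();
    nvtxRangePop();
}

int main() {
    for (int step = 0; step < 10; ++step) {
        nvtxRangePushA("training_step");
        training_step();
        nvtxRangePop();
    }
    return 0;
}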

In this thesis, the student will conduct a literature review of different performance analysis tools and approaches and investigate how suitable they are for analyzing machine learning training and inference. This includes characterizing the challenges that arise when analyzing these workloads and comparing different approaches that are currently proposed.

Kind of topic: overview
Supervisor: Dominik Viehhauser

Evaluating Energy Saving and Power Control Mechanisms for HPC

Increased energy prices and the fluctuating power generation from renewable energy sources create new challenges for the cost-efficient operation of sustainable HPC systems. At the same time, the underlying hardware is becoming more heterogeneous, featuring different compute and memory technologies within the same deployment. These developments raise the following question: Which energy saving and power control mechanisms can be leveraged in modern HPC systems to address these complex operational challenges?
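As one concrete example of such a mechanism (a sketch only, with an illustrative 80% cap; changing the limit requires appropriate privileges), the following C++ snippet uses NVML to query and cap the power limit of a GPU; comparable CPU-side knobs include DVFS via the Linux cpufreq interface and RAPL power limits.

#include <nvml.h>
#include <cstdio>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        unsigned int minmw = 0, maxmw = 0, draw = 0;

        // Allowed power-limit range and current draw, both in milliwatts.
        nvmlDeviceGetPowerManagementLimitConstraints(dev, &minmw, &maxmw);
        nvmlDeviceGetPowerUsage(dev, &draw);
        std::printf("limit range: %u-%u mW, current draw: %u mW\n",
                    minmw, maxmw, draw);

        // Cap the GPU at roughly 80% of its maximum limit (illustrative
        // value) to trade some performance for lower power and energy cost.
        nvmlDeviceSetPowerManagementLimit(dev, (maxmw / 10) * 8);
    }

    nvmlShutdown();
    return 0;
}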

In this thesis, the student should categorize different approaches from related energy-efficiency research, contrasting their effectiveness and critically evaluating their operational applicability in a production setting. In each scenario, emphasis should be placed on the nature of the interaction between the HPC user and the operator to achieve the desired improvements. Considerations specific to the CLAIX-2023 cluster of RWTH Aachen University are welcome, as they can provide helpful guidance for improving the real-world operation of the cluster.

Kind of topic: overview
Supervisor: Christian Wassermann

Calculating Carbon Footprints for Today’s HPC Systems

To limit climate change, many countries are targeting net-zero emissions. A significant portion of today's emissions originates from data centers and HPC systems due to their reliance on electricity. A first step towards reducing their carbon emissions is an accurate and traceable calculation methodology. This calculation should include not only the operational carbon footprint from ongoing operation but also the embodied carbon footprint attributed to one-off activities such as the production of all components.
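As a purely illustrative calculation with hypothetical numbers: a system drawing 1 MW on average consumes about 8,760 MWh per year; at a grid carbon intensity of 0.4 kg CO2e per kWh, this amounts to roughly 3,500 t CO2e of operational emissions per year, while an assumed embodied footprint of 5,000 t CO2e from manufacturing would add another 1,000 t CO2e per year when amortized over a five-year system lifetime.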

This seminar thesis should compare different carbon accounting methodologies from the related literature, addressing the following questions: Which system components are accounted for? How is each component accounted for? What data are the calculations based on? What is the scope of the calculations? Can the presented methods be easily transferred to other systems?

Kind of topic: overview
Supervisor: Christian Wassermann

Supervisors & Organization

Semih Burak
Jannis Klinkenberg
Philipp Martin
Ben Thärigen
Dominik Viehhauser
Christian Wassermann

Contact

Ben Thärigen