Topics
Processing-In-Memory for General Purpose HPC Applications
Many HPC workloads are limited by memory bandwidth. Processing-In-Memory (PIM) is a computer-architecture concept that addresses this bottleneck: computing operations are carried out directly within the memory rather than shuttling data back and forth between memory and the CPU. PIM aims to enhance the performance and efficiency of computers, especially for data-intensive applications. However, the adoption of PIM requires user-friendly programming concepts.
This seminar thesis shall investigate how HPC workloads can benefit from the use of PIM and how PIM can be applied in general-purpose programming models. The paper is expected to provide a general overview of PIM and analyze how it can be utilized for HPC applications, including a discussion of potential performance improvements. Optionally, the usability of PIM on CLAIX-2023 can be analyzed and tested. The seminar candidate has the opportunity to conduct an in-depth analysis of an existing end-to-end compiler implementation or to provide a broader overview of the topic, including a corresponding literature review.
Kind of topic: dive-in
Supervisor: Tim Cramer
Exploring Unified Memory for OpenMP Target Offloading on Modern Architectures
The trend towards utilizing powerful accelerators in High-Performance Computing (HPC) continues to gain momentum. However, the complexity of these advanced hardware architectures often poses challenges for porting traditional HPC workloads. Unified memory architectures and parallel programming paradigms like OpenMP offer potential solutions to these challenges.
This seminar thesis aims to explore the potential of unified memory within OpenMP 5.2. The study must provide a comprehensive overview of this feature, along with an analysis of its advantages and disadvantages compared to explicit memory mapping techniques. Performance measurements of traditional applications such as OpenFOAM running on contemporary GPUs must be presented and analyzed. Additionally, the applicability and transferability of these use cases to the CLAIX-2023 infrastructure can optionally be examined.
Kind of topic: dive-in
Supervisor: Tim Cramer
Modelling the performance of MPI communication using non-contiguous data types
The starting paper applies performance measurement and modelling techniques to quantify the performance of different implementations and optimization approaches for non-contiguous data communication on a variety of systems. It demonstrates that modern communication system designs can result in widely varying and difficult-to-predict performance, even within the same hardware/communication-software combination.
The seminar thesis will summarize the results from the paper and either dive in to reproduce the measurements on our cluster using the available MPI implementations or provide a broader overview of performance modelling studies for classes of MPI communication patterns.
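To make the notion of non-contiguous data communication concrete, the following sketch models the byte layout described by an MPI_Type_vector(count, blocklength, stride) derived datatype, the classic way to describe data such as a matrix column, and packs it into a contiguous buffer. This is an illustrative pure-Python model, not MPI itself; all names and numbers are invented for the example.

```python
# Illustrative sketch (not MPI itself): model the byte layout described by an
# MPI_Type_vector(count, blocklength, stride) derived datatype, the classic
# way to describe non-contiguous data such as a matrix column.
import struct

def type_vector_offsets(count, blocklength, stride, elem_size):
    """Return the byte offsets covered by an MPI_Type_vector-like layout."""
    offsets = []
    for block in range(count):
        base = block * stride * elem_size
        for elem in range(blocklength):
            offsets.append(base + elem * elem_size)
    return offsets

def pack(buffer, offsets, elem_size):
    """Gather the non-contiguous bytes into one contiguous send buffer."""
    return b"".join(buffer[o:o + elem_size] for o in offsets)

# A 4x4 matrix of 8-byte doubles stored row-major: one column is
# 4 blocks of 1 element with a stride of 4 elements.
matrix = struct.pack("16d", *range(16))
offs = type_vector_offsets(count=4, blocklength=1, stride=4, elem_size=8)
col0 = struct.unpack("4d", pack(matrix, offs, 8))
print(col0)  # the first column: (0.0, 4.0, 8.0, 12.0)
```

Whether an MPI implementation packs like this internally or streams the strided data directly to the network is exactly the kind of implementation choice whose performance the starting paper measures.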
Kind of topic: overview/dive-in
Supervisor: Joachim Jenke
Synchronizing MPI Processes in Space and Time
Performance benchmarks are an integral part of the development and evaluation of parallel algorithms, both in distributed applications and in MPI implementations themselves. The initial step of the benchmark process is to obtain a common timestamp marking the start of an operation across all involved processes, and the state of the art in many applications and widely used MPI benchmark suites is the use of MPI barriers. The starting paper makes the point that a barrier only synchronizes in space, but not in time.
The seminar thesis will summarize the results from the paper and reproduce the measurements on our cluster using the available MPI implementations.
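The space-versus-time distinction can be illustrated with a toy simulation (invented numbers, not a real MPI run): a barrier guarantees that no process starts before all have arrived, but the barrier-exit times can still be skewed, whereas agreeing on a common future start timestamp limits the skew to the clock-synchronization error.

```python
# Toy simulation of the starting paper's point: barrier exit is skewed in
# time, a window-based (agreed start timestamp) scheme is not. All timing
# values below are invented for illustration.
import random

random.seed(42)
NPROCS = 8

# Model each process's barrier-exit time: the barrier releases processes
# with some network-dependent jitter.
barrier_exit = [10.0 + random.uniform(0.0, 0.5) for _ in range(NPROCS)]
barrier_skew = max(barrier_exit) - min(barrier_exit)

# Window-based alternative: processes agree on a start timestamp far enough
# in the future; the remaining skew is only the clock-synchronization error.
clock_error = [random.uniform(-0.01, 0.01) for _ in range(NPROCS)]
window_start = 12.0
window_exit = [window_start + e for e in clock_error]
window_skew = max(window_exit) - min(window_exit)

print(f"barrier start skew: {barrier_skew:.3f}")
print(f"window  start skew: {window_skew:.3f}")
assert window_skew < barrier_skew  # time-based sync gives a tighter start
```
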
Kind of topic: dive-in
Supervisor: Joachim Jenke
Understanding I/O temporal behavior in the era of HPC-AI Applications (English only)
I/O behavior in HPC clusters is known to be periodic and bursty. This often creates resource contention in the network and storage, leading to performance variability. These temporal patterns differ from one application to another, which further increases the performance variability observed inside the cluster. With AI applications becoming part of the daily workloads on HPC infrastructure, additional I/O behaviors are expected to appear. Current research proposes the use of frequency techniques to characterize and predict I/O temporal behavior. This idea is interesting since it contrasts with the current trend of using machine learning to provide such understanding. It would be ideal if these frequency techniques were also applicable to modern HPC-AI workloads.
In this topic, the student is expected to conduct a comprehensive literature study on the changes in I/O temporal behavior between traditional HPC applications and modern AI workloads. The student is then expected to provide an insightful analysis of the applicability of frequency techniques for understanding HPC-AI workloads, or to explore other potentially effective techniques. Alternatively, the student can take a more hands-on approach and test FTIO against machine-learning I/O benchmarks such as DLIO.
Kind of topic: overview/dive-in
Supervisor: Radita Liem
Optimizing Energy and Performance in Heterogeneous Computing Architecture (English only)
The integration of GPUs into modern HPC infrastructure makes CPU-GPU heterogeneous systems increasingly common. At the same time, energy optimization is gaining importance. Energy optimization strategies and models need to accommodate this heterogeneous infrastructure. Multiple techniques and solutions have been developed to provide insights into energy characterization, such as HEMP (Heterogeneity Energy-Minimizer with Performance constraints), which targets heterogeneous cores.
In this topic, the student is expected to perform a literature study to identify other techniques for energy optimization in heterogeneous infrastructures that involve GPUs, and to discuss whether the HEMP technique can be used to model CPU-GPU performance instead of only heterogeneous cores.
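The shape of the underlying optimization problem, minimizing energy subject to a performance constraint, can be sketched as follows. The cubic power model and all numbers are deliberately simplified illustrative assumptions, not the actual HEMP formulation.

```python
# Simplified sketch of energy minimization under a performance constraint:
# pick the operating frequency with the lowest energy that still meets the
# deadline. Power model (dynamic f^3 term + static term) is an assumption.

def pick_frequency(work, deadline, freqs, dyn=1.0, static=0.5):
    """Return (freq, energy) minimizing energy with runtime <= deadline."""
    best = None
    for f in freqs:
        runtime = work / f                  # simple linear speed model
        if runtime > deadline:
            continue                        # violates the performance constraint
        power = dyn * f ** 3 + static       # dynamic + static power
        energy = power * runtime
        if best is None or energy < best[1]:
            best = (f, energy)
    return best

freqs = [0.5, 1.0, 1.5, 2.0]                # normalized frequencies
print(pick_frequency(work=10.0, deadline=15.0, freqs=freqs))  # → (1.0, 15.0)
```

The lowest feasible frequency wins here because dynamic power grows cubically while runtime shrinks only linearly; a CPU-GPU extension would add a second device with its own speed and power model to the same search.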
Kind of topic: overview
Supervisor: Radita Liem
The Current State of Libraries for Coupling Numerical Simulations with Machine Learning
Machine learning is a promising technique to complement traditional numerical simulations of physical problems. For example, in the field of reactive flow simulation, machine learning models have been investigated for data-driven turbulence modeling or reduced-order modeling for combustion closure. In many cases the trained models show promisingly high accuracy when evaluated. However, the most important evaluation is to test the model's predictive performance in a real simulation. This requires coupling traditional numerical simulation codes with the trained machine learning model and leads to the computational challenge of exploiting heterogeneous architectures consisting of CPUs and GPUs. Recently, many different projects have emerged to deploy machine learning models in numerical simulation codes.
In this thesis the student should conduct a literature study to present an overview of currently available software libraries for coupling numerical simulations with machine learning models. The student should elaborate on important features of these libraries, including supported ML frameworks, support for GPU acceleration, and support for parallel execution on distributed systems. Moreover, the student should highlight limitations of the investigated approaches. Optionally, own experiments with some of the investigated libraries may be conducted on the ML partition of CLAIX-2023.
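The coupling pattern that such libraries implement can be reduced to a small sketch: the numerical solver calls a trained model inside its time loop. Here a plain Python callable stands in for a real TensorFlow/PyTorch model; the solver, the closure term, and all numbers are hypothetical.

```python
# Minimal sketch of simulation/ML coupling: the solver calls the trained
# model every time step to obtain a closure/source term.

def surrogate_closure(state):
    """Stand-in for a trained ML closure model (hypothetical)."""
    return -0.1 * state                      # toy relaxation term

def simulate(steps, dt, model, state=1.0):
    """Explicit time stepping where the model supplies a source term."""
    for _ in range(steps):
        state = state + dt * model(state)    # solver <-> model handshake
    return state

final = simulate(steps=10, dt=0.1, model=surrogate_closure)
print(final)
```

The libraries surveyed in the thesis essentially industrialize the `model(state)` call above: marshalling solver data (often Fortran/C arrays) into framework tensors, moving them to the GPU, and doing so efficiently across MPI ranks.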
Kind of topic: overview
Supervisor: Fabian Orland
Challenges of Hybrid CPU + GPU Computing in Machine Learning Applications
In recent years, machine learning models have been successfully applied to model complex physical phenomena, complementing traditional numerical simulations. Before a model can produce physically meaningful predictions, it needs to be trained on a huge dataset. Training large models requires a tremendous amount of computational power and would not be feasible without GPUs or other specialized hardware accelerators. For example, training a famous large language model takes around 34 days on 1024 V100 GPUs. Many applications offload most of the training computations, or the inference of the model, to GPUs, leaving the available CPU cores underutilized. However, with the advent of the first exascale computers, it will become important to fully exploit all available computational capacities.
In this thesis the student should first conduct a literature study to get an overview about existing success stories regarding hybrid CPU+GPU computing applied to traditional numerical computations. In a second step, the student should investigate current approaches applying hybrid CPU+GPU computation to machine learning applications. In the end, the student should compare these approaches and evaluate the special challenges of hybrid computations with respect to machine learning applications.
Kind of topic: dive-in
Supervisor: Fabian Orland
Collective Algorithms for the Exascale Era
With the ever-growing scale of modern supercomputers, the role of efficient communication becomes more important. Surveys indicate that collective communication makes up about 25% to 50% of the execution time of current HPC applications. Today's collective algorithms often do not leverage modern hardware and software features, missing out on important performance optimizations.
To prepare for the Exascale era, the student shall conduct a literature survey on modern collective algorithms for large-scale HPC networks. Selected collective algorithms for Exascale computing and their specific optimizations should be explained in the thesis, together with the achieved performance improvements. Optionally, the student may investigate the impact of different collective algorithms on CLAIX using micro-benchmarks and/or real-world applications.
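As a concrete example of the kind of algorithm such a survey covers, the following toy simulation implements recursive doubling, a classic allreduce scheme: with p ranks (p a power of two), every rank holds the full reduction after log2(p) exchange rounds instead of p-1 sequential steps. The simulation runs all "ranks" in one process purely for illustration.

```python
# Toy simulation of recursive doubling, a classic allreduce algorithm.
# In round r, rank i exchanges its partial sum with rank i XOR 2^r.

def allreduce_recursive_doubling(values):
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    vals = list(values)
    rounds = 0
    dist = 1
    while dist < p:
        # Each rank pairs with the rank at XOR-distance `dist` and both
        # combine their partial results.
        vals = [vals[r] + vals[r ^ dist] for r in range(p)]
        dist *= 2
        rounds += 1
    return vals, rounds

vals, rounds = allreduce_recursive_doubling([1, 2, 3, 4, 5, 6, 7, 8])
print(vals)    # every rank ends with the full sum 36
print(rounds)  # 3 rounds = log2(8)
```

Modern Exascale-oriented variants build on such schemes with topology awareness, in-network computation, and non-power-of-two handling.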
Kind of topic: overview
Supervisor: Felix Tomski
Program Partitioning and Deadlock Analysis for MPI Based on Logical Clocks
Deadlocks in parallel programs are notoriously difficult to debug because of the program's non-determinism. Most approaches either suffer from state explosion when trying to mitigate false negatives by exploring all possible paths, or miss deadlocks when considering only one specific execution path. The proposed approach tackles the path-explosion problem by dividing the program into multiple communication-independent partitions, which can be checked for deadlocks independently of each other, reducing the search space.
In this thesis, the student should present the proposed deadlock detection approach. The explanation should also cover improvements and shortcomings of the new approach compared to other deadlock detection approaches, which should be briefly discussed as well. Optionally, in case the tool is available, the student may perform own experiments with the tool on benchmarks from other test suites, such as MPI-CorrBench or the MPI Bugs Initiative.
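The logical-clock machinery such approaches build on can be sketched briefly: vector clocks order message events via the happens-before relation, and events that remain unordered ("concurrent") can belong to communication-independent parts of the program that may be analyzed separately. The three-rank scenario below is invented for illustration.

```python
# Minimal vector-clock sketch: establish happens-before between message
# events; unordered events are concurrent (communication-independent).

def local_event(clock, rank):
    """Advance the local component for an event on `rank`."""
    clock = list(clock)
    clock[rank] += 1
    return clock

def recv_event(clock, rank, msg_clock):
    """Merge the sender's clock, then count the receive as a local event."""
    clock = [max(a, b) for a, b in zip(clock, msg_clock)]
    return local_event(clock, rank)

def happens_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

# Three ranks; rank 0 sends to rank 1, rank 2 works independently.
c0 = local_event([0, 0, 0], 0)          # send event on rank 0
c1 = recv_event([0, 0, 0], 1, c0)       # matching receive on rank 1
c2 = local_event([0, 0, 0], 2)          # unrelated event on rank 2

print(happens_before(c0, c1))                            # True: ordered
print(happens_before(c0, c2) or happens_before(c2, c0))  # False: concurrent
```

In the partitioning approach, events like `c2` that are concurrent with an entire communication phase are exactly what allows the program to be split into independently checkable parts.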
Kind of topic: dive-in
Supervisor: Felix Tomski
ML-Benchmarking for HPC Procurements
Benchmarking is an essential part of the procurement of HPC systems to ensure that the requested performance of the new cluster will be, and was, delivered by the vendor. While scientific HPC-focused benchmarks for CPUs and GPUs have been used extensively in the past, benchmarking ML-related scientific applications on HPC clusters introduces new challenges, and only limited experience exists. For example, such benchmarks need to work on different device architectures (e.g., NVIDIA and AMD GPUs) across multiple compute nodes, consider device memory restrictions, account for reduced or mixed precision, and deliver reproducible runtimes (e.g., by defining run rules) within a given time frame. Moreover, they need to be sufficiently sophisticated to be part of an official HPC tender.
This seminar thesis shall investigate ML benchmarking for HPC systems, focusing on the challenges of procuring an HPC system while keeping the specialties of ML applications in mind. ML benchmarks and benchmark initiatives such as MLPerf HPC, HPL-AI, Deep500, SPEC ML, or LLMFoundry shall be investigated, compared with each other, and their challenges and potential solutions examined (where possible). Ideally, details on whether the benchmarks have already been used in an HPC procurement shall be added.
Optional: Own experiences of using ML benchmarks can be presented after trying some of them on our HPC ML-partition of CLAIX-2023.
Kind of topic: overview
Supervisor: Sandra Wienke
Supervisors
Tim Cramer
Joachim Jenke
Radita Liem
Fabian Orland
Ben Thärigen
Felix Tomski
Sandra Wienke