Topics
Processing-In-Memory for General Purpose HPC Applications
Many HPC workloads are limited by memory bandwidth. Processing-In-Memory (PIM) is a computer-architecture concept that addresses this bottleneck: computing operations are carried out directly within the memory rather than shuttling data back and forth between memory and the CPU. PIM aims to enhance the performance and efficiency of computers, especially for data-intensive applications. However, the adoption of PIM requires user-friendly programming concepts.
This seminar thesis shall investigate how HPC workloads can benefit from the use of PIM and how PIM can be applied in general-purpose programming models. The paper is expected to provide a general overview of PIM and analyze how it can be utilized for HPC applications, including a discussion of potential performance improvements. Optionally, the usability of PIM on CLAIX-2023 can be analyzed and tested. The seminar candidate has the opportunity to conduct an in-depth analysis of an existing end-to-end compiler implementation or to provide a broader overview of the topic, including a corresponding literature review.
Kind of topic: dive-in
Supervisor: Tim Cramer
Exploring Unified Memory for OpenMP Target Offloading on Modern Architectures
The trend towards utilizing powerful accelerators in High-Performance Computing (HPC) continues to gain momentum. However, the complexity of these advanced hardware architectures often poses challenges for porting traditional HPC workloads. Unified memory architectures and parallel programming paradigms like OpenMP offer potential solutions to these challenges.
This seminar thesis aims to explore the potential of unified memory within OpenMP 5.2. The study must provide a comprehensive overview of this feature, along with an analysis of its advantages and disadvantages compared to explicit memory mapping techniques. Performance measurements of traditional applications such as OpenFOAM running on contemporary GPUs must be presented and analyzed. Additionally, the applicability and transferability of these use cases to the CLAIX-2023 infrastructure can optionally be examined.
Kind of topic: dive-in
Supervisor: Tim Cramer
Modelling the performance of MPI communication using non-contiguous data types
The starting paper applies performance measurement and modelling techniques to quantify the performance of different implementations and optimization approaches for non-contiguous data communication on a variety of systems. It demonstrates that modern communication system designs can result in widely varying and difficult-to-predict performance, even within the same hardware/communication-software combination.
The seminar thesis will summarize the results from the paper and either dive in to reproduce the measurements on our cluster using the available MPI implementations or provide a broader overview of performance modelling studies for classes of MPI communication patterns.
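To make the notion of non-contiguous data communication concrete, the following sketch models the byte layout described by an MPI_Type_vector(count, blocklength, stride) derived datatype, the classic way to describe data such as a matrix column, and packs it into a contiguous buffer. This is an illustrative pure-Python model, not MPI itself; all names and numbers are invented for the example.

```python
# Illustrative sketch (not MPI itself): model the byte layout described by an
# MPI_Type_vector(count, blocklength, stride) derived datatype, the classic
# way to describe non-contiguous data such as a matrix column.
import struct

def type_vector_offsets(count, blocklength, stride, elem_size):
    """Return the byte offsets covered by an MPI_Type_vector-like layout."""
    offsets = []
    for block in range(count):
        base = block * stride * elem_size
        for elem in range(blocklength):
            offsets.append(base + elem * elem_size)
    return offsets

def pack(buffer, offsets, elem_size):
    """Gather the non-contiguous bytes into one contiguous send buffer."""
    return b"".join(buffer[o:o + elem_size] for o in offsets)

# A 4x4 matrix of 8-byte doubles stored row-major: one column is
# 4 blocks of 1 element with a stride of 4 elements.
matrix = struct.pack("16d", *range(16))
offs = type_vector_offsets(count=4, blocklength=1, stride=4, elem_size=8)
col0 = struct.unpack("4d", pack(matrix, offs, 8))
print(col0)  # the first column: (0.0, 4.0, 8.0, 12.0)
```

Whether an MPI implementation packs like this internally or streams the strided data directly to the network is exactly the kind of implementation choice whose performance the starting paper measures.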
Kind of topic: overview/dive-in
Supervisor: Joachim Jenke
Synchronizing MPI Processes in Space and Time
Performance benchmarks are an integral part of the development and evaluation of parallel algorithms, both in distributed applications and in MPI implementations themselves. The initial step of the benchmark process is to obtain a common timestamp marking the start of an operation across all involved processes, and the state of the art in many applications and widely used MPI benchmark suites is the use of MPI barriers. The starting paper makes the point that a barrier only synchronizes in space, but not in time.
The seminar thesis will summarize the results from the paper and reproduce the measurements on our cluster using the available MPI implementations.
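The space-versus-time distinction can be illustrated with a toy simulation (invented numbers, not a real MPI run): a barrier guarantees that no process starts before all have arrived, but the barrier-exit times can still be skewed, whereas agreeing on a common future start timestamp limits the skew to the clock-synchronization error.

```python
# Toy simulation of the starting paper's point: barrier exit is skewed in
# time, a window-based (agreed start timestamp) scheme is not. All timing
# values below are invented for illustration.
import random

random.seed(42)
NPROCS = 8

# Model each process's barrier-exit time: the barrier releases processes
# with some network-dependent jitter.
barrier_exit = [10.0 + random.uniform(0.0, 0.5) for _ in range(NPROCS)]
barrier_skew = max(barrier_exit) - min(barrier_exit)

# Window-based alternative: processes agree on a start timestamp far enough
# in the future; the remaining skew is only the clock-synchronization error.
clock_error = [random.uniform(-0.01, 0.01) for _ in range(NPROCS)]
window_start = 12.0
window_exit = [window_start + e for e in clock_error]
window_skew = max(window_exit) - min(window_exit)

print(f"barrier start skew: {barrier_skew:.3f}")
print(f"window  start skew: {window_skew:.3f}")
assert window_skew < barrier_skew  # time-based sync gives a tighter start
```
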
Kind of topic: dive-in
Supervisor: Joachim Jenke
Understanding I/O temporal behavior in the era of HPC-AI Applications (English only)
I/O behavior in HPC clusters is known to be periodic and bursty. This often creates resource contention in the network and storage, leading to performance variability. These temporal patterns differ from one application to another, which further increases the performance variability observed inside the cluster. With AI applications becoming part of the daily workloads on HPC infrastructure, additional I/O behaviors are expected to appear. Current research proposes the use of frequency techniques to characterize and predict I/O temporal behavior. This idea is interesting since it contrasts with the current trend of using machine learning to provide such understanding. It would be ideal if these frequency techniques were also applicable to modern HPC-AI workloads.
In this topic, the student is expected to conduct a comprehensive literature study on the changes in I/O temporal behavior between traditional HPC applications and modern AI workloads. The student is then expected to provide an insightful analysis of the applicability of frequency techniques for understanding HPC-AI workloads, or to explore other potentially effective techniques. Alternatively, the student can take a more hands-on approach and test FTIO against machine-learning I/O benchmarks such as DLIO.
Kind of topic: overview/dive-in
Supervisor: Radita Liem
Optimizing Energy and Performance in Heterogeneous Computing Architecture (English only)
The integration of GPUs into modern HPC infrastructure makes CPU-GPU heterogeneous systems increasingly common. At the same time, energy optimization is gaining importance. Energy optimization strategies and models need to accommodate this heterogeneous infrastructure. Multiple techniques and solutions have been developed to provide insights into energy characterization, such as HEMP (Heterogeneity Energy-Minimizer with Performance constraints), which targets heterogeneous cores.
In this topic, the student is expected to perform a literature study to identify other techniques for energy optimization in heterogeneous infrastructures that involve GPUs, and to discuss whether the HEMP technique can be used to model CPU-GPU performance instead of only heterogeneous cores.
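The shape of the underlying optimization problem, minimizing energy subject to a performance constraint, can be sketched as follows. The cubic power model and all numbers are deliberately simplified illustrative assumptions, not the actual HEMP formulation.

```python
# Simplified sketch of energy minimization under a performance constraint:
# pick the operating frequency with the lowest energy that still meets the
# deadline. Power model (dynamic f^3 term + static term) is an assumption.

def pick_frequency(work, deadline, freqs, dyn=1.0, static=0.5):
    """Return (freq, energy) minimizing energy with runtime <= deadline."""
    best = None
    for f in freqs:
        runtime = work / f                  # simple linear speed model
        if runtime > deadline:
            continue                        # violates the performance constraint
        power = dyn * f ** 3 + static       # dynamic + static power
        energy = power * runtime
        if best is None or energy < best[1]:
            best = (f, energy)
    return best

freqs = [0.5, 1.0, 1.5, 2.0]                # normalized frequencies
print(pick_frequency(work=10.0, deadline=15.0, freqs=freqs))  # → (1.0, 15.0)
```

The lowest feasible frequency wins here because dynamic power grows cubically while runtime shrinks only linearly; a CPU-GPU extension would add a second device with its own speed and power model to the same search.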
Kind of topic: overview
Supervisor: Radita Liem
The Current State of Libraries for Coupling Numerical Simulations with Machine Learning
Machine learning is a promising technique to complement traditional numerical simulations of physical problems. For example, in the field of reactive flow simulation, machine learning models have been investigated for data-driven turbulence modeling or reduced-order modeling for combustion closure. In many cases the trained models show promisingly high accuracy when evaluated. However, the most important evaluation is to test the model's predictive performance in a real simulation. This requires coupling traditional numerical simulation codes with the trained machine learning model and leads to the computational challenge of exploiting heterogeneous architectures consisting of CPUs and GPUs. Recently, many different projects have emerged to deploy machine learning models in numerical simulation codes.
In this thesis the student should conduct a literature study to present an overview of currently available software libraries for coupling numerical simulations with machine learning models. The student should elaborate on important features of these libraries, including supported ML frameworks, support for GPU acceleration, and support for parallel execution on distributed systems. Moreover, the student should highlight limitations of the investigated approaches. Optionally, own experiments with some of the investigated libraries may be conducted on the ML partition of CLAIX-2023.
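The coupling pattern that such libraries implement can be reduced to a small sketch: the numerical solver calls a trained model inside its time loop. Here a plain Python callable stands in for a real TensorFlow/PyTorch model; the solver, the closure term, and all numbers are hypothetical.

```python
# Minimal sketch of simulation/ML coupling: the solver calls the trained
# model every time step to obtain a closure/source term.

def surrogate_closure(state):
    """Stand-in for a trained ML closure model (hypothetical)."""
    return -0.1 * state                      # toy relaxation term

def simulate(steps, dt, model, state=1.0):
    """Explicit time stepping where the model supplies a source term."""
    for _ in range(steps):
        state = state + dt * model(state)    # solver <-> model handshake
    return state

final = simulate(steps=10, dt=0.1, model=surrogate_closure)
print(final)
```

The libraries surveyed in the thesis essentially industrialize the `model(state)` call above: marshalling solver data (often Fortran/C arrays) into framework tensors, moving them to the GPU, and doing so efficiently across MPI ranks.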
Kind of topic: overview
Supervisor: Fabian Orland
Challenges of Hybrid CPU + GPU Computing in Machine Learning Applications
In recent years, machine learning models have been successfully applied to model complex physical phenomena, complementing traditional numerical simulations. Before a model can produce physically meaningful predictions, it needs to be trained on a huge dataset. Training large models requires a tremendous amount of computational power and would not be feasible without GPUs or other specialized hardware accelerators. For example, training a famous large language model takes around 34 days on 1024 V100 GPUs. Many applications offload most of the training computations, or the inference of the model, to GPUs, leaving the available CPU cores underutilized. However, with the advent of the first exascale computers, it will become important to fully exploit all available computational capacities.
In this thesis the student should first conduct a literature study to get an overview about existing success stories regarding hybrid CPU+GPU computing applied to traditional numerical computations. In a second step, the student should investigate current approaches applying hybrid CPU+GPU computation to machine learning applications. In the end, the student should compare these approaches and evaluate the special challenges of hybrid computations with respect to machine learning applications.
Kind of topic: dive-in
Supervisor: Fabian Orland
Collective Algorithms for the Exascale Era
With the ever-growing scale of modern supercomputers, the role of efficient communication becomes more important. Surveys indicate that collective communication makes up about 25% to 50% of the execution time of current HPC applications. Today's collective algorithms often do not leverage modern hardware and software features, missing out on important performance optimizations.
To prepare for the Exascale era, the student shall conduct a literature survey on modern collective algorithms for large-scale HPC networks. Selected collective algorithms for Exascale computing and their specific optimizations should be explained in the thesis, together with the achieved performance improvements. Optionally, the student may investigate the impact of different collective algorithms on CLAIX using micro-benchmarks and/or real-world applications.
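As a concrete example of the kind of algorithm such a survey covers, the following toy simulation implements recursive doubling, a classic allreduce scheme: with p ranks (p a power of two), every rank holds the full reduction after log2(p) exchange rounds instead of p-1 sequential steps. The simulation runs all "ranks" in one process purely for illustration.

```python
# Toy simulation of recursive doubling, a classic allreduce algorithm.
# In round r, rank i exchanges its partial sum with rank i XOR 2^r.

def allreduce_recursive_doubling(values):
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    vals = list(values)
    rounds = 0
    dist = 1
    while dist < p:
        # Each rank pairs with the rank at XOR-distance `dist` and both
        # combine their partial results.
        vals = [vals[r] + vals[r ^ dist] for r in range(p)]
        dist *= 2
        rounds += 1
    return vals, rounds

vals, rounds = allreduce_recursive_doubling([1, 2, 3, 4, 5, 6, 7, 8])
print(vals)    # every rank ends with the full sum 36
print(rounds)  # 3 rounds = log2(8)
```

Modern Exascale-oriented variants build on such schemes with topology awareness, in-network computation, and non-power-of-two handling.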
Kind of topic: overview
Supervisor: Felix Tomski
Program Partitioning and Deadlock Analysis for MPI Based on Logical Clocks
Deadlocks in parallel programs are notoriously difficult to debug because of the program's non-determinism. Most approaches either suffer from state explosion when trying to mitigate false negatives by exploring all possible paths, or miss deadlocks when considering only one specific execution path. The proposed approach tackles the path-explosion problem by dividing the program into multiple communication-independent partitions, which can be checked for deadlocks independently of each other, reducing the search space.
In this thesis, the student should present the proposed deadlock detection approach. The explanation should also cover improvements and shortcomings of the new approach compared to other deadlock detection approaches, which should be briefly discussed as well. Optionally, in case the tool is available, the student may perform own experiments with the tool on benchmarks from other test suites, such as MPI-CorrBench or the MPI Bugs Initiative.
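The logical-clock machinery such approaches build on can be sketched briefly: vector clocks order message events via the happens-before relation, and events that remain unordered ("concurrent") can belong to communication-independent parts of the program that may be analyzed separately. The three-rank scenario below is invented for illustration.

```python
# Minimal vector-clock sketch: establish happens-before between message
# events; unordered events are concurrent (communication-independent).

def local_event(clock, rank):
    """Advance the local component for an event on `rank`."""
    clock = list(clock)
    clock[rank] += 1
    return clock

def recv_event(clock, rank, msg_clock):
    """Merge the sender's clock, then count the receive as a local event."""
    clock = [max(a, b) for a, b in zip(clock, msg_clock)]
    return local_event(clock, rank)

def happens_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

# Three ranks; rank 0 sends to rank 1, rank 2 works independently.
c0 = local_event([0, 0, 0], 0)          # send event on rank 0
c1 = recv_event([0, 0, 0], 1, c0)       # matching receive on rank 1
c2 = local_event([0, 0, 0], 2)          # unrelated event on rank 2

print(happens_before(c0, c1))                            # True: ordered
print(happens_before(c0, c2) or happens_before(c2, c0))  # False: concurrent
```

In the partitioning approach, events like `c2` that are concurrent with an entire communication phase are exactly what allows the program to be split into independently checkable parts.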
Kind of topic: dive-in
Supervisor: Felix Tomski
ML-Benchmarking for HPC Procurements
Benchmarking is an essential part of the procurement of HPC systems to ensure that the requested performance of the new cluster will be, and was, delivered by the vendor. While scientific HPC-focused benchmarks for CPUs and GPUs have been used extensively in the past, benchmarking ML-related scientific applications on HPC clusters introduces new challenges, and only limited experience exists. For example, such benchmarks need to work on different device architectures (e.g., NVIDIA and AMD GPUs) across multiple compute nodes, consider device memory restrictions, account for reduced or mixed precision, and deliver reproducible runtimes (e.g., by defining run rules) within a given time frame. Moreover, they need to be sufficiently sophisticated to be part of an official HPC tender.
This seminar thesis shall investigate ML benchmarking for HPC systems, focusing on the challenges of procuring an HPC system while keeping the specialties of ML applications in mind. ML benchmarks and benchmark initiatives such as MLPerf HPC, HPL-AI, Deep500, SPEC ML, or LLMFoundry shall be investigated, compared with each other, and their challenges and potential solutions examined (where possible). Ideally, details on whether the benchmarks have already been used in an HPC procurement shall be added.
Optional: Own experiences of using ML benchmarks can be presented after trying some of them on our HPC ML-partition of CLAIX-2023.
Kind of topic: overview
Supervisor: Sandra Wienke
Supervisors
Tim Cramer
Joachim Jenke
Radita Liem
Fabian Orland
Ben Thärigen
Felix Tomski
Sandra Wienke