Seminar Current Topics in High-Performance Computing
High-performance computing is applied to speed up long-running scientific applications, for instance simulations in computational fluid dynamics (CFD). Today's supercomputers are often based on commodity processors, but they come in different flavors: from clusters and (large) shared-memory systems to accelerators (e.g., GPUs). To leverage these systems, parallel computing with, e.g., MPI, OpenMP or CUDA must be applied.
This seminar focuses on current research topics in the area of HPC and is based on conference and journal papers. Topics might cover, e.g., novel parallel computer architectures and technologies, parallel programming models, current methods for performance analysis & correctness checking of parallel programs, performance modeling or energy efficiency of HPC systems. The seminar consists of a written study and the presentation of a specific topic.
The objectives of this seminar are the independent elaboration of an advanced topic in the area of high-performance computing and the classification of the topic in the overall context. This includes the appropriate preparation of concepts, approaches and results of the given topic (also with respect to formalities and time schedule), as well as a clear presentation of the contents. Furthermore, students are expected to demonstrate independent work by looking beyond the immediate scope of their assigned topic.
This seminar belongs to the area of applied computer science. The topics are assigned during the introductory event. The students then work out their topics over the course of the semester. The corresponding presentations take place as a block course on one day (or two days) at the end of the lecture period or in the exam period. Attendance is compulsory for the introductory event and the presentation block.
More information is available in RWTHmoodle.
Seats for this seminar are distributed by the global registration process of the computer science department only. We appreciate it if you state your interest in HPC, as well as your prior knowledge in HPC (e.g., relevant lectures, software labs, and seminars that you have passed), in the corresponding section during the registration process.
The goals of a seminar series are described in the corresponding Bachelor and Master modules. In addition to the seminar thesis and its presentation, Master students will have to lead one set of presentations (roughly 3 presentations) as session chair. A session chair makes sure that the session runs smoothly. This includes introducing the title of the presentation and its authors, keeping track of the speaker time and leading a short discussion after the presentation. Further instructions will be given during the seminar.
Attending the lecture "Introduction to High-Performance Computing" (Prof. Müller) is helpful, but not required.
We prefer and encourage students to write the report and give the presentation in English, but German is also possible.
A Massively Parallel Infrastructure for Adaptive Multiscale Simulations
When computationally modelling biological (or other soft-matter) systems and processes, one often faces a particular challenge: the phenomena under investigation depend on microscopic details but evolve over much larger, macroscopic length- and time-scales. Multiscale modelling has become increasingly important to bridge this gap. Quite another challenge is to execute such models on current petascale computers with their high levels of parallelism and heterogeneous architectures. A recent answer to these challenges is a massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), which couples a macro scale model spanning micrometer length- and millisecond time-scales with a micro scale model employing molecular dynamics simulations. MuMMI is a transferable infrastructure designed for scalability and efficient execution on heterogeneous architectures: a central workflow manager simultaneously allocates GPUs and CPUs while robustly handling failures in compute nodes, communication networks, and filesystems.
The thesis should present the MuMMI infrastructure, elucidate its machinery, and compare it with other approaches.
Data-centric Programming Models for High-Performance Computing
The demand for solving computationally intensive problems, especially problems involving huge datasets, is rising rapidly. To satisfy this demand, computing resources are becoming more specialized and heterogeneous. As a result, application developers must invest considerable effort to map their applications efficiently onto such hardware architectures. To reduce this effort and accelerate development, several data-centric programming models have been proposed. These include software abstractions and domain-specific languages that allow for high-level specifications of the applications. Furthermore, runtime systems and resource managers have been proposed to ease the mapping to heterogeneous systems.
The goal of this seminar thesis is to provide an overview of the landscape of the data-centric programming models for HPC workloads. This includes a comparison of the models and an investigation into their use cases. If possible, experiments with different models can be carried out.
Auto-tuning of High-Performance Computing Applications
Auto-tuning describes a technique to optimize the performance of applications empirically. It leverages data collected during the execution of an application, which can extend the static analyses of compilers. Furthermore, it allows for machine-dependent optimization and thus supports the performance portability of applications. An auto-tuner typically consists of an application-specific search space that includes the tuning parameters, a cost function to optimize for, and an automatic search algorithm to minimize the cost function. The design of such an auto-tuner can require extensive knowledge from the domain developer, and auto-tuning has therefore typically been used for optimized libraries or kernels in the domain of HPC.
This seminar thesis shall analyze the state-of-the-art auto-tuning frameworks for HPC applications and investigate the use of general auto-tuning frameworks. If possible, experiments with the investigated auto-tuning frameworks can be carried out.
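The three ingredients named above (search space, cost function, search algorithm) can be illustrated with a minimal sketch. It is not taken from any particular auto-tuning framework: the toy kernel `blocked_sum`, the tuning parameter `block_size`, and the exhaustive search strategy are assumptions chosen only for illustration.

```python
import itertools
import time


def blocked_sum(data, block_size):
    """Hypothetical toy kernel: sum a list in chunks of block_size elements."""
    total = 0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total


def measure(kernel, data, params, repeats=3):
    """Cost function: best wall-clock time over a few repetitions."""
    best = float("inf")
    for _ in range(repeats):
        begin = time.perf_counter()
        kernel(data, **params)
        best = min(best, time.perf_counter() - begin)
    return best


def exhaustive_search(kernel, data, search_space):
    """Search algorithm: try every configuration and keep the cheapest one."""
    names = list(search_space)
    best_params, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        cost = measure(kernel, data, params)
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost


data = list(range(100_000))
# Search space: one tuning parameter with three candidate values (assumed).
search_space = {"block_size": [64, 1024, 16384]}
best_params, best_cost = exhaustive_search(blocked_sum, data, search_space)
print(best_params, best_cost)
```

The selected configuration depends on the machine the sketch runs on, which is exactly the machine-dependent aspect of empirical tuning. Real auto-tuners usually replace the exhaustive loop with a smarter search, because realistic search spaces are too large to enumerate.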
Pattern-based Languages for High-Performance Computing
Parallel programming is a challenging and time-consuming task. To improve the development process, best practices in the form of parallel design patterns are leveraged. They contain template solutions for commonly occurring problems such as map and reduce operations. Several approaches for integrating these design patterns into the development process have been proposed, including language extensions, parallel programming models, intermediate representations, and development processes to find and explore parallelism.
The goal of this seminar thesis is to investigate the different pattern-based languages for HPC applications. This includes a comparison of the models and an investigation into their use cases. If possible, experiments with different models can be carried out.
Evolution of Parallel Matrix-Matrix-Multiplication
One of the most fundamental operations in scientific computations involving linear algebra algorithms is matrix-matrix multiplication. Since the computational cost of the classical algorithm scales cubically with the matrix dimension, it can account for a significant portion of the runtime of an HPC application. Hence, optimizing this operation is of crucial interest to speed up a broad variety of applications that build upon it. Traditional algorithms decompose the matrix data and map it onto a grid of processors. Most recently, a new algorithm called COSMA has been developed, which optimizes matrix-matrix multiplication by minimizing data transfers.
The thesis should present the evolution of parallel matrix-matrix-multiplication algorithms by giving an overview of different algorithms, eventually concluding with the COSMA algorithm.
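To make the traditional decomposition idea concrete, the following minimal sketch simulates a q x q processor grid sequentially in plain Python. It is a generic 2D block decomposition, not COSMA and not any specific published algorithm; the function names and the restriction to square matrices whose size is divisible by q are assumptions made for brevity.

```python
def matmul(a, b):
    """Reference triple loop: n * m * k multiply-add operations."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for l in range(k):
                c[i][j] += a[i][l] * b[l][j]
    return c


def grid_matmul(a, b, q):
    """Simulate a q x q processor grid; each 'processor' owns one block of C.

    Assumes square n x n matrices with n divisible by q.
    """
    n = len(a)
    bs = n // q  # edge length of the block owned by one processor
    c = [[0] * n for _ in range(n)]
    for pr in range(q):      # processor row
        for pc in range(q):  # processor column
            # Processor (pr, pc) needs block row pr of A and block column pc
            # of B, i.e. 2 * n * (n / q) input elements in total.
            for i in range(pr * bs, (pr + 1) * bs):
                for j in range(pc * bs, (pc + 1) * bs):
                    c[i][j] = sum(a[i][l] * b[l][j] for l in range(n))
    return c


a = [[i + j for j in range(4)] for i in range(4)]
b = [[i * j + 1 for j in range(4)] for i in range(4)]
print(grid_matmul(a, b, 2) == matmul(a, b))  # prints True
```

In a real distributed implementation, the input panels noted in the comment would have to be communicated between processors, which is why the amount of data each processor needs is the quantity that communication-optimal algorithms try to reduce.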
AI in HPC applications: Learning to simulate complex physical models
Data-driven approaches such as machine learning are becoming more and more popular across different disciplines. In computer vision, for example, neural networks have been successfully trained to recognize humans in an image or video stream for use in self-driving cars. Moreover, in speech recognition, neural networks are able to translate from one language to another.
In HPC-related software these techniques are not yet used very often. HPC applications simulating complex physics need to produce results with high accuracy, whereas machine learning techniques naturally introduce a certain error when learning a model. These methods will probably not fully replace scientific simulation approaches that have matured over many decades, but they can complement and support them. They are able to produce a preview of a realistic physics simulation in a small fraction of the time of a full-fledged simulation run, which can be used for prototyping or exploring different parameters of physical models. Machine learning can also be used to investigate cases where no solution strategy, equation or model is known yet.
Recently, convolutional neural networks were successfully applied to accelerate an Eulerian fluid simulation by replacing the very expensive "pressure projection" step with a trained network. Furthermore, another work proposed a "physics-informed neural network" that is able to accurately learn partial differential equations describing complex physics.
In this thesis the student should present an overview of the current state of the art of using machine learning techniques in HPC applications to learn complex physical models. To this end, the student should start by investigating "physics-informed neural networks" and then explore other approaches. The student should present current approaches from different fields such as computational fluid dynamics (CFD), molecular dynamics (MD), density functional theory (DFT), or others. Finally, the student should also discuss problems and limitations of each presented approach.
Embedding of time series data
With the advent of regular sampling of Hardware Performance Data in modern HPC centers, the handling of time series data gains in importance. The challenge is to store this data efficiently and to extract and use the information it contains while suppressing, and ideally dropping, noise.
The focus of this seminar topic is to investigate the embedding methodology for time series and to contrast it with other dimensionality reduction techniques for time series.
Supervised vs unsupervised learning in HPC anomaly detection
With the advent of regular sampling of Hardware Performance Data in modern HPC centers, the classification of anomalous behaviour of both the hardware and the user is an important goal in current HPC operations. One way to do this is to use machine learning approaches in either supervised or unsupervised form.
The focus of this seminar topic is to give an overview of the learning systems currently employed in HPC and to compare the supervised with the unsupervised approaches.
Performance Variation and system utilization
Performance Variation is an important metric of modern HPC systems. If an application has differing runtimes on different compute nodes at different times, this causes problems in both scheduling and load balancing, reducing the efficiency of the HPC system. Analyzing the dependence between system utilization and this performance variation could help reduce the impact of this problem by adjusting the scheduling to correct for this behaviour.
The focus of this seminar topic is to give an overview of performance variability and its correlation with, or causation by, system utilization. Optionally, this can be investigated on the RWTH Aachen Cluster.
Detecting Memory Consistency Errors in MPI One-Sided Applications
The Message Passing Interface (MPI) enables nodes in a cluster to communicate with each other. The classical kind of communication in MPI is two-sided point-to-point communication: the sending node sends the message to a receiving node, which actively waits for the message to arrive. In MPI one-sided communication, on the other hand, the sending node can directly modify the memory of a target node without the target node being involved in the communication. This has the advantage that the receiving node does not have to wait for the message to arrive, but can instead continue with its computation. In modern MPI implementations, one-sided communication is achieved via Remote Direct Memory Access (RDMA).
Using MPI one-sided communication introduces a new class of errors: since the sending node can access remote memory directly, a concurrent access to the same memory location from the target node itself (or another sending node) can lead to memory inconsistencies if no proper synchronization is enforced. MC-CChecker is a correctness checking tool that tries to detect these kinds of memory inconsistencies using vector clocks to track causality between memory accesses.
The goal of the seminar thesis is to give a short overview of the different kinds of memory consistency errors that can occur in MPI one-sided communication. Then, the main concepts of the correctness checking tool MC-CChecker should be presented. Further, a literature review of approaches related to MC-CChecker should be presented.
Efficient Data Race Detection Techniques Based on Vector Clocks
Vector clocks belong to the class of so-called logical clocks used to track causality in distributed systems. For a system with n processors, each processor manages an array of n integers that represents the vector clock. Whenever an event (memory access, synchronization) occurs at a processor i, the processor increments the i-th entry of its locally stored vector clock. If the event is a synchronization event, the vector clocks are exchanged and merged between the participating processes, depending on the kind of synchronization. Based on the vector clock information, we can say for any pair (a,b) of two events in the system that "a happens before b", "b happens before a" or "a is concurrent to b". This information is particularly useful for data race detectors and concurrency bug detection tools in general.
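The update and comparison rules described above can be sketched as follows. This is a minimal illustration for two processes, assuming a simple send/receive style of synchronization in which the receiver takes the element-wise maximum of both clocks; the class and function names are hypothetical.

```python
class VectorClock:
    """Vector clock of one processor in a system with n processors."""

    def __init__(self, n, owner):
        self.owner = owner        # index i of the owning processor
        self.entries = [0] * n

    def tick(self):
        """Local event: increment the own entry and return the timestamp."""
        self.entries[self.owner] += 1
        return tuple(self.entries)

    def merge(self, received):
        """Synchronization: element-wise maximum, then count the event."""
        self.entries = [max(x, y) for x, y in zip(self.entries, received)]
        return self.tick()


def happens_before(a, b):
    """a -> b iff a is less than or equal to b in every entry and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def relation(a, b):
    """Classify two timestamps of distinct events."""
    if happens_before(a, b):
        return "a happens before b"
    if happens_before(b, a):
        return "b happens before a"
    return "a is concurrent to b"


p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
w0 = p0.tick()          # memory access on processor 0: (1, 0)
w1 = p1.tick()          # memory access on processor 1: (0, 1)
send = p0.tick()        # processor 0 sends its clock: (2, 0)
recv = p1.merge(send)   # processor 1 receives and merges: (2, 2)

print(relation(w0, w1))    # a is concurrent to b
print(relation(w0, recv))  # a happens before b
```

A data race detector would report the two accesses `w0` and `w1` as a potential race if they touch the same memory location and at least one of them is a write, because neither is ordered before the other.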
A significant drawback of the vector clock approach is that its size grows linearly with the number of processors in the system. For a system with a large number of processors, vector clocks incur high communication, storage and computation costs for data race detectors. Different techniques that try to avoid this overhead have been proposed. For example, there are approaches like FastTrack that store only certain parts of the vector clock and can nevertheless detect data races without losing precision. Another proposal is LiteRace, which has slightly less precision due to sampling of accesses, but has less runtime overhead than a complete analysis. Further, there are generic approaches proposing alternative ways of encoding the vector clocks, e.g. using prime numbers.
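The idea of storing only part of a vector clock can be sketched with an "epoch", i.e. a single (clock value, thread id) pair that records the last write to a memory location. The sketch below follows our understanding of the epoch idea used in FastTrack, but it is heavily simplified: it only checks write-write conflicts, ignores reads entirely, and all names are hypothetical.

```python
def epoch_happens_before(epoch, clock):
    """An epoch (c, t) is ordered before vector clock `clock` iff c <= clock[t].

    This is a constant-time check instead of an O(n) vector comparison.
    """
    c, t = epoch
    return c <= clock[t]


class WriteShadow:
    """Per-location metadata: only the last write, stored as one epoch."""

    def __init__(self):
        self.last_write = None  # (clock value, thread id) or None

    def on_write(self, thread, clock):
        """Record a write by `thread` whose current vector clock is `clock`.

        Returns True if the write races with the previously recorded write.
        """
        race = (self.last_write is not None
                and not epoch_happens_before(self.last_write, clock))
        self.last_write = (clock[thread], thread)
        return race


x = WriteShadow()
x.on_write(0, [1, 0])         # first write by thread 0, no race
print(x.on_write(1, [0, 1]))  # True: thread 1 never synchronized with thread 0

y = WriteShadow()
y.on_write(0, [1, 0])
print(y.on_write(1, [1, 1]))  # False: thread 1 has seen thread 0's clock value 1
```

The per-location storage is a single pair regardless of the number of threads, which illustrates where the savings over a full vector clock per location come from.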
The thesis should provide an overview of the different data race detection techniques based on vector clocks. This includes a literature review of approaches and implementations that have been proposed in the past. Further, the different techniques should be compared, in particular in terms of accuracy as well as storage, computation and communication overhead.
CIVL: Formal Verification of Parallel Programs
Verification of parallel programs via static analysis is a challenging task: besides the state explosion due to the large number of different execution paths / schedules, another problem is the variety of ways of writing parallel programs using different "dialects": MPI for distributed memory, OpenMP for shared memory, CUDA for GPUs, etc. This requires adapting verification algorithms to the syntax and semantics of each parallel programming model, and it gets even more complex if a combination of parallel programming models ("hybrid programming") is used in a single program. CIVL-C ("Concurrency Intermediate Verification Language") tackles these problems: it is a generic C language enriched with generic concurrency constructs. Programs written in any concurrency dialect (MPI, OpenMP, CUDA) can be translated to CIVL-C. Verification algorithms or tools based on CIVL-C can then handle a program in any of these dialects by running on the translated CIVL-C program. In other words, verifying a new concurrency dialect only requires writing a translator to the CIVL-C language.
The thesis should give an overview of the CIVL-C approach and highlight its strengths and weaknesses. Optionally, the student can perform their own experiments regarding the precision of the approach.