Publication

Evaluierung von Optimierungsstrategien zur Datensatzspeicherung für machinelles Lernen auf HPC Systemen

  • Evaluating Optimization Strategies for Dataset Storage for Machine Learning Workloads on HPC Systems

Mainka, Irmin; Müller, Matthias S. (Thesis advisor); Kunkel, Julian (Thesis advisor); Viehhauser, Dominik (Consultant)

Aachen : RWTH Aachen University (2025)
Bachelor Thesis

Bachelor's thesis, RWTH Aachen University, 2025

Abstract

Traditional machine learning datasets used to train models are often stored as a large number of small files. This property hinders their widespread use on HPC systems because of the way parallel filesystems work. Several other ways to store such datasets can be found in the areas of both HPC and Python programming. In this thesis, strategies for both storing and loading datasets are tested in experiments that focus on training an image classification model. The strategies include NumPy arrays, LMDB, HDF5, and Zarr. The results are then used to evaluate how the different strategies compare to each other. The goal of this thesis is to either find a performant strategy that uses fewer files or validate the strategy of using many small files.
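
The idea of replacing many small files with fewer, larger ones can be illustrated with a minimal sketch. It is not taken from the thesis and does not use any of the formats evaluated there (NumPy arrays, LMDB, HDF5, Zarr); it is a standard-library-only illustration that packs all samples into one file with an offset table, so a loader opens a single file instead of one file per sample. The function names `pack_samples` and `read_sample` and the file layout are hypothetical.

```python
import os
import struct
import tempfile


def pack_samples(samples, path):
    """Write all samples into one file: sample count, offset table, payloads.

    Hypothetical layout (an assumption for illustration only):
      8 bytes            number of samples (little-endian uint64)
      16 bytes / sample  (start, length) relative to the payload area
      payloads           raw sample bytes, back to back
    """
    offsets = []
    position = 0
    for blob in samples:
        offsets.append((position, len(blob)))
        position += len(blob)

    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(samples)))
        for start, length in offsets:
            f.write(struct.pack("<QQ", start, length))
        for blob in samples:
            f.write(blob)


def read_sample(path, index):
    """Read one sample by index using the offset table, without scanning the file."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("<Q", f.read(8))
        if not 0 <= index < count:
            raise IndexError(index)
        # Jump to the table entry for this sample.
        f.seek(8 + 16 * index)
        start, length = struct.unpack("<QQ", f.read(16))
        # Payloads begin right after the header and the full offset table.
        f.seek(8 + 16 * count + start)
        return f.read(length)
```

With this layout the filesystem handles one file regardless of the number of samples, and random access needs only two seeks. The container formats named in the abstract offer similar indexed access together with features such as chunking, compression, or transactions.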

Institutions

  • IT Center [022000]
  • Department of Computer Science [120000]
  • Chair of High Performance Computing (Computer Science 12) [123010]