Evaluating Optimization Strategies for Dataset Storage for Machine Learning Workloads on HPC Systems
Mainka, Irmin; Müller, Matthias S. (Thesis advisor); Kunkel, Julian (Thesis advisor); Viehhauser, Dominik (Consultant)
Aachen : RWTH Aachen University (2025)
Bachelor Thesis
Bachelor thesis, RWTH Aachen University, 2025
Abstract
Traditional machine learning datasets used to train models are often stored as a large number of small files. This property hinders their widespread use on HPC systems because of the way parallel filesystems work. Several alternative ways to store such datasets can be found in both HPC and Python programming. This thesis tests strategies for both storing and loading datasets in experiments that focus on training an image classification model. The strategies include NumPy arrays, LMDB, HDF5, and Zarr. The results are then used to evaluate how the different strategies compare to each other. The goal of this thesis is either to find a performant strategy that uses fewer files or to validate the strategy that uses many small files.
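The contrast between many small files and a single consolidated file can be illustrated with a minimal sketch. It is not taken from the thesis: the synthetic 8x8 "images", the file names, and the helper `pack_samples` are assumptions made for illustration, and only the NumPy variant is shown. It writes one file per sample, stacks the samples into a single array file, and then memory-maps that file for reading.

```python
import tempfile
from pathlib import Path

import numpy as np


def pack_samples(sample_dir: Path, out_file: Path) -> int:
    """Stack every per-sample .npy file in sample_dir into one array file.

    Hypothetical helper for illustration; assumes all samples share one shape.
    """
    paths = sorted(sample_dir.glob("*.npy"))
    stacked = np.stack([np.load(p) for p in paths])
    np.save(out_file, stacked)
    return len(paths)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample_dir = root / "samples"
    sample_dir.mkdir()

    # Synthetic stand-in for an image dataset: 16 tiny 8x8 RGB "images",
    # each stored in its own file (the many-small-files layout).
    rng = np.random.default_rng(0)
    images = rng.integers(0, 256, size=(16, 8, 8, 3), dtype=np.uint8)
    for i, img in enumerate(images):
        np.save(sample_dir / f"{i:04d}.npy", img)

    # Consolidate all samples into a single file.
    packed_file = root / "packed.npy"
    n_packed = pack_samples(sample_dir, packed_file)

    # Memory-map the packed file so single samples can be read
    # without loading the whole array into memory.
    packed = np.load(packed_file, mmap_mode="r")
    packed_shape = packed.shape
    roundtrip_ok = bool(np.array_equal(packed, images))
    del packed  # release the memory map before the temp dir is removed
```

In this sketch the 16 sample files are replaced by one file on disk, which reduces the number of files the filesystem has to manage. Whether this layout is actually faster for training is the kind of question the thesis investigates experimentally.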
Institutions
- IT Center [022000]
- Department of Computer Science [120000]
- Chair of High Performance Computing (Computer Science 12) [123010]