Publication

A comprehensive data analytics framework to support research data management in distributed systems

Yazdi, Mohammad Amin; Müller, Matthias S. (Thesis advisor); Decker, Stefan Josef (Thesis advisor)

Aachen : RWTH Aachen University (2023, 2024)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2023

Abstract

Effective research data management (RDM) practices are essential for fostering research collaboration, increasing the discoverability and repurposing of research data, and advancing scientific progress in higher education. In recent years, the adoption of OSPs and the FAIR data principles has highlighted the need for improved RDM methodologies and tools to support achievements in higher education. However, existing literature provides limited guidance on monitoring RDM processes, their adoption, and their use. This dissertation addresses this gap by investigating how process-aware RDM activities can be discovered and enhanced by modeling researchers' actual underlying practices.

This dissertation presents a series of methodologies as a framework combining data acquisition, abstraction, knowledge discovery, and operation enhancement techniques. The case studies highlight the challenges associated with RDM-related activities by assessing the validity of the proposed methodologies in real-world environments. First, this work presents a universal reference software architecture for RDM services. It then proposes four approaches for data acquisition, including a novel Hybrid logger technique for acquiring datasets from information systems that operate in distributed settings; this technique provides a comprehensive view of user activities by evaluating the corresponding software component executions. This approach enables a projection of user behavior and facilitates the development of further machine-learning studies. Furthermore, this work introduces a semi-supervised learning approach for abstracting datasets that accommodates non-sequential events in distributed systems while balancing data granularity and model fitness. The methodology for discovering process-aware activities incorporates a modular and layered architecture, providing insights into RDM compliance, identifying deviations, and optimizing user experience. Additionally, it outlines a method for determining and visualizing user and system interactions and for discovering the RDM phases of research projects, providing a practical understanding of the progression and activities of different research groups.

Finally, this thesis proposes and evaluates two recommender systems, demonstrating the potential of Content-Based and Collaborative Filtering recommender systems in enabling the reusability of research data repositories and fostering cooperation among researchers. The findings contribute significantly to the expanding body of literature on RDM and provide valuable insights into the potential of the presented methodologies for enhancing RDM practices in OSPs.

In conclusion, this dissertation offers holistic strategies for addressing the difficulties of facilitating RDM in OSPs, provides guidelines for implementing the necessary architecture, and demonstrates the applicability of the proposed methods to other RDM services that adhere to the reference software architecture of RDM systems.

Institutions

  • Department of Computer Science [120000]
  • Chair of High Performance Computing (Computer Science 12) [123010]