SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Doctoral Showcase Archive

Enabling Reproducibility and Scalability of Scientific Workflows in HPC and Cloud

Author: Paula Olaya (University of Tennessee)

Advisor: Michela Taufer (ACM, University of Tennessee)

Abstract: Scientific communities across fields like earth science, biology, and materials science increasingly run complex workflows for their scientific discovery. We work closely with these communities to leverage high-performance computing (HPC), big data analytics, and artificial intelligence/machine learning (AI/ML) to increase and accelerate their workflows’ productivity. Our work addresses the new challenges brought about by this optimization process.

We identify three main challenges in these workflows: i) they integrate AI/ML methods with limited transparency and include many interoperable components (data and applications) that are hard to trace and reuse to reproduce results; ii) they hide the complexity of large intermediate data and their overall execution can be affected by the I/O bandwidth of the underlying infrastructure; and iii) they run on heterogeneous and distributed infrastructure with data and application dependencies that require efficient data management and resource allocation.

To address these challenges, we provide solutions that leverage the convergence between high-performance and cloud computing. First, we design and develop fine-grained containerized environments that enable data traceability and results explainability by automatically annotating and seamlessly attaching provenance information. Second, since the workflows are already containerized, we integrate them in HPC and native-cloud infrastructure and tune the storage technology to enable better I/O and data scalability. Finally, we orchestrate the end-to-end execution of workflows, ensuring efficient allocation of infrastructure resources and intermediate data management, and supporting reproducibility and reusability of workflows’ executions.

Thesis Canvas: pdf

Back to Doctoral Showcase Archive Listing