TaskVine: Managing In-Cluster Storage for High-Throughput Data Intensive Workflows

SC23 Proceedings

Workshop: The 18th Workshop on Workflows in Support of Large-Scale Science (WORKS23) - Part 2 of 2

Authors: Barry Sly-Delgado, Thanh Son Phung, Colin Thomas, David Simonetti, Andrew Hennessee, Ben Tovar, and Douglas Thain (University of Notre Dame)

Abstract: Many scientific applications are expressed as high-throughput workflows that consist of large graphs of data assets and tasks to be executed on large parallel and distributed systems. A challenge in executing these workflows is managing data: both datasets and software must be efficiently distributed to cluster nodes; intermediate data must be conveyed between tasks; output data must be delivered to its destination. Scaling problems result when these actions are performed in an uncoordinated manner on a shared filesystem. To address this problem, we introduce TaskVine: a system for exploiting the aggregate local storage and network capacity of a large cluster. TaskVine tracks the lifetime of data in a workflow, from archival sources to final outputs, making use of local storage to distribute and re-use data. We describe the architecture and novel capabilities of TaskVine, and demonstrate its use with applications in genomics, high energy physics, molecular dynamics, and machine learning.