SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis


Fluxion: A Scalable Graph-Based Resource Model for HPC Scheduling Challenges


Workshop: The 18th Workshop on Workflows in Support of Large-Scale Science (WORKS23) - Part 2 of 2

Authors: Tapasya Patki (Lawrence Livermore National Laboratory); Dong Ahn (NVIDIA Corporation); Daniel Milroy, Jae-Seung Yeom, Jim Garlick, and Mark Grondona (Lawrence Livermore National Laboratory); Stephen Herbein (NVIDIA Corporation); and Thomas Scogland (Lawrence Livermore National Laboratory)


Abstract: The current era of exascale supercomputing and the emergence of a computing continuum present several significant resource management challenges. These include, but are not limited to, management of complex scientific workflows, diverse resources such as power, elasticity in user jobs, and converged environments. The resource models that underpin today's job scheduling frameworks reflect the node- (or core-) centric system architectures prevalent when the frameworks were designed. Consequently, they are not suited to capturing resource relationships or dynamism. This greatly limits their applicability to the emerging multifaceted challenges in high-performance computing (HPC) and other converged environments. To overcome these challenges, we propose a scalable graph-based resource model that can represent complex, changing resource relationships and multiple containment hierarchies. We implement this model, Fluxion, in a production-quality framework and evaluate its performance. Additionally, we present emerging and advanced scheduling use cases that our model enables.




