Scalable Fine-Grained Gang Scheduling for HPC Systems with Unreliable Broadcast Synchronization Mechanisms

Authors: Hiroki Ohtsuji, Erika Hayashi, Reika Kinoshita, Masahiro Miwa, and Eiji Yoshida (Fujitsu Ltd)

Abstract: Demand for interactivity on HPC systems is increasing, driven primarily by new HPC users from the AI/ML research area. Traditional HPC users are accustomed to waiting for job execution on a batch scheduling system, whereas new users prefer an interactive interface such as Jupyter Notebook. Meeting these evolving requirements calls for better interactivity, and fine-grained gang scheduling is one potential solution. This poster presents a scalable inter-node synchronization mechanism that uses broadcast communication to deliver synchronization messages with closely aligned timing across nodes, enabling fine-grained gang scheduling in HPC systems. When two parallel applications were executed simultaneously on 128 computing nodes with a 500 ms time slice, the mechanism improved application performance by 2.7 times compared with the existing implementation.

