A Reinforcement Learning-Based Backfilling Strategy for HPC Batch Jobs

SC23 Proceedings

Workshops Archive

A Reinforcement Learning-Based Backfilling Strategy for HPC Batch Jobs

Workshop: PMBS23: The 14th International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems

Authors: Elliot Kolker-Hicks and Di Zhang (University of North Carolina at Charlotte) and Dong Dai (University of North Carolina, Charlotte)

Abstract: HPC systems employ a scheduling technique called “backfilling”, wherein low-priority jobs are scheduled earlier to use the available resources that are waiting for the pending high-priority jobs. Backfilling relies on job runtime to calculate the start time of the ready-to-schedule jobs and avoid delaying them. It is a common belief that better estimations of job runtime will lead to better backfilling and more effective scheduling. However, our experiments show a different conclusion: there is a missing trade-off between prediction accuracy and backfilling opportunities. To learn how to achieve the best trade-off, we believe reinforcement learning (RL) can be effectively leveraged. Based on this idea, we designed RLBackfilling, a reinforcement learning based backfilling algorithm. Our evaluation results show up to 17x better scheduling performance compared to EASY backfilling using user-provided job runtime and 4.7x better performance comparing with EASY using the ideal predicted job runtime (the actual job runtime).

Back to PMBS23: The 14th International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems Archive Listing

Back to Full Workshop Archive Listing