Workshop: Fourth International Symposium on Checkpointing for Supercomputing (SuperCheck-SC23)
Authors: Yao Xu, Leonid Belyaev, and Twinkle Jain (Northeastern University); Derek Schafer (University of New Mexico); Anthony Skjellum (Tennessee Tech University); and Gene Cooperman (Northeastern University)
Abstract: This work presents experience with traditional use cases of checkpointing on a novel platform. A single codebase (MANA) transparently checkpoints production workloads for major, available MPI implementations: "develop once, run everywhere". The new platform allows application developers to compile their application against any of the available standards-compliant MPI implementations, and test each MPI implementation according to performance or other features.
Since its original academic prototype, MANA has been under development for three of the past four years and is planned to enter full production at NERSC in early fall 2023. To the best of the authors' knowledge, MANA is currently the only production-capable, system-level checkpointing package running on a large supercomputer (Perlmutter at NERSC) using a major MPI implementation (HPE Cray MPI). Experiments are presented on several large production workloads, showing low runtime overhead with one codebase supporting four MPI implementations: HPE Cray MPI, MPICH, Open MPI, and ExaMPI.