SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Checkpoint/Restart for CUDA Kernels


Workshop: Fourth International Symposium on Checkpointing for Supercomputing (SuperCheck-SC23)

Authors: Niklas Eiling, Stefan Lankes, and Antonello Monti (RWTH Aachen University)


Abstract: In HPC clusters, it has become common to employ Checkpoint/Restart (C/R), that is, saving the execution state of applications so that their computational progress can be restored at a later point in time. For clusters, the benefits of this technique include greater flexibility in reacting to changing workloads and increased fault tolerance. While many clusters already benefit from C/R tools for traditional CPU applications, comparable tools that enable preemptive and transparent C/R are lacking for heterogeneous computing, where applications execute partly on accelerator devices such as GPUs. This gap persists despite the increasing use of GPUs as accelerators in HPC clusters. Therefore, we propose a novel C/R tool that saves the execution state of CUDA kernels, thus allowing preemptive C/R of GPU applications. We show that full-featured C/R for NVIDIA GPUs is possible despite the proprietary nature of the hardware and software of these devices.




