Checkpoint/Restart for CUDA Kernels

SC23 Proceedings

Workshop: Fourth International Symposium on Checkpointing for Supercomputing (SuperCheck-SC23)

Authors: Niklas Eiling, Stefan Lankes, and Antonello Monti (RWTH Aachen University)

Abstract: In HPC clusters, it has become common to employ Checkpoint/Restart (C/R), that is, saving the execution state of applications so that their computational progress can be restored at a later point in time. For clusters, the benefits of this technique include more flexibility when reacting to changing workloads and increased fault tolerance. While many clusters already benefit from C/R tools for traditional CPU applications, comparable tools that enable preemptive and transparent C/R for heterogeneous computing, where applications execute partly on accelerator devices such as GPUs, are lacking. This is despite the increasing use of GPUs as accelerators in HPC clusters. Therefore, we propose a novel C/R tool that saves the execution state of CUDA kernels, thus allowing preemptive C/R of GPU applications. We show that full-featured C/R for NVIDIA GPUs is possible despite the proprietary nature of the hardware and software of these devices.
