SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

When to Checkpoint at the End of a Fixed-Length Reservation?


Workshop: 13th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2023)

Authors: Quentin Barbut and Anne Benoit (ENS Lyon); Thomas Herault (University of Tennessee); Yves Robert (ENS Lyon, University of Tennessee); and Frédéric Vivien (INRIA)


Abstract: Consider an application executing for a fixed duration. The checkpoint duration is a random variable that obeys some well-known probability distribution. The question is when to take a checkpoint towards the end of the execution so that the expected amount of work done is maximized.

In the first scenario, a checkpoint can be taken at any time, and we provide the optimal solution for a variety of probability distributions modeling the checkpoint duration. In the second scenario, the application is a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy in which the optimal number of tasks to execute before the checkpoint is computed at the beginning of the execution. Then, we design a dynamic strategy that decides, at the end of each task, whether to checkpoint or to continue execution.
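As an illustration only, the two scenarios in the abstract can be sketched numerically. The model below is an assumption made for this sketch and is not taken from the paper: the reservation has length R, work counts only if the checkpoint completes before the reservation ends, and all function names (best_margin, uniform_cdf, best_static_task_count) are hypothetical. For the first scenario, starting the checkpoint x time units before the end saves R - x units of work with probability P(C <= x); for the static strategy of the second scenario, a Monte Carlo estimate stands in for whatever analytical method the authors use.

```python
import math
import random

def expected_saved_work(R, x, cdf):
    # Assumed model: the checkpoint starts x before the end of the reservation;
    # R - x units of work are saved if it finishes in time, otherwise nothing is.
    return (R - x) * cdf(x)

def best_margin(R, cdf, steps=10_000):
    """Grid search over the margin x in [0, R]; returns (x, expected saved work)."""
    candidates = (R * i / steps for i in range(steps + 1))
    best_x = max(candidates, key=lambda x: expected_saved_work(R, x, cdf))
    return best_x, expected_saved_work(R, best_x, cdf)

def uniform_cdf(a, b):
    """CDF of a checkpoint duration uniformly distributed on [a, b]."""
    return lambda x: min(1.0, max(0.0, (x - a) / (b - a)))

def best_static_task_count(R, sample_task, sample_ckpt, max_tasks,
                           trials=20_000, seed=0):
    """Monte Carlo estimate of the best fixed number of tasks before checkpointing.

    sample_task and sample_ckpt are hypothetical callables taking a
    random.Random instance and returning one duration each.
    """
    rng = random.Random(seed)
    best_n, best_val = 0, 0.0
    for n in range(1, max_tasks + 1):
        total = 0.0
        for _ in range(trials):
            work = sum(sample_task(rng) for _ in range(n))
            # Work is kept only if tasks plus checkpoint fit in the reservation.
            if work + sample_ckpt(rng) <= R:
                total += work
        val = total / trials
        if val > best_val:
            best_n, best_val = n, val
    return best_n, best_val

if __name__ == "__main__":
    # Scenario 1 with uniform and exponential checkpoint durations.
    print(best_margin(100.0, uniform_cdf(2.0, 10.0)))
    print(best_margin(100.0, lambda x: 1.0 - math.exp(-0.5 * x)))
    # Scenario 2, static strategy, with exponential task and checkpoint times.
    print(best_static_task_count(
        100.0,
        sample_task=lambda rng: rng.expovariate(1 / 10.0),
        sample_ckpt=lambda rng: rng.expovariate(1 / 5.0),
        max_tasks=15,
        trials=2_000,
    ))
```

The grid search and the simulation are deliberately naive; they only make the trade-off concrete: checkpointing too late risks losing everything, while checkpointing too early gives up work that could still have been done. The dynamic strategy is not sketched here, since the abstract does not give enough detail about its decision rule.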




