Workshop: 13th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2023)
Authors: Quentin Barbut and Anne Benoit (ENS Lyon); Thomas Herault (University of Tennessee); Yves Robert (ENS Lyon, University of Tennessee); and Frédéric Vivien (INRIA)
Abstract: Consider an application executing for a fixed duration. The checkpoint duration is a random variable that obeys a well-known probability distribution. The question is when to take a checkpoint towards the end of the execution so that the expected amount of work done is maximized. In the first scenario, a checkpoint can be taken at any time. We provide the optimal solution for a variety of probability distributions modeling checkpoint duration. In the second scenario, the application is a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy, where the optimal number of tasks to execute before the checkpoint is computed at the beginning of the execution. Then, we design a dynamic strategy that decides, at the end of each task, whether to checkpoint or to continue execution.
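Illustration (not from the paper): the first scenario can be sketched numerically under a simplified model that is assumed here, not taken from the abstract. Suppose the execution lasts T time units, the checkpoint starts at time x, and the work done up to x is saved only if the checkpoint duration C satisfies C <= T - x. The expected saved work is then x * P(C <= T - x). The function names, the Uniform(a, b) choice for C, and the grid search are all hypothetical; the paper's own model and solutions may differ.

```python
def uniform_cdf(t, a, b):
    """CDF of a Uniform(a, b) checkpoint duration (assumed distribution)."""
    if t <= a:
        return 0.0
    if t >= b:
        return 1.0
    return (t - a) / (b - a)


def expected_saved_work(x, T, cdf):
    """Expected work saved when the checkpoint starts at time x.

    Assumed model: the x units of work are saved only if the checkpoint
    completes within the remaining T - x time units.
    """
    return x * cdf(T - x)


def best_checkpoint_start(T, cdf, steps=10_000):
    """Grid search over [0, T] for the start time maximizing expected saved work."""
    best_x, best_val = 0.0, 0.0
    for i in range(steps + 1):
        x = T * i / steps
        val = expected_saved_work(x, T, cdf)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


if __name__ == "__main__":
    T = 10.0
    x, val = best_checkpoint_start(T, lambda t: uniform_cdf(t, 1.0, 7.0))
    print(f"start checkpoint at x = {x:.3f}, expected saved work = {val:.3f}")
```

Under this assumed model with T = 10 and C uniform on [1, 7], the search lands near x = 4.5, which matches maximizing x(T - x - a)/(b - a) by hand. Starting later saves more work if the checkpoint succeeds, but lowers the probability that it finishes in time.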