Workshop: 13th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2023)
Authors: Kurt Ferreira (Sandia National Laboratories, University of New Mexico) and Scott Levy (Sandia National Laboratories)
Abstract: Fault tolerance remains a key challenge for current high-performance computing systems. Scheduling mitigation methods effectively and efficiently continues to be a critical issue because error rates on many systems are dynamic and difficult to predict. Using failure data from the Astra supercomputer, we examine the efficacy of a simple method for determining whether a sliding window of recent failures contains an unusual pattern of errors. Specifically, we investigate using Benford’s Law to predict the likelihood that the system is currently in a period of unusual failure occurrences. Although still in its initial stages, this work provides a critical analysis of failure status for extreme-scale systems and a simple form of prediction for determining when the scheduling of failure mitigation may be suboptimal and needs to be reevaluated because of the unusual pattern of errors that is occurring.
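As a rough illustration of the idea described in the abstract, the sketch below scores a window of positive values by how far its leading-digit distribution departs from Benford’s Law, P(d) = log10(1 + 1/d), using a chi-squared statistic. The abstract does not say which quantity is tested or how deviation is measured, so the use of inter-failure times as input, the chi-squared score, and the names `benford_expected`, `leading_digit`, and `window_chi2` are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def benford_expected():
    """Benford's Law: probability that the leading digit is d, for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant digit of a positive number."""
    if x <= 0:
        raise ValueError("leading_digit requires a positive value")
    # Scientific notation always starts with the first significant digit.
    return int(f"{x:.15e}"[0])


def window_chi2(values):
    """Chi-squared distance between a window's leading digits and Benford's Law.

    `values` is a hypothetical window of measurements, for example recent
    inter-failure times in seconds (an assumption, not taken from the source).
    Larger scores mean the window looks less Benford-like.
    """
    positives = [v for v in values if v > 0]
    n = len(positives)
    if n == 0:
        raise ValueError("window contains no positive values")
    counts = Counter(leading_digit(v) for v in positives)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )


if __name__ == "__main__":
    # Powers of two are a classic sequence whose leading digits follow Benford's Law.
    typical = [2.0 ** k for k in range(100)]
    # A window where every value starts with 9 is far from Benford-like.
    unusual = [9.0] * 100
    print("typical window score:", window_chi2(typical))
    print("unusual window score:", window_chi2(unusual))
```

In a monitoring setting, one might recompute such a score each time the window slides and flag the window when the score exceeds a chosen threshold; the threshold and window length would need to be tuned against real failure logs.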