Disk Failure Trends in Alpine Storage System

SC23 Proceedings

Workshop: 13th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2023)

Authors: Anjus George, Jesse Hanley, and Sarp Oral (Oak Ridge National Laboratory (ORNL))

Abstract: Large-scale HPC systems demand extensive disk-based storage for the data generated by HPC applications, which in turn requires scalable reliability, availability, and failure management. Failure data extracted from HPC storage offers valuable insights for preventing and managing failures: it helps in understanding storage robustness, guiding system design and deployment, and creating durable data protection schemes. This paper introduces a failure dataset from Alpine, the file system of OLCF's Summit supercomputer, encompassing 4,000+ events over 2.75 years from 32,000+ disks. Before the analysis, we describe Alpine's components and introduce IBM Spectrum Scale technology; we then assess the collected data for failure distribution and burst correlations. We infer that proximity to enclosure fan modules heightens disk failure rates. The burst failure analysis also shows that one-third of failures occur in bursts, with 90% of these not spatially correlated and impacting multiple racks.
