Fault-Tolerance for High-Performance and Big Data Applications: Theory and Practice
Description
Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance and big data applications, with a fair balance between theory and practice. This tutorial is organized across four main topics:
(i) Overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);
(ii) General-purpose techniques, including several checkpointing and rollback-recovery protocols, replication, failure prediction, and silent error detection;
(iii) Application-specific techniques, such as user-level in-memory checkpointing, data replication (map-reduce), or fixed-point convergence for iterative applications (back-propagation);
(iv) Practical deployment of fault-tolerance techniques with User Level Failure Mitigation (ULFM), a proposed extension to the MPI standard.
Relevant examples will include widely used routines such as Monte-Carlo methods, SPMD stencil, map-reduce, and back-propagation in neural networks. In a hands-on session, a step-by-step approach will show how to protect these routines and make them fault-tolerant using a variety of techniques.
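To give a flavor of what protecting such a routine can look like, here is a minimal sketch of in-memory checkpointing with rollback recovery applied to a Monte-Carlo estimate of pi. It is an illustration only, not material from the tutorial: the function `estimate_pi`, its parameters, and the failure injector are hypothetical, failures are simulated as fail-stop events with Exponential inter-arrival times, and "time" is measured in samples rather than seconds. The checkpoint stores the random generator state along with the counters, so a rolled-back run replays exactly the same samples.

```python
import random


def estimate_pi(total_samples, checkpoint_every, mtbf_samples=None, seed=0):
    """Monte-Carlo estimate of pi with in-memory checkpoint/rollback.

    mtbf_samples is the mean number of samples between simulated fail-stop
    failures (Exponential inter-arrival times); None disables failures.
    Returns (estimate, number_of_rollbacks).
    """
    work_rng = random.Random(seed)      # drives the computation
    fail_rng = random.Random(seed + 1)  # drives the simulated failures

    def next_failure(now):
        # Assumption: the sample counter plays the role of the clock.
        if mtbf_samples is None:
            return float("inf")
        return now + fail_rng.expovariate(1.0 / mtbf_samples)

    done, hits = 0, 0
    checkpoint = (done, hits, work_rng.getstate())
    failure_at = next_failure(done)
    rollbacks = 0

    while done < total_samples:
        if done >= failure_at:
            # Fail-stop error: discard progress since the last checkpoint,
            # restore counters and generator state, then schedule a new failure.
            done, hits, rng_state = checkpoint
            work_rng.setstate(rng_state)
            failure_at = next_failure(done)
            rollbacks += 1
            continue

        x, y = work_rng.random(), work_rng.random()
        hits += x * x + y * y <= 1.0
        done += 1

        if done % checkpoint_every == 0:
            # Periodic checkpoint of everything needed to resume.
            checkpoint = (done, hits, work_rng.getstate())

    return 4.0 * hits / total_samples, rollbacks


if __name__ == "__main__":
    clean, _ = estimate_pi(20000, checkpoint_every=1000)
    faulty, rollbacks = estimate_pi(20000, checkpoint_every=1000, mtbf_samples=2000)
    print(clean, faulty, rollbacks)
```

Because the generator state is part of the checkpoint, the run with injected failures returns the same estimate as the failure-free run and only pays for the re-executed samples. A real deployment would write checkpoints to storage that survives the failure (a peer's memory or stable storage), which this single-process sketch does not model.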
The tutorial is open to all SC23 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific and big data applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models. However, basic knowledge of MPI will be helpful for the hands-on session.
Presenters
George Bosilca
University of Tennessee
Aurélien Bouteiller
University of Tennessee
Thomas Herault
University of Tennessee
Yves Robert
ENS Lyon; University of Tennessee, Knoxville
Event Type
Tutorial
Time
Sunday, 12 November 2023, 8:30am-5pm MST
Location
405