Abstract
Resilience is an important research topic in HPC. As computer clusters go to extreme scales, work in this area is necessary to keep these machines reliable.
In this work, we introduce a generic method to endow iterative algorithms in linear algebra based on sparse matrix-vector products, such as linear system solvers, eigensolvers and similar, with resilience to node failures. This generic method traverses the dependency graph of the variables of the iterative algorithm. If the iterative method exhibits certain properties, it is possible to produce an exact state reconstruction (ESR) algorithm, enabling the recovery of the state of the iterative method in the event of a node failure. This reconstruction is exact, except for small perturbations caused by floating-point arithmetic. The generic method exploits redundancy in the matrix-vector product to protect the vector that is the argument of the product.
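The redundancy mentioned above can be illustrated with a small sketch. It assumes a row-wise block distribution in which each node owns a set of rows of the matrix and the matching entries of the vector p; this layout, and the helper names `spmv_copies` and `unprotected`, are illustrative assumptions and are not taken from the paper. Under that assumption, computing y = A p forces a node to fetch every remote entry of p that its rows touch, so those entries end up stored on more than one node. The sketch only identifies which entries get such copies and which would need an extra redundant copy; it is not the paper's algorithm.

```python
def spmv_copies(pattern, owner):
    """For each index j, collect the non-owner nodes that receive p[j]
    while computing y = A p under a row-wise block distribution.

    pattern: dict mapping row i -> column indices j with A[i][j] != 0
    owner:   owner[k] is the node that owns row k and entry p[k]
    (Both the layout and the names are hypothetical, for illustration.)
    """
    copies = {j: set() for j in range(len(owner))}
    for i, cols in pattern.items():
        for j in cols:
            if owner[i] != owner[j]:
                # The node owning row i must fetch p[j] from another node,
                # so a copy of p[j] now also lives on owner[i].
                copies[j].add(owner[i])
    return copies


def unprotected(copies):
    """Indices of p that no other node receives during the product."""
    return sorted(j for j, nodes in copies.items() if not nodes)


# Tridiagonal 6x6 sparsity pattern, three nodes with two rows each.
n = 6
pattern = {i: [j for j in (i - 1, i, i + 1) if 0 <= j < n] for i in range(n)}
owner = [0, 0, 1, 1, 2, 2]

copies = spmv_copies(pattern, owner)
needs_extra_copy = unprotected(copies)
```

In this toy layout the entries at the partition boundaries are copied to a neighbouring node as a side effect of the product, while the entries listed in `needs_extra_copy` would have to be sent somewhere explicitly to survive the loss of their owner.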
We illustrate the use of this generic approach on three iterative methods: the conjugate gradient method, the BiCGStab method and the Lanczos algorithm. The resulting ESR algorithms enable the reconstruction of their state after a node failure from a few redundantly stored vectors.
Unlike previous work in preconditioned conjugate gradient, this generic method produces ESR algorithms that work with general matrices. Consequently, we can no longer assume that local diagonal submatrices used to reconstruct vectors are nonsingular. Thus, we also propose an approach for deriving nonsingular local linear systems for the reconstruction process with reduced condition numbers, based on a communication-avoiding rank-revealing QR factorization with column pivoting.
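To see why nonsingularity of the local diagonal submatrix matters, consider a minimal sketch of a local reconstruction step. It assumes the standard residual relation r = b - A x of the conjugate gradient method and assumes that the lost block of r has already been recovered by other means. Restricting the relation to the lost index set I, with J the surviving indices, gives A_II x_I = b_I - r_I - A_IJ x_J. The function names, the dense list-of-lists storage and the tiny example are hypothetical choices for illustration; the QR-based approach proposed in the paper is not reproduced here.

```python
def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda row: abs(aug[row][k]))
        if abs(aug[piv][k]) < 1e-14:
            # This is the failure mode for general matrices: A_II may be singular.
            raise ValueError("local submatrix is (numerically) singular")
        aug[k], aug[piv] = aug[piv], aug[k]
        for row in range(k + 1, n):
            f = aug[row][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[row][c] -= f * aug[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        s = sum(aug[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (aug[k][n] - s) / aug[k][k]
    return z


def reconstruct_lost_entries(A, b, r_lost, x, lost):
    """Recover x[i] for i in `lost` from A_II x_I = b_I - r_I - A_IJ x_J.

    r_lost holds the residual entries for the lost rows, in the order of `lost`.
    Entries of x at lost positions are never read.
    """
    lost = list(lost)
    lost_set = set(lost)
    kept = [j for j in range(len(b)) if j not in lost_set]
    A_II = [[A[i][j] for j in lost] for i in lost]
    rhs = [
        b[i] - r_lost[k] - sum(A[i][j] * x[j] for j in kept)
        for k, i in enumerate(lost)
    ]
    return solve(A_II, rhs)


# Toy example: 4x4 tridiagonal matrix, entries 1 and 2 of x are lost.
A = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
b = [1.0, 1.0, 1.0, 1.0]
x_true = [1.0, 2.0, 3.0, 4.0]
r = [b[i] - sum(A[i][j] * x_true[j] for j in range(4)) for i in range(4)]

lost = [1, 2]
x_damaged = [1.0, None, None, 4.0]
recovered = reconstruct_lost_entries(A, b, [r[i] for i in lost], x_damaged, lost)
```

If the rows and columns in `lost` happen to form a singular block, `solve` raises an error, which is the situation a differently chosen, nonsingular local system is meant to avoid.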
Original language  English

Title  Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale
Pages  41-50
DOIs  
Publication status  Published - 2020
ÖFOS 2012
 102023 Supercomputing
Projects
 1 Finished

REPEAL: Resilience versus Performance in Numerical Linear Algebra
Gansterer, W. & FrolikSteffan, U.
1/03/16 → 31/08/20
Project: Research funding