Spatial Support Vector Regression to Mitigate Silent Errors in the Exascale Era

Speaker: Leonardo Bautista Gomez (ANL)
Date: Thursday, 3 December 2015, 10:30-12:00
Session: Resilience I
Talk type: Project talk (30 min)

Abstract: As the Exascale era approaches, the growing capacity of high performance computing (HPC) systems and their targeted power and energy budgets pose challenges for reliability. In particular, silent data corruptions (SDCs), or silent errors, corrupt the results of HPC applications without being noticed. Consequently, they become a significant threat to the correctness of these applications' computations. In this work, we re-purpose and redesign epsilon-insensitive support vector machine regression to detect and correct SDCs that occur in HPC applications which can be characterized by an error bound. Our design uses spatial features, i.e. neighboring data values, as training data and as a result incurs low memory overhead. Experimental results show that our detector achieves more than 90% recall and a false positive rate of less than 1% in most cases. Moreover, our detector incurs low performance overhead for all benchmarks studied. Comparison to other state-of-the-art techniques indicates that our detector provides the best trade-off between detection performance and incurred overhead.
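
Illustration: the detection idea in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. It fits a linear predictor with the epsilon-insensitive loss used by support vector regression, trained by plain subgradient descent, and leaves out the regularization term and kernels of a full SVR. The 1D sine field, the function names, the choice of the two immediate neighbors as features, the values of eps, lr and bound, and the rule of repairing only the point with the largest residual are all assumptions made for the example.

```python
import math

def make_samples(field):
    """Pair each interior point with its left and right neighbors (spatial features)."""
    return [((field[i - 1], field[i + 1]), field[i])
            for i in range(1, len(field) - 1)]

def train_eps_insensitive(samples, eps=0.01, lr=0.002, max_epochs=20000):
    """Fit value ~ w1*left + w2*right + b with the epsilon-insensitive loss.

    Assumption: plain subgradient descent, no regularizer, no kernel.
    """
    w1 = w2 = b = 0.0
    for _ in range(max_epochs):
        updated = False
        for (left, right), target in samples:
            r = w1 * left + w2 * right + b - target
            if abs(r) <= eps:
                continue  # inside the epsilon tube: zero loss, no update
            step = lr if r > 0 else -lr
            w1 -= step * left
            w2 -= step * right
            b -= step
            updated = True
        if not updated:
            break  # every training residual is within eps
    return w1, w2, b

def residuals(model, field):
    """Absolute prediction error at every interior point."""
    w1, w2, b = model
    return {i: abs(w1 * field[i - 1] + w2 * field[i + 1] + b - field[i])
            for i in range(1, len(field) - 1)}

def detect(model, field, bound=0.05):
    """Indices whose value deviates from the prediction by more than the bound."""
    return sorted(i for i, r in residuals(model, field).items() if r > bound)

def correct(model, field, bound=0.05):
    """Replace the most suspicious value with its prediction (hypothetical rule)."""
    res = residuals(model, field)
    worst = max(res, key=res.get)
    if res[worst] <= bound:
        return None, list(field)
    w1, w2, b = model
    repaired = list(field)
    repaired[worst] = w1 * field[worst - 1] + w2 * field[worst + 1] + b
    return worst, repaired

# Smooth synthetic data standing in for one snapshot of an application field.
clean = [math.sin(0.1 * i) for i in range(200)]
model = train_eps_insensitive(make_samples(clean))

# Inject one silent error, then detect and repair it.
corrupted = list(clean)
corrupted[77] += 0.5
suspect, repaired = correct(model, corrupted)
print(suspect, abs(repaired[77] - clean[77]))
```

Because a corrupted value also feeds the predictions of its neighbors, adjacent points tend to be flagged as well, which is why this sketch repairs only the point with the largest residual. The model itself is three numbers, which is in the spirit of the low memory overhead the abstract describes.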

Last Modified: 18.11.2022