Resilient Programming Models for One-Sided Communications
People
Supervisor
Description
Checkpoint-restart is commonly used to provide resilience to fail-stop faults (e.g. node failures) in HPC applications. However, the system's mean time to failure shortens as system size grows, so checkpoint-restart does not scale: eventually the entire system memory cannot be checkpointed in the interval between failures.
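The scaling argument can be made concrete with a rough back-of-the-envelope model. The sketch below is illustrative only: it assumes independent node failures at a constant rate (so system MTTF is node MTTF divided by node count) and a shared file system whose aggregate bandwidth stays fixed as the machine grows. The function names and every number (node MTTF, memory per node, bandwidth) are hypothetical and do not come from any particular system.

```python
# Rough model of checkpoint time versus system mean time to failure (MTTF).
# All default values are illustrative assumptions, not measurements.

NODE_MTTF_HOURS = 5 * 365 * 24   # assumed: one failure per node every 5 years
MEM_PER_NODE_GB = 64.0           # assumed memory to checkpoint per node
IO_BANDWIDTH_GB_S = 1000.0       # assumed fixed aggregate file system bandwidth


def system_mttf_hours(node_mttf_hours: float, n_nodes: int) -> float:
    """System MTTF, assuming independent node failures at a constant rate."""
    return node_mttf_hours / n_nodes


def checkpoint_hours(mem_per_node_gb: float, n_nodes: int,
                     io_bandwidth_gb_s: float) -> float:
    """Time to write the entire system memory to shared storage."""
    total_gb = mem_per_node_gb * n_nodes
    return total_gb / io_bandwidth_gb_s / 3600.0


def checkpoint_fits(n_nodes: int,
                    node_mttf_hours: float = NODE_MTTF_HOURS,
                    mem_per_node_gb: float = MEM_PER_NODE_GB,
                    io_bandwidth_gb_s: float = IO_BANDWIDTH_GB_S) -> bool:
    """True if a full checkpoint completes within one mean failure interval."""
    return (checkpoint_hours(mem_per_node_gb, n_nodes, io_bandwidth_gb_s)
            < system_mttf_hours(node_mttf_hours, n_nodes))


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        mttf = system_mttf_hours(NODE_MTTF_HOURS, n)
        ckpt = checkpoint_hours(MEM_PER_NODE_GB, n, IO_BANDWIDTH_GB_S)
        print(f"{n:>7} nodes: MTTF {mttf:8.2f} h, checkpoint {ckpt:6.2f} h, "
              f"fits: {checkpoint_fits(n)}")
```

Under these assumptions the checkpoint time grows linearly with node count while the failure interval shrinks inversely, so the two curves must cross at some system size. Real systems differ in detail (bandwidth usually grows somewhat with the machine, and checkpoints can be incremental or multi-level), but the qualitative trend is the one described above.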
Alternative models such as MPI User-Level Failure Mitigation [1] and Resilient X10 [2] have not addressed one-sided communication, which creates particular challenges for maintaining correctness and progress in the presence of process failures.
This work could start from a baseline of either MPI-3 [3] or GASNet [4] and define the control flow, update semantics and recovery operations needed for resilient operation in the presence of arbitrary process failures.
Goals
Requirements
Background Literature
[1] Bland et al. (2012) An evaluation of user-level failure mitigation support in MPI
[2] Cunningham et al. (2014) Resilient X10: efficient failure-aware programming
[3] Gerstenberger, Besta, and Hoefler (2014) Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
[4] GASNet: Global Address Space Networking