ARTICLE
TITLE

NR-MPI: A Non-stop and Fault Resilient MPI Supporting Programmer Defined Data Backup and Restore for E-scale Super Computing Systems

Suo Guang    

Abstract

Fault resilience has become a major issue for HPC systems, particularly in view of future E-scale systems, which will consist of millions of CPU cores and other components. MPI-level fault tolerant constructs, such as ULFM, are being proposed to support software-level fault tolerance. However, there are few systematic evaluations by application programmers using benchmarks or pseudo applications. This paper proposes NR-MPI, a Non-stop and Fault Resilient MPI that supports programmer-defined data backup and restore. To help programmers write fault tolerant programs, NR-MPI provides a set of friendly programming interfaces and a state transition diagram for data backup and restore. This paper focuses on the design, implementation and evaluation of NR-MPI. Specifically, it emphasizes failure detection in the MPI library, the friendly programming interface extensions of NR-MPI, and examples of fault tolerant programs based on NR-MPI. Furthermore, to support failure recovery of applications, NR-MPI implements its data backup interfaces on top of double in-memory checkpoint/restart. We conduct experiments with both the NPB benchmarks and Sweep3D on the TH supercomputer at NSCC-TJ. Experimental results show that NR-MPI based fault tolerant programs can recover from failures online without restarting, and that the overhead is small even for applications running on tens of thousands of cores.
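
The abstract does not spell out how double in-memory checkpoint/restart works, so the following is only a minimal sketch under one common reading of the term: each rank keeps one copy of its checkpoint in its own memory and a second copy in the memory of a buddy rank. Ranks are simulated in a single Python process rather than with MPI, and every name here (`Rank`, `buddy_of`, `backup`, `fail`, `restore`) is hypothetical and is not part of the NR-MPI interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rank:
    """One simulated process (hypothetical; not an NR-MPI type)."""
    rank_id: int
    state: List[float]
    local_copy: Optional[List[float]] = None  # checkpoint of this rank's own state
    buddy_copy: Optional[List[float]] = None  # checkpoint held on behalf of another rank
    alive: bool = True


def buddy_of(rank_id: int, size: int) -> int:
    """Rank that stores the second copy of rank_id's checkpoint (assumed ring layout)."""
    return (rank_id + 1) % size


def backup(ranks: List[Rank]) -> None:
    """Store every rank's state twice: once locally, once in its buddy's memory."""
    size = len(ranks)
    for r in ranks:
        r.local_copy = list(r.state)
        ranks[buddy_of(r.rank_id, size)].buddy_copy = list(r.state)


def fail(ranks: List[Rank], rank_id: int) -> None:
    """Simulate a crash: the rank loses its state and every copy it was holding."""
    r = ranks[rank_id]
    r.alive = False
    r.state = []
    r.local_copy = None
    r.buddy_copy = None


def restore(ranks: List[Rank], failed_id: int) -> None:
    """Rebuild the failed rank from its buddy and roll survivors back to the same checkpoint."""
    size = len(ranks)
    buddy = ranks[buddy_of(failed_id, size)]
    if not buddy.alive or buddy.buddy_copy is None:
        raise RuntimeError("no surviving checkpoint copy for rank %d" % failed_id)

    # A replacement process takes over the failed rank's identity and data.
    replacement = ranks[failed_id]
    replacement.alive = True
    replacement.state = list(buddy.buddy_copy)
    replacement.local_copy = list(buddy.buddy_copy)

    # Survivors roll back to their own local copy so all ranks resume from one checkpoint.
    for r in ranks:
        if r.rank_id != failed_id:
            r.state = list(r.local_copy)

    # Re-create the second copy that the failed rank was holding for its predecessor.
    predecessor = ranks[(failed_id - 1) % size]
    replacement.buddy_copy = list(predecessor.local_copy)


if __name__ == "__main__":
    ranks = [Rank(i, [float(i)] * 3) for i in range(4)]
    backup(ranks)
    for r in ranks:
        r.state = [x + 10.0 for x in r.state]  # work done after the checkpoint
    fail(ranks, 2)
    restore(ranks, 2)
    print([r.state for r in ranks])
```

In this sketch a single failure is always recoverable because the two copies of a checkpoint never live on the same rank. Losing a rank and its buddy at the same time is not recoverable, which is why `restore` raises an error in that case.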
