Vestnik Yuzhno-Ural'skogo Universiteta. Seriya Matematicheskoe Modelirovanie i Programmirovanie

Vestnik YuUrGU. Ser. Mat. Model. Progr., 2010 Issue 6, Pages 91–103 (Mi vyuru231)

On program restoration from checkpoints set

A. Y. Polyakov

Rzhanov Institute of Semiconductor Physics, Siberian Branch of Russian Academy of Sciences, Novosibirsk

Abstract: This paper describes two approaches to the problem of restoring distributed programs from a set of checkpoints. A node-wide algorithm is proposed that recreates parent-child relationships and process group/session assignments at restore time. A coordinated algorithm is also designed for restoring a set of processes from several nodes/terminals. The described algorithms are implemented in the checkpointing package DMTCP (Distributed MultiThreaded CheckPointing).

Keywords: HPC, rollback-recovery, checkpointing, fault tolerance.

UDC: 004.451

Received: 16.04.2010


