
Program Systems: Theory and Applications, 2011 Volume 2, Issue 3, Pages 17–28 (Mi ps39)

This article is cited in 1 paper

Hardware, software and distributed supercomputer systems

T-Sim fault tolerance

E. O. Tyutlyaeva (a), A. A. Moskovskii (b)

(a) Program Systems Institute of RAS, Pereslavl'-Zalesskii, Yaroslavskaya obl.
(b) RSC SKIF, Pereslavl'-Zalesskii

Abstract: This paper addresses fault-tolerance challenges in distributed computing environments. As modern computational clusters scale up, the probability that a computation is interrupted by a fault increases. Some classes of computational algorithms, such as genetic algorithms and Monte Carlo based algorithms, have mathematical properties that allow them to produce the correct answer despite faults occurring in the system. This paper proposes methods for implementing this class of algorithms in the presence of software and hardware faults. An example of a monotonic reducing object is implemented using the T-Sim C++ template class library, and several test implementations are also presented.
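
The idea of a monotonic reducing object can be illustrated with a minimal C++ sketch. This is not the T-Sim API: the class `MonotonicMin` and the helper `reduce_surviving` are hypothetical names introduced here, and the sketch assumes that "monotonic" means the stored value can only move in one direction (here, downward) under updates. Under that assumption, a lost or duplicated update cannot corrupt the result; a fault can only leave it less refined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical illustration only; not part of the T-Sim library.
// A value that can only decrease: merge() never raises it.
template <typename T>
class MonotonicMin {
public:
    MonotonicMin() : value_(std::numeric_limits<T>::max()) {}

    // The update is idempotent, commutative and associative, so applying
    // a candidate twice, or in a different order, yields the same state.
    void merge(const T& candidate) { value_ = std::min(value_, candidate); }

    const T& get() const { return value_; }

private:
    T value_;
};

// Combine partial results from workers. alive[i] == false models a worker
// whose result was lost to a fault; its contribution is simply skipped.
// If no worker survives, the initial value (the maximum of T) is returned.
template <typename T>
T reduce_surviving(const std::vector<T>& partial,
                   const std::vector<bool>& alive) {
    assert(partial.size() == alive.size());
    MonotonicMin<T> best;
    for (std::size_t i = 0; i < partial.size(); ++i) {
        if (alive[i]) {
            best.merge(partial[i]);
        }
    }
    return best.get();
}
```

With partial results `{5, 2, 8}`, losing the second worker gives 5 instead of 2. The value is still a valid upper bound on the minimum found so far, and recomputing the lost work later (or receiving it twice) brings the object to 2 without any rollback.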

Key words and phrases: fault tolerance, T-Sim C++ template library, monotonic object, local synchronization.

UDC: 004.052.3


