Fault Tolerance in Distributed Systems: Global States and Checkpointing

Because processors behave autonomously and communication delays are arbitrary, no single processor in a distributed system can capture the complete system state instantaneously. Gathering process state information from different processors, together with channel states, may therefore be required to solve many problems in distributed systems. An algorithm for gathering information from the whole system is called a global state detection algorithm; the information gathered by such an algorithm is called a global state. This paper describes the classification of global states, algorithms for detecting global states, and fault-tolerant schemes based on coordinated checkpointing, which establishes useful global states.

1. Introduction

2. Space Time Model of Distributed Computations

3. The Classification of Global States

4. Global State Detection Algorithms

5. Checkpointing Strategy

6. Requirements for Efficient Checkpointing and Recovery

7. Tightly Coordinated Checkpointing and Recovery

8. Loosely Coordinated Checkpointing and Recovery

9. Conclusion
