Fault Tolerance in Distributed Systems: Global States and Checkpointing
- Hoseo University, Research Institute of Industrial Technology
- Journal of Industrial Technology Research
- Vol. 16, No. 1
- 1997.12, pp. 481-498 (18 pages)
Because processors in a distributed system behave autonomously and communication delays are arbitrary, no single processor can capture the complete system state instantaneously. Solving many problems in distributed systems may therefore require gathering process-state information from different processors together with the states of the channels. An algorithm for gathering information from the whole system is called a global state detection algorithm; the information gathered by such an algorithm is called a global state. This paper describes the classification of global states, algorithms for detecting global states, and fault-tolerant schemes based on coordinated checkpointing, which establishes useful global states.
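To make the idea of a global state assembled from process states and channel states concrete, here is a minimal sketch in Python. It is not taken from the paper: it assumes the standard notion of a consistent cut (no message is recorded as received unless its send is also recorded), and the names `Message`, `is_consistent`, and `channel_state` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message; event indices are 1-based
    # positions in each process's local event history.
    sender: str
    send_event: int
    receiver: str
    recv_event: int


def is_consistent(cut, messages):
    """Check a candidate global state for consistency.

    `cut` maps each process name to the number of local events included
    in its recorded state. The state is inconsistent if some message is
    recorded as received while its send is not recorded.
    """
    for m in messages:
        received = m.recv_event <= cut[m.receiver]
        sent = m.send_event <= cut[m.sender]
        if received and not sent:
            return False
    return True


def channel_state(cut, messages):
    """Return messages in transit: sent inside the cut, received outside it."""
    return [
        m for m in messages
        if m.send_event <= cut[m.sender] and m.recv_event > cut[m.receiver]
    ]


# P1 sends at its 2nd event; P2 receives at its 1st event.
messages = [Message("P1", 2, "P2", 1)]

orphan_cut = {"P1": 1, "P2": 1}    # receive recorded, send not recorded
transit_cut = {"P1": 2, "P2": 0}   # send recorded, receive not yet recorded
```

Under these assumptions, `is_consistent(orphan_cut, messages)` is false because the cut records a receive without its send, while `transit_cut` is consistent and `channel_state(transit_cut, messages)` contains the single in-transit message. Coordinated checkpointing, as described in the abstract, aims to record only states of the consistent kind.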
1. Introduction
2. Space Time Model of Distributed Computations
3. The Classification of Global States
4. Global State Detection Algorithms
5. Checkpointing Strategy
6. Requirements for Efficient Checkpointing and Recovery
7. Tightly Coordinated Checkpointing and Recovery
8. Loosely Coordinated Checkpointing and Recovery
9. Conclusion