Date of Award
Master of Science
In this work we address the complex problem of recovery from concurrent failures in a distributed computing environment. We propose a new checkpointing and recovery approach that enables each process to restart from its most recent checkpoint and therefore guarantees the least amount of recomputation after recovery. The proposed approach deals effectively with orphan and lost messages. We introduce two new ideas. First, the value of the common checkpointing interval is chosen so that only the messages sent in the most recent checkpoints of the processes need to be logged. Second, the lost messages are always determined a priori by the initiator process, in parallel with the normal distributed computation, so determining them does not delay recovery in any way.
This thesis is available for download only to the SIUC community. Others should
contact the interlibrary loan department of their local library.