Design and Validation of Portable Communication Infrastructure for Fault-Tolerant Cluster Middleware
Yuval Tamir, Ming Li, Wenchao Tao, Daniel Goldberg, Israel Hsu

We describe the design, implementation, and validation of the communication infrastructure (CI) for fault-tolerant cluster middleware. The CI supports, and provides optimized interfaces for, two distinct classes of communication: communication by applications and communication by the cluster management middleware. The CI is designed for portability and for efficient operation on top of modern user-level message-passing mechanisms. We present a functional fault model for the CI and show how platform-specific faults map to this model. Based on the fault model, we have developed a fault injection scheme that is integrated with the CI and is thus portable across different communication technologies. The CI and the associated fault injection scheme have been implemented as part of a larger fault-tolerant cluster project. The fault injection scheme is used to validate and evaluate the implementation of both the CI itself and the cluster management middleware in the presence of communication faults.
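As an illustration of why injecting faults at the level of a functional fault model can be portable, the following minimal Python sketch wraps an arbitrary platform-specific send function and injects faults above it. Everything named here is an assumption made for illustration: the fault categories (drop, duplicate, corrupt), the `FaultInjector` class, and its `inject_send` method are hypothetical and are not taken from the CI described above, whose actual fault model and interfaces may differ.

```python
import random
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Fault(Enum):
    # Hypothetical functional fault types, chosen only for illustration.
    NONE = "none"
    DROP = "drop"
    DUPLICATE = "duplicate"
    CORRUPT = "corrupt"


@dataclass
class FaultInjector:
    """Injects functional faults above a platform-specific send function."""
    send: Callable[[bytes], None]      # the underlying transport (assumed interface)
    rates: Dict[Fault, float]          # per-message probability of each fault
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    log: List[Fault] = field(default_factory=list)

    def _pick(self) -> Fault:
        # Select a fault by walking the cumulative probability distribution.
        r = self.rng.random()
        cumulative = 0.0
        for fault, p in self.rates.items():
            cumulative += p
            if r < cumulative:
                return fault
        return Fault.NONE

    def inject_send(self, msg: bytes) -> Fault:
        fault = self._pick()
        self.log.append(fault)
        if fault is Fault.DROP:
            return fault                # the message is never handed to the transport
        if fault is Fault.CORRUPT and msg:
            # Invert one randomly chosen byte; an empty message is sent unchanged.
            i = self.rng.randrange(len(msg))
            msg = msg[:i] + bytes([msg[i] ^ 0xFF]) + msg[i + 1:]
        self.send(msg)
        if fault is Fault.DUPLICATE:
            self.send(msg)              # deliver a second copy
        return fault
```

Because the injector depends only on a `send` callable, the same sketch could in principle sit on top of any transport that exposes one, which is the property the abstract attributes to integrating fault injection with the CI rather than with a particular communication technology.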