A Theory of Fault-Tolerant Adaptive and Minimal Routing
in Mesh-Connected Multicomputers
This project aims to develop a general theory of fault-tolerant adaptive
and minimal routing in k-ary n-dimensional meshes. This theory can be
extended to k-ary n-dimensional hypercubes which are meshes with wraparounds.
This theory is based on a fault model and an information model proposed
in this project. Unlike many existing fault models, the proposed fault
model can handle boundary faulty nodes easily without disabling too many
non-faulty nodes. As opposed to some of the existing fault-tolerant routing approaches
where fault information is carried by the header of the routing message,
the proposed approach is based on coded neighborhood information maintained
at each processor. Each node in a mesh-connected multicomputer collects
and distributes fault information concurrently but in a decentralized
way. This collection and distribution process exhibits desirable properties
of self-stabilizing, self-optimizing, and self-healing. Unlike many existing
information models that require each node to have knowledge of the entire
system, the coded fault information associated with each node represents
limited-global information by exploring the locality of disturbances in the
system. Locality of fault information also ensures the scalability of
the proposed approach. Insights gained in this project will shed new light
on constructing fault-tolerant, adaptive, minimal, and deadlock-free routing
in mesh-connected multicomputers.
Since July 2003
|