NSF Research Grant
(May 1998 - May 2001)

A Theory of Fault-Tolerant Adaptive and Minimal Routing in Mesh-Connected Multicomputers

This project aims to develop a general theory of fault-tolerant adaptive and minimal routing in k-ary n-dimensional meshes. This theory can be extended to k-ary n-dimensional hypercubes which are meshes with wraparounds. This theory is based on a fault model and an information model proposed in this project. Unlike many existing fault models, the proposed fault model can handle boundary faulty nodes easily without disabling too many non-faulty nodes. As opposed to some of the existing fault-tolerant routing approaches where fault information is carried by the header of the routing message, the proposed approach is based on coded neighborhood information maintained at each processor. Each node in a mesh-connected multicomputer collects and distributes fault information concurrently but in a decentralized way. This collection and distribution process exhibits desirable properties of self-stabilizing, self-optimizing, and self-healing. Unlike many existing information models that require each node to have knowledge of the entire system, the coded fault information associated with each node represents limited-global information by exploring the locality of disturbances in the system. Locality of fault information also ensures the scalability of the proposed approach. Insights gained in this project will shed new light on constructing fault-tolerant, adaptive, minimal, and deadlock-free routing in mesh-connected multicomputers.

Contador    Since July 2003