IEEE TPDS Special Issue (October, 1999)

Challenges in Designing Fault-Tolerant Routing in Networks
Co-Guest-Editors: Jie Wu and Dharma P. Agrawal

There are several means of interconnecting the components of a computer system. They are direct connection and indirect connection. In a direct network, channels directly connect each processor. Neighboring nodes communicate with each other by sending messages across the channels. In an indirect network, processors are connected through switches and/or buses. Depending on how components are connected, we have bus-based networks, router-based networks, switch-based networks, and cluster-based (hybrid of the other three networks) networks. Another classification can be based on applications, we have local area networks (LANs), network of workstations (NOWs), multiprocessors, and mobile networks.

In a computer network, the efficiency of a routing process is critical to the system performance. A routing process deals with moving data between processors in a given network. When such a process involves more than one source and/or one destination, it is called a collective communication process.

More recently, the low cost network of workstations (NOWs) have become more or less accepted fact as a replacement of expensive supercomputers. Routing in NOWs are done primarily using a large LAN. Thus, routing in such networks have become increasingly critical for high-performance computing applications. The router design and routing techniques are altogether different for NOWs and LANs as compared to conventional multiprocessors. Support for PVM-type message communication mechanisms and their effective use have become increasingly important. Such schemes are even much more different in mobile networks because the signals have to pass through the noisy atmosphere. Also, conversion from medium access control in the air to ATM-type protocol for the backbone network has become a necessity in the communication world.

As the number of processors in a network increases, the probability of processor failure also increases. Therefore, it is important to design a routing process with fault tolerance capability to guarantee a successful routing in the presence of faults. However, techniques used to achieve fault tolerance are often at the expense of considerable performance degradation and reduction in adaptivity. In some other cases, a fault-tolerant routing is subject to deadlock and livelock if it is not carefully designed. Virtual channels and virtual networks can be used to maintain certain degree of routing adaptivity and eliminate the deadlock and livelock situation. However, these mechanisms complicate the routing process and increase the circuit complexity of the router.

Fault-tolerant routing techniques can also be classified into hardware-based and software-based. In a hardware-based fault tolerant routing, fault-tolerant routers are customized to support a selected approach, such as the addition of virtual channels and enforcement of routing restrictions. However, when the fault rates are relatively low, a software-based approach is more justifiable. In addition, the information on fault distribution also plays an important role in fault-tolerant routing. Most of these models assume that each node knows either only the neighbors' status or the status of all the nodes. A model that uses the former assumption is called local-information-based, while a model that uses the later assumption is called global-information-based. Normally, a global-information-based model can obtain an optimal or suboptimal result; however, it requires a complex process that collects global information. A coded-information-based model has been used which is a compromise between local-information and global-information-based approaches.

Fault-tolerant routing also depends on many other factors, such as switching techniques, type and nature of faults, network topology, and system port models. It also involves a set of design choices: special vs. general purpose, minimal vs. nonminimal, deterministic vs. adaptive, and redundant vs. non-redundant.

The goal of special issue is to put together some of recent results on fault-tolerant routing in networks. Researchers and practitioners working in this area have an opportunity to discuss and expression their views on the current trends, challenges, and state-of-art solutions to design fault-tolerant routing.

Contador    Since July 2003