IEEE TPDS Special Issue (October, 1999)
Challenges in Designing Fault-Tolerant Routing in Networks
Co-Guest-Editors: Jie Wu and Dharma P. Agrawal
There are several means of interconnecting the components of a computer
system. They are direct connection and indirect connection. In a direct
network, channels directly connect each processor. Neighboring nodes communicate
with each other by sending messages across the channels. In an indirect
network, processors are connected through switches and/or buses. Depending
on how components are connected, we have bus-based networks, router-based
networks, switch-based networks, and cluster-based (hybrid of the other
three networks) networks. Another classification can be based on applications,
we have local area networks (LANs), network of workstations (NOWs), multiprocessors,
and mobile networks.
In a computer network, the efficiency of a routing process is critical
to the system performance. A routing process deals with moving data between
processors in a given network. When such a process involves more than
one source and/or one destination, it is called a collective communication
process.
More recently, the low cost network of workstations (NOWs) have become
more or less accepted fact as a replacement of expensive supercomputers.
Routing in NOWs are done primarily using a large LAN. Thus, routing in
such networks have become increasingly critical for high-performance computing
applications. The router design and routing techniques are altogether
different for NOWs and LANs as compared to conventional multiprocessors.
Support for PVM-type message communication mechanisms and their effective
use have become increasingly important. Such schemes are even much more
different in mobile networks because the signals have to pass through
the noisy atmosphere. Also, conversion from medium access control in the
air to ATM-type protocol for the backbone network has become a necessity
in the communication world.
As the number of processors in a network increases, the probability of
processor failure also increases. Therefore, it is important to design
a routing process with fault tolerance capability to guarantee a successful
routing in the presence of faults. However, techniques used to achieve
fault tolerance are often at the expense of considerable performance degradation
and reduction in adaptivity. In some other cases, a fault-tolerant routing
is subject to deadlock and livelock if it is not carefully designed. Virtual
channels and virtual networks can be used to maintain certain degree of
routing adaptivity and eliminate the deadlock and livelock situation.
However, these mechanisms complicate the routing process and increase
the circuit complexity of the router.
Fault-tolerant routing techniques can also be classified into hardware-based
and software-based. In a hardware-based fault tolerant routing, fault-tolerant
routers are customized to support a selected approach, such as the addition
of virtual channels and enforcement of routing restrictions. However,
when the fault rates are relatively low, a software-based approach is
more justifiable. In addition, the information on fault distribution also
plays an important role in fault-tolerant routing. Most of these models
assume that each node knows either only the neighbors' status or the status
of all the nodes. A model that uses the former assumption is called local-information-based,
while a model that uses the later assumption is called global-information-based.
Normally, a global-information-based model can obtain an optimal or suboptimal
result; however, it requires a complex process that collects global information.
A coded-information-based model has been used which is a compromise between
local-information and global-information-based approaches.
Fault-tolerant routing also depends on many other factors, such as switching
techniques, type and nature of faults, network topology, and system port
models. It also involves a set of design choices: special vs. general
purpose, minimal vs. nonminimal, deterministic vs. adaptive, and redundant
vs. non-redundant.
The goal of special issue is to put together some of recent results on
fault-tolerant routing in networks. Researchers and practitioners working
in this area have an opportunity to discuss and expression their views
on the current trends, challenges, and state-of-art solutions to design
fault-tolerant routing.
Since July 2003
|