Centralized system management is becoming increasingly difficult and thus should inevitably rely on autonomic computing. However, how might we incorporate fault tolerance into the autonomic-computing environment?
Information systems are becoming more and more complicated. First, most of today's information systems are networked—stand-alone systems without networking no longer make sense.
Second, information systems are growing significantly in size. They include countless resources (such as computing and storage units) and communication facilities, which may be allocated over networks.
Third, the territory that information systems cover is becoming obscure, making them boundaryless. Because resources connect to networks and are shared through, say, the Internet across enterprises, a networked information system's scale and configuration can change dynamically, even during operation. Furthermore, because of these systems' giant scale and complicated configuration, we can no longer overlook hardware or software malfunctions. Damage incurred by such malfunctions could spread over networks, affecting almost all aspects of society.
Finally, information systems are exhibiting concurrent evolution. Large information systems often concurrently add and alter hardware equipment and facilities as well as functions, with nonstop service provision. So, a system can expand even during normal operation.
These trends are accelerating to the point where they'll soon pose serious challenges to system management. System management monitors system performance and adapts and optimizes its parameters to dynamic system changes. However, owing to the systems' boundaryless nature and continuous expansion, we might not be able to employ centralized system management in such unsteady environments. We thus need to rely on autonomic computing.
Autonomic computing distributes computing units over networks, has those units cooperate and collaborate with one another, and has no centralized management mechanism. A typical example is grid computing [1]. Furthermore, autonomic computing can share computing power with different sites, even across enterprises.
However, what must we consider if we hope to incorporate fault tolerance—crucial to system operation—into the autonomic-computing environment?
Consider the operational model of autonomic computing [1] in Figure 1a. Computing service providers such as a miner factory, a database factory, and database service providers are distributed over the (broadband) Internet. In addition, local community registries are allocated in a distributed way.
Figure 1. (a) An operational model of autonomic computing (the numbers correspond to operations discussed in the main text); (b) fault tolerance, showing triplication of the registry and duplication of the service providers.
Assume a user wants to create a specific database. Autonomic computing would perform the following operations in response to this request (see Figure 1a):
1. The user asks the community registry (to which he or she belongs) about available computing service providers that meet the user's requirements (such as cost, location, and performance).
2. The registry returns IDs of available and healthy (fault-free) providers, referring to the health table it maintains. Of course, faulty providers are excluded at this time. The registry must maintain and update the health and functionality information of each of the community's service providers. If necessary, the community registry can relay a user's request to other remote registries.
3. The user issues requests to the indicated providers, specifying detailed conditions such as which operations to perform, which forms should keep the results, and initial lifetimes (duration of the user's or application's interest in pursuing the target).
4. After successfully negotiating the initiating communication, the service providers create service instances, invoking necessary operations with appropriate initial states, resources, lifetimes, and so forth. As I mentioned before, faulty providers don't participate in the services. The service providers notify the requesting user when normal operations are complete.
5. The operation invoked at the service providers may initiate queries to appropriate remote functions. Alternatively, these queries may be carried out through the community registry.
6. Remote functions return results to an appropriate service provider.
7. Meanwhile, depending on the initially negotiated lifetimes, the user might need to issue periodic keep-alive messages to the service providers to indicate continued interest.
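Steps 1 and 2 above can be sketched as follows. This is only a minimal illustration under stated assumptions, not an implementation the article prescribes: the `Provider` and `CommunityRegistry` names, their fields, and the cost-based requirement are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    # Hypothetical record; the article does not define these fields.
    provider_id: str
    function: str
    cost: float


@dataclass
class CommunityRegistry:
    providers: dict = field(default_factory=dict)     # id -> Provider
    health_table: dict = field(default_factory=dict)  # id -> True if healthy

    def register(self, provider, healthy=True):
        self.providers[provider.provider_id] = provider
        self.health_table[provider.provider_id] = healthy

    def lookup(self, function, max_cost):
        """Step 2: return IDs of healthy providers that meet the requirements."""
        return sorted(
            pid
            for pid, p in self.providers.items()
            if self.health_table.get(pid, False)
            and p.function == function
            and p.cost <= max_cost
        )


registry = CommunityRegistry()
registry.register(Provider("db-1", "database", cost=5.0))
registry.register(Provider("db-2", "database", cost=3.0), healthy=False)
registry.register(Provider("db-3", "database", cost=9.0))
registry.register(Provider("miner-1", "miner", cost=2.0))

# Step 1: the user asks for database providers costing at most 6.
candidates = registry.lookup("database", max_cost=6.0)
```

Here the faulty provider is excluded by the health table and the expensive one by the user's requirement, so only one candidate remains.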
Now let's consider how we'd incorporate fault tolerance into this environment.
The community registry plays an important role in incorporating fault tolerance. As I described earlier, the registry must maintain a table (information) that specifies which community members (that is, service providers associated with the registry) are healthy and which are ill (faulty). The registry must therefore know when a community member is faulty. This is problematic for two reasons.
First, how does the registry find an abnormality? A faulty member could somehow sound an alarm that the registry would recognize. The registry could then continuously monitor each community member's health, provided that steady links (or lines) exist between the registry and each of its community members. However, providing a dynamically changing autonomic-computing environment with steady links is difficult. Instead of continuous monitoring, the registry should rely on periodic checks by issuing inquiries or sending mobile agents to gather the information. But then the health table isn't always updated, and the registry could give a user improper information. (As I describe later, initial negotiation between the user and service providers could help solve this problem.)
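One way to picture the weakness of periodic checks is to attach the time of the last check to every health-table entry. The sketch below is an assumption for illustration only: the `HealthTable` class and the `max_age` threshold are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class HealthTable:
    """Health entries refreshed by periodic checks (hypothetical sketch)."""

    def __init__(self, max_age):
        self.max_age = max_age   # seconds an entry is trusted after a check
        self.entries = {}        # provider id -> (healthy, time of last check)

    def record_check(self, provider_id, healthy, now):
        self.entries[provider_id] = (healthy, now)

    def status(self, provider_id, now):
        """Return 'healthy', 'faulty', or 'unknown' if the entry is stale."""
        if provider_id not in self.entries:
            return "unknown"
        healthy, checked_at = self.entries[provider_id]
        if now - checked_at > self.max_age:
            return "unknown"
        return "healthy" if healthy else "faulty"


table = HealthTable(max_age=30)
table.record_check("db-1", healthy=True, now=0)
table.record_check("db-2", healthy=False, now=0)
```

Between two checks the registry can only report what it last saw, so an entry that looks healthy may already be out of date. Marking old entries as unknown does not remove that window; it only bounds it.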
Second, when a fault occurs in the registry, how should we deal with it? One way is to prepare a spare copy of the registry. The faulty registry might indicate its abnormality using some error-detecting mechanism. Then, the user could query the spare registry. To realize this, each service provider would need to make multiple registrations of both the registry and its spare. Furthermore, the spare registry must also maintain and update the health table. Then, the community's structure and necessary procedures would become prohibitively intricate. In this case, we'd need to consider implementing strong fault tolerance of the registry itself. If necessary, each registry should exploit triplication and majority voting.
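Triplication with majority voting can be sketched in a few lines. This is a simplified illustration, not a design from the article: each registry replica is modelled as a plain dictionary standing in for its health table, and the function names are hypothetical.

```python
from collections import Counter


def majority(answers):
    """Return the value given by a strict majority of replicas, else None."""
    value, count = Counter(answers).most_common(1)[0]
    return value if count > len(answers) / 2 else None


def query_triplicated_registry(replicas, provider_id):
    # Ask every replica and keep only the answer most of them agree on.
    answers = [replica.get(provider_id) for replica in replicas]
    return majority(answers)


replica_a = {"db-1": "healthy", "db-2": "faulty"}
replica_b = {"db-1": "healthy", "db-2": "faulty"}
replica_c = {"db-1": "faulty", "db-2": "faulty"}   # this replica is faulty

replicas = [replica_a, replica_b, replica_c]
answer = query_triplicated_registry(replicas, "db-1")
```

With three replicas, a single faulty replica is outvoted by the other two, so the user never needs the faulty replica to report its own abnormality.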
In an autonomic-computing environment, we assume that numerous service providers exist with similar functionality over the network. So, unlike the case of the registry, we could easily find healthy service providers that could take over faulty providers. When a requested service provider isn't healthy, the registry suggests an alternative provider. Therefore, we need duplicated service providers or a sufficient number of service providers with similar functionality (not necessarily the same) in the community area. In this sense, we might not need strong fault tolerance for service providers; instead, we could apply a simple replacement method. An error-detection mechanism is necessary in any case.
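The simple replacement method might look like the following sketch. The function name and the two dictionaries are hypothetical, and "similar functionality" is reduced here to an exact match on a function name, which is a stronger condition than the text requires.

```python
def choose_provider(requested, providers, health):
    """Return the requested provider if healthy, else a healthy alternative.

    providers maps id -> function name; health maps id -> True if healthy.
    Returns None when no healthy provider offers the function.
    """
    if health.get(requested, False):
        return requested
    wanted = providers[requested]
    for pid in sorted(providers):
        if pid != requested and providers[pid] == wanted and health.get(pid, False):
            return pid
    return None


providers = {"db-1": "database", "db-2": "database", "miner-1": "miner"}
health = {"db-1": False, "db-2": True, "miner-1": True}
chosen = choose_provider("db-1", providers, health)
```

The sketch still depends on the health information being correct, which is why an error-detection mechanism remains necessary.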
Figure 1 b shows the triplication for the registry and the duplication for the service providers.
Notification of an Abnormality
To replace a faulty service provider, we must first recognize the abnormality. However, as I already noted, service providers have difficulty continuously monitoring each other using steady lines when network configurations dynamically change. We should thus rely on inquiries sent through networks to check service providers in an autonomic-computing environment. When initiating communication, a requesting provider could discover an abnormality by noticing an alarm or noting abnormal behavior in the communication protocol. So, we could readily recognize a service provider's abnormality when it's idle.
A more serious problem in autonomic computing is how a requesting provider recognizes the requested provider's abnormality if a fault occurs after the initial communication or during service operation. Could the requested provider warn others of the fault if operating under an abnormal condition? Furthermore, the requested providers might not respond to the inquiry properly. In some cases, watch-dog timers might be useful but still not enough.
So, we should implement fault-tolerant communication as well as robust exception handling in each service provider. Alternatively, we must find a way for service providers to notify others of their abnormality without relying on the service provider's operation.
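A watch-dog timer of the kind mentioned above can be sketched as follows. The class is hypothetical and takes the current time as an argument so the example is deterministic; a real deployment would read a clock.

```python
class WatchdogTimer:
    """Flags a provider as suspect when it stays silent for too long."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_reset = now

    def reset(self, now):
        """Called whenever the monitored provider shows a sign of life."""
        self.last_reset = now

    def expired(self, now):
        return now - self.last_reset > self.timeout


watchdog = WatchdogTimer(timeout=5.0, now=0.0)
watchdog.reset(now=4.0)   # the provider answered an inquiry at t = 4
```

Such a timer catches only silence. A provider that keeps responding but returns wrong results never lets the timer expire, which is one reason the timer alone is not enough.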
Consensus Among Neighbors
Assume that service providers constitute a group of neighbors. When a service provider behaves abnormally, its healthy neighbors assume that something is wrong. They conclude that such a service provider isn't their counterpart and exclude it from the group.
More precisely, let's say a group consists of three service providers of the same functionality (see Figure 2a). These neighbors aren't necessarily close to each other geographically; they might be remote in the network. Each receives the same request, and they operate in a loose synchronism (using software or through some other appropriate way). So, each neighbor computes the same result when all service providers are healthy. Also, they exchange their results so that each neighbor knows whether they all show the same result.
Figure 2. (a) Consensus among service providers and (b) renewing a group after identifying a faulty service provider.
Now let's assume that one group member develops faults. Then, at least two healthy neighbors reach the same conclusion; this is the consensus. All three members will transfer their results to the triplicate members of the next group or to a single user. Because each service provider in the next group or a single user receives the computational result threefold, each member of the group or the user can determine the correct result by some adjudication, say, majority voting. Therefore, the faulty service provider doesn't need to notify others of its abnormality. Furthermore, this wouldn't require a fault tolerance measure, except some mechanism for adjudicating the results. This is one advantage of the consensus approach.
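The adjudication step can be sketched as a majority vote that also names the dissenting member. The `adjudicate` function and the provider IDs are hypothetical, and results are assumed to be directly comparable values.

```python
from collections import Counter


def adjudicate(results):
    """results maps provider id -> computed result.

    Returns (agreed_result, suspects). agreed_result is None when no strict
    majority exists, in which case every member is a suspect.
    """
    value, count = Counter(results.values()).most_common(1)[0]
    if count <= len(results) / 2:
        return None, sorted(results)
    suspects = sorted(pid for pid, r in results.items() if r != value)
    return value, suspects


agreed, suspects = adjudicate({"p1": 42, "p2": 42, "p3": 41})
```

The receiver obtains both the agreed result and the identity of the provider that disagreed, without that provider having to report anything itself.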
The service provider recognized as faulty would be excluded from the group, and its neighbors would invite a new service provider to join them (see Figure 2b) [2].
Because a message passes between two successive groups, each with three members, a single message can require up to nine transmissions. This is obviously a problem. Because this communication cost is not negligible, the consensus approach should be reserved for mission-critical applications where fault tolerance is particularly important and fast real-time operation is required.
Duplication and recovery could reduce the communication quantity. Let's say a group comprises two service providers (see Figure 3a). The same operation will be performed in the duplicated service providers with some synchronism. Each service provider in Group A will send its computational result to each service provider in Group B. So, each service provider in Group B receives the duplicated results and compares them.
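A quick count makes the saving concrete. The sketch below simply enumerates sender and receiver pairs between two successive groups; the provider names are hypothetical, and the count ignores any exchange of results inside a group.

```python
from itertools import product


def transmissions(senders, receivers):
    """List every (sender, receiver) pair when each sender sends its result
    to each receiver of the next group."""
    return list(product(senders, receivers))


triplicated = transmissions(["a1", "a2", "a3"], ["b1", "b2", "b3"])
duplicated = transmissions(["a1", "a2"], ["b1", "b2"])
```

Triplicated groups need nine transmissions per message, whereas duplicated groups need four during normal operation. The extra traffic for recovery arises only when a discrepancy is detected.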
When a member of Group B detects a discrepancy between the paired results, it warns its counterpart, if necessary (see Figure 3b). The member then asks a service provider of Group C, which precedes Group B by two stages, to give its result to a third service provider (different from those of Group A) and to invoke the same operation there. The third service provider sends its computed result to each Group B service provider. Each Group B service provider then holds three copies of the computational result and restores the correct one using majority voting.
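From the point of view of one Group B member, duplication and recovery can be sketched as follows. This is an illustration under assumptions: `receive` is a hypothetical name, and `recompute` is a hypothetical callable standing in for the request to Group C that makes a third provider repeat the operation.

```python
from collections import Counter


def receive(result_a1, result_a2, recompute):
    """Compare the paired results; on a discrepancy, obtain a third result.

    Returns (result, recovered), where recovered tells whether the third
    computation was needed.
    """
    if result_a1 == result_a2:
        return result_a1, False          # normal operation, no extra traffic
    third = recompute()                  # recovery path via Group C
    value, count = Counter([result_a1, result_a2, third]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among the three results")
    return value, True


calls = []


def third_provider():
    calls.append("recompute")            # record that recovery was invoked
    return 10


normal = receive(10, 10, third_provider)
recovered = receive(10, 7, third_provider)
```

The third computation is requested only when the pair disagrees, which is what keeps the normal-case communication lower than in the triplicated scheme.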
Figure 3. Duplication and recovery (a) during normal operation and (b) when an abnormality occurs.
Making this approach available would require a technique for maintaining a computation history, for example, noting where the computation occurred and what service providers invoked it. Such information should be transferred to service providers in the succeeding group or maintained in the registry [3].
Tolerance to Design Faults and Intrusion
Most software malfunctions are caused by design faults; another advantage of duplication and recovery is the possibility of alleviating such faults. If we could have a variety of service providers of different hardware and software implementations, we could benefit from the design diversity in the duplicated operation. That is, a design fault might also manifest itself in the discrepancy between the paired results, as with hardware faults. The design fault could be recovered by design diversity together with duplication and recovery.
System intrusion is another serious threat. The Code Red and Sircam worms caused an estimated US$3.9 billion of damage (compare this to the US$15.8 billion estimated for restoring IT and communication facilities lost during the 9/11 attack) [4]. Furthermore, network-based processing is intrinsically vulnerable to such attacks. We must seriously consider the danger of intrusion. And because we can consider intrusion to be an intentional and malicious fault, realizing intrusion-tolerant systems is an urgent need even in the context of fault tolerance [5]. Encryption techniques can do little against intrusion. Design and geographical diversity, together with duplication and recovery, might be one effective and realistic way to detect intrusion and remove its adverse effects.
If we look at the distribution of an extremely large number of computing service providers, the duplication and recovery idea seems feasible in the near future.
Learning From Different Disciplines and Past Research
Human societies in which individuals live rather independently exemplify autonomic computing. In such societies, proper communication helps maintain stability and encourage cooperation. Then, is it enough to establish a standardized communication protocol among service providers in autonomic-computing environments?
Systems theory says that a group of rather independent members cooperate well with each other when a mechanism exists for clearly defining the group benefit and for rewarding members who cooperate. To make autonomic computing successful, we must provide such a mechanism for autonomic-computing environments. So, we should learn from different disciplines, such as autonomous robots or sociology, to develop effective and efficient ways of performing user applications in the autonomic-computing environment.
We could also benefit from reviewing past studies. Essentially, message-driven processing in autonomic computing is the same as the operation of dataflow machines, studied actively in the 1980s. We could find new possibilities for fault tolerance in autonomic-computing environments by looking back at the techniques of dataflow machines [3].
The operational model of Figure 1 is a compromise between centralized and completely distributed approaches. This is because the function the registry performs could be regarded as a kind of central management, while operations of services are distributed over service providers. When an individual cooperates daily with others in society, he or she does not ask an organization like the registry whether to participate but rather makes his or her own decisions. Ultimately, the registry's function should also be distributed over service providers.
We have yet to determine the exact solution to incorporating fault tolerance into autonomic computing, but, as I've shown, many possibilities exist.
References
- [1] I. Foster et al., "Grid Services for Distributed System Integration," Computer, vol. 35, no. 6, June 2002, pp. 37-46; http://csdl.computer.org/comp/mags/co/2002/06/r6037abs.htm.
- [2] Y. Yoneda et al., "Fault Diagnosis and System Reconfiguration of Fault-Tolerant System Based on Majority-Voting," Trans. Inst. Electronics and Comm. Engineers of Japan (IECE), vol. J67-D, no. 7, July 1984, pp. 737-744 (in Japanese).
- [3] Y. Tohma, M. Chinuki, and K. Kimura, Experiment of Recovery Mechanism of Data-Flow Machines, tech. report FTS89-13, Inst. Electronics, Information and Communication Engineers of Japan, 1989.
- [4] A. Housholder et al., "Computer Attack Trends Challenge Internet Security," Security and Privacy (supplement to Computer, vol. 35, no. 4), Apr. 2002, pp. 5-7; www.computer.org/security/supplement1/hou/index.htm.
- [5] R.A. Kemmerer and G. Vigna, "Intrusion Detection: A Brief History and Overview," IEEE Security and Privacy (supplement to Computer, vol. 35, no. 4), Apr. 2002, pp. 27-30; www.computer.org/security/supplement1/kem/index.htm.
Yoshihiro Tohma is president of Tokyo Denki University. His research interests include fault tolerance, dependable computing, software reliability evaluation, and security. He received his PhD in electrical engineering from Tokyo Institute of Technology. He is a life fellow of the IEEE. Contact him at Tokyo Denki University, 2-2 Kanda-Nishikicho, Chiyoda-ku, Tokyo 101-8457, Japan; tohma@sie.dendai.ac.jp.