0018-9340/03/$31.00 © 2003 IEEE
Published by the IEEE Computer Society
Guest Editorial: Special Issue on Reliable Distributed Systems
Designers of distributed systems are concerned with developing architectures, networking, software, algorithms, and applications. While research in this direction addresses some of the fundamental issues in distributed computing, topics related to modeling and simulation of multiple processor systems, real-time operation, reliability, fault tolerance, information assurance, performance measurements, and evaluation are also critical for the successful functioning of distributed systems. The purpose of this special issue is to serve researchers, designers, and implementers of distributed systems, with emphasis on system properties such as reliability, availability, and performability. In addition to conceptual advancement, this issue is intended to recognize the efforts that are aimed toward experimentation, testbeds, development, exploratory or emerging applications, and measurements from operational systems.
The theme of this special issue was chosen to coincide with the 19th IEEE Symposium on Reliable Distributed Systems, held in Nuernberg, Germany, in 2000, but the topics and submissions were not restricted to the proceedings of this symposium. We received a total of 55 submissions, of which we selected 11 regular papers. Every submission was sent to at least five referees. We received a total of 201 reviews back from 141 referees. Several papers that are not included in this special issue have been forwarded for consideration in the regular issues of these transactions.
Although we wanted a good mix of current successful efforts, innovative ideas on reliable designs, and open problems, both conceptual and experimental, the space limitations of the special issue and the type of submissions we received may have precluded some key topics of reliable distributed systems. Yet, we believe that the special issue encompasses the major properties of a reliable distributed system. The selected papers are classified into four groups:
1. detection and recovery,
2. reliable communication and protocols,
3. information assurance, and
4. fault-tolerant systems and evaluation.
Many of these areas are expected to grow considerably in the years to come, and we hope that this special issue will set the pace for the reliable design of new systems and emerging applications.
The first three papers are on the detection and recovery theme. The paper "Perfect Failure Detection in Timed Asynchronous Systems" by C. Fetzer extends the concept of unreliable failure detectors to timed asynchronous systems by enforcing perfect failure detection with hardware watchdogs. The implementation uses a combination of off-the-shelf software and hardware. In their paper "Transparent Recovery from Intermittent Faults in Time-Triggered Distributed Systems," N. Kandasamy, J.P. Hayes, and B.T. Murray propose a constructive approach to integrating runtime recovery policies so that a processor recovering from task failures does not disrupt the operation of other processors. Given the details of a system, they construct the corresponding fault-tolerant schedule with sufficient slack to accommodate recovery and show that their approach incurs only about 30-40 percent performance overhead compared to its non-fault-tolerant counterpart. K.F. Ssu, W.K. Fuchs, and H.C. Jiau extend their checkpointing and recovery ideas to heterogeneous computing environments in the paper "Process Recovery in Heterogeneous Systems." They describe a recovery approach and its implementation, called PREACHES, which provides portable checkpointing of single-process applications using a checkpoint propagation mechanism.
The next three papers are on reliable real-time communication and the associated protocols. "Peer-to-Peer Membership Management for Gossip-Based Protocols" by A.J. Ganesh, A.M. Kermarrec, and L. Massoulié presents a scalable membership protocol, which operates in a fully distributed manner and provides each member with a partial view of the group membership. The authors show that the size of the partial views converges to the value needed to support a gossip algorithm in a reliable way. Their work is supported by a theoretical analysis and simulation of a basic protocol. The paper "Semantically Reliable Multicast: Definition, Implementation, and Performance Evaluation" by J. Pereira, L. Rodrigues, and R. Oliveira makes use of the notion of message obsolescence to develop a semantically reliable multicast protocol that is more resilient to transient performance perturbations of group members and enhances application throughput. The protocol is studied analytically and verified by simulation and real experimentation. E. Nett and S. Schemmer address the real-time scheduling of shared resources in a wireless distributed system in their paper "Reliable Real-Time Communication in Cooperative Mobile Applications." By separating the application-specific scheduling part, which is executed locally, from the general-purpose communication core, the authors address the real-time cooperation of mobile autonomous systems.
Information assurance is an emerging topic in a wide variety of commercial and military applications. The next two papers investigate the use of redundancy in building survivable systems and the availability of such systems when they come under attack. M.A. Hiltunen, R.D. Schlichting, and C.A. Ugarte in "Building Survivable Services Using Redundancy and Adaptation" show how to provide intrusion-tolerant interprocess communication using redundant resources and how to reduce the chance of incidents that may compromise the entire system. The technique is made more robust by a new adaptability criterion wherein the software, as a reactive measure, modifies its runtime behavior, thereby complicating the attack and preventing it from succeeding. J. Xu and W. Lee in "Sustaining Availability of Web Services under Distributed Denial of Service Attacks" look at the distributed denial of service (DDoS) attacks that have plagued commercial systems in recent times. Their approach to providing availability of services under such attacks is to isolate and protect legitimate traffic from the DDoS traffic when an attack is detected. Cryptographic techniques are used to provide the primary protection for the legitimate traffic across the Internet.
The fault tolerance and evaluation category has three papers: one on the performance and dependability of multiprocessors, one on the fault-tolerant execution of mobile agents, and one on experiences and lessons learned in the design of next-generation fault-tolerant systems. S. Pleisch and A. Schiper in "Fault-Tolerant Mobile Agent Execution" identify key properties of fault-tolerant mobile agent execution. They caution that the general approach of replication may lead to multiple executions of an agent, resulting in a violation of the exactly-once property of mobile agents and several undesirable side effects. They solve this problem by modeling fault-tolerant mobile agent execution as a sequence of agreement problems. M. Rabah and K. Kanoun propose a modeling framework for evaluating the dependability and performance measures of multiprocessor systems in "Performability Evaluation of Multipurpose Multiprocessor Systems: The 'Separation of Concerns' Approach." Their contribution is the explicit separation between the architectural and environmental concerns of a given system. The technique is illustrated for a multiprocessor architecture of 16 processors. The special issue concludes with a practical experience paper, "Reflective Fault-Tolerant Systems: From Experience to Challenges" by J.C. Ruiz, M.-O. Killijian, J.-C. Fabre, and P. Thévenod-Fosse. The fault-tolerant framework proposed here combines design and validation issues for the development of dependable reflective systems. Reflection is a property by which a component can observe and control its own structure and behavior from outside and can be used as an input to perform nonfunctional actions such as fault tolerance. The paper also identifies several new research challenges in this emerging field of fault tolerance.
Finally, we would like to express our thanks to the authors of all submitted papers and to the referees for their outstanding and timely reviews. This special issue would not have been possible without the support of Dr. Jean-Luc Gaudiot, IEEE TC Editor-in-Chief, and Associate Editor Dr. Arun Somani, who stood behind our effort on this special issue on reliable distributed systems. We would also like to thank Silvano Chiaradonna, who has done an excellent job of bookkeeping and coordinating the reviews, and Suzanne Werner of IEEE TC for her help throughout this project.
Shambhu J. Upadhyaya
• S.J. Upadhyaya is with the Center of Excellence in Information Systems Assurance Research and Education, State University of New York, Buffalo, NY 14260. E-mail: email@example.com.
• A. Bondavalli is with the Faculty of Science, University of Firenze, Via Lombroso 6/17, 50134 Firenze, Italy. E-mail: firstname.lastname@example.org.
For information on obtaining reprints of this article, please send e-mail to: email@example.com, and reference IEEECS Log Number 117433.
Shambhu J. Upadhyaya is an associate professor of computer science and engineering and director of the Center of Excellence in Information Systems Assurance Research and Education at the State University of New York at Buffalo. Prior to July 1998, he was a faculty member in the Electrical and Computer Engineering Department. His research interests are information assurance and security in distributed systems, fault diagnosis, fault-tolerant computing, and VLSI testing. His current projects involve information assurance techniques, system level test scheduling, and analog circuit diagnosis. His research has been supported by the US National Science Foundation, Rome Laboratory, and the US Air Force Office of Scientific Research. In May 1999, IBM Corporation sponsored a new Electronic Test and Design Automation Lab to support his teaching and research on VLSI testing. He has been awarded an IBM Faculty Partner Fellowship for the year 2000-2001 in recognition of his research accomplishments in the area of testing and fault tolerance. He served as the USA program chair of the IEEE Symposium on Reliable Distributed Systems, 2000, at Nuernberg, Germany. He was a coguest editor of a book entitled Mobile Computing: Implementing Pervasive Information and Communication Technologies by Kluwer Academic Publishers, which was released in July 2002. He is an associate editor of the IEEE Transactions on Computers and a member of the editorial board of the International Journal on Reliability, Quality, and Safety Engineering published by the World Scientific Publishers. He is a senior member of the IEEE and a member of the IEEE Computer Society.
A. Bondavalli is an associate professor of the Faculty of Science at the University of Firenze, Italy. Previously, he was a researcher at the Italian National Research Council, working at the CNUCE Institute at Pisa, where he was responsible for the Dependable Computing Group. Since 1989, he has been working on dependable computing and has participated in many national projects and in the ESPRIT BRA 3092 PDCS, 6362 PDCS-2, and ESPRIT 27439 HIDE Projects funded by the European Community. He has been scientifically responsible for the PDCC partnership in the ESPRIT 20716 GUARDS. He has authored or coauthored more than 80 papers which have appeared in international journals and proceedings of international conferences. He served as European program chair of the IEEE Symposium on Reliable Distributed Systems, 2000, at Nuernberg, Germany, and of IEEE HASE 2001 at Boca Raton, Florida. Currently, he is serving as program chair of EDCC-4 European Conference on Dependable Computing, 2002, Toulouse, France, and of IEEE ISADS (International Symposium on Autonomous Decentralized Systems), 2003, Pisa, Italy. He will serve as the general chair of IEEE SRDS 2003, Florence, Italy. His current research interests include the design of dependable computing systems, software and system fault tolerance, and the modeling and evaluation of dependability attributes like reliability and performability. He is a member of the IEEE Computer Society, the IFIP W.G. 10.4 Working Group on Dependable Computing and Fault-Tolerance, ENCRESS Club Italy, and the AICA Working Group on Dependability in Computer Systems.