In the paper "Slicing Distributed Systems," Gramoli et al. consider the problem of slicing in distributed systems, that is, the partitioning of nodes into subsets according to a one-dimensional attribute. This problem arises in peer-to-peer networks and other popular distributed systems used in today's large data centers. Slicing can help autonomic systems shift work dynamically among nodes and react dynamically to nodes dropping from the system or new nodes joining. The paper reviews existing approaches such as sorting and ranking and shows their convergence problems, which result in slow adaptation to changed conditions or inaccurate slice assignments in nonstatic networks. The authors introduce Sliver, a simple distributed slicing protocol that samples attribute values from the network and estimates the slice index from the sample. The theoretical results show that Sliver has provably rapid convergence, is robust under stress, and is simple to implement; simulation results for the new protocol support these findings.
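The idea of estimating a slice index from a sample of attribute values can be illustrated with a minimal sketch. This is not the Sliver protocol itself: the function name `estimate_slice`, the equal-width slicing of the rank range, and the way the sample is obtained are all assumptions made for illustration.

```python
import random


def estimate_slice(own_value, sample, num_slices):
    """Estimate a node's slice index from sampled attribute values.

    Hypothetical helper, not taken from the paper: the node computes the
    fraction of sampled values below its own value and maps that fraction
    onto num_slices equal-sized slices.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    lower = sum(1 for value in sample if value < own_value)
    fraction = lower / len(sample)
    # A fraction of exactly 1.0 would map past the last slice, so clamp it.
    return min(int(fraction * num_slices), num_slices - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed setup: 1000 nodes with uniformly distributed attribute values.
    population = [rng.random() for _ in range(1000)]
    node_value = population[0]
    sample = rng.sample(population, 100)
    print(estimate_slice(node_value, sample, num_slices=10))
```

A larger sample gives a more accurate rank estimate at the cost of more communication, which is the kind of trade-off a sampling-based protocol has to manage.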
The paper "Cholla: A Framework for Composing and Coordinating Adaptations in Networked Systems" by Bridges et al. describes Cholla, a software architecture that separates the policy decisions of how and when adaptive components in networked systems react to their environment into separate centralized controllers that are constructed from composable rule sets.
A major goal of this work is to develop a controller architecture that can be used for any collection of adaptive networking components, not a specific architecture that has to be implemented for each application-specific configuration or reimplemented each time that configuration changes. To achieve this, controllers are designed to be composable, with each adaptive component in the system contributing logic that together implements the overall functionality of the controller. This logic is implemented as rule sets that describe particular adaptation policies; the rule sets can implement linear, heuristic, and fuzzy control policies. Using composable control logic allows the controller to be specialized based not only on the set of adaptive components in the system, but also on the target network architecture and application requirements.
This architecture consists of adaptable components, composable adaptation controllers, and a runtime system. In addition to describing the architecture of Cholla, this paper also presents a Linux-based prototype implementation of this architecture that controls and coordinates adaptation policy decisions inside network protocols and multimedia applications. An experimental evaluation of this prototype demonstrates that Cholla's controller architecture enables component-based construction and customization of adaptation policies in networked systems, and that these policies can effectively control and coordinate adaptation.
The paper "An Online Mechanism for BGP Instability Detection and Analysis" by Deshpande et al. proposes an online instability detection architecture for the border gateway protocol (BGP). BGP is one of the core building blocks of the Internet and is used to direct traffic between different subnetworks of the Internet, known as Autonomous Systems (ASs). The stability of BGP is critical to the stability of the Internet, and under stressful conditions such as worm attacks or router failures, BGP's basic update protocol can result in severe route fluctuations and loss of connectivity for impacted networks for extended periods of time.
The proposed architecture can be implemented by individual routers autonomously without the need for central control or centralized data. This allows widespread deployment without modification of standards and also allows partial deployments without large-scale cooperation among networks. The proposed system uses statistical techniques to detect instabilities using the time domain characteristics of BGP update message data and does not need any additional probing. The experimental evaluation, which uses data collected from routers in Europe, including data from several anomaly periods, shows that the system performs well for various kinds of instabilities and across different topologies and policies.
The paper "An Integrated Planning and Adaptive Resource Management Architecture for Distributed Real-Time Embedded Systems" by Shankaran et al. looks at autonomicity issues in an open distributed real-time embedded (DRE) environment. Such systems, which may arise in contexts such as space control missions, have fundamentally different characteristics than traditional embedded systems with controllable inputs. Autonomic operation requires close coordination between the resource planning needs of the system and the planning of different activities within the system. The paper introduces a novel integrated planning, allocation, and control framework that combines aspects of decision-theoretic planning, dynamic resource allocation, and runtime system control. This framework enables open DRE systems to operate autonomously by dynamically assembling high expected utility applications, monitoring resource utilization and application QoS, performing fine-grained adaptations in response to fluctuations in resource utilization, and performing functional system adaptation in response to changing environmental conditions or significant variation in system resource availability. The design of the framework is geared toward improving system stability, reducing system downtime, and effectively coordinating system adaptations at various levels.
The paper "Data Security in Unattended Wireless Sensor Networks" by Di Pietro et al. looks at the challenges of security in the context of unattended wireless sensor networks. In such a setting, they introduce a novel model of an adversary who is capable of moving among compromised sensor nodes and attempts to observe, modify, or erase data being produced by uncompromised sensors. They examine various noncryptographic countermeasure strategies, such as data migration and data replication, and analyze how effectively each strategy improves the probability of data survival.
The paper "Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing" by Z. Chen and J. Dongarra presents a self-healing approach that is based on FT-MPI (a fault-tolerant version of MPI) and diskless checkpointing. Due to the high frequency of failures and the large number of processors in next generation computing systems, the classical checkpoint/restart fault tolerance approach may become a very inefficient way to handle failures. Even making generous assumptions on the reliability of a single processor or link, it is clear that as the processor count in high end clusters grows into the tens of thousands, the mean-time-to-failure of these clusters will drop from a few years to a few days, or less. Applications built in this framework are self-healing and can survive multiple simultaneous node or link failures. When some of an application's processes fail due to node or link failure, the application in the proposed fault-tolerant framework will not be aborted, unlike in the classical checkpoint/restart fault tolerance paradigm. Instead, it will keep all of its surviving processes and adapt itself to failures.
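The scaling argument for mean-time-to-failure can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, as a simplification not stated in the paper summary, that nodes fail independently at a constant rate and that any single node failure counts as a system failure; the 10-year node MTTF is a hypothetical figure.

```python
def system_mttf_days(node_mttf_years, num_nodes):
    """Approximate system MTTF in days under the stated assumptions.

    With independent, constant-rate failures, the failure rates add up,
    so the system MTTF is the node MTTF divided by the node count.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mttf_years * 365.0 / num_nodes


if __name__ == "__main__":
    # Hypothetical node MTTF of 10 years at increasing cluster sizes.
    for nodes in (1, 100, 1_000, 10_000):
        print(nodes, system_mttf_days(10, nodes))
```

Under these assumptions, a cluster of 10,000 nodes with a 10-year node MTTF would see a failure roughly every 0.365 days, which is why restarting the whole application on each failure becomes impractical.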
By eliminating stable storage from checkpointing and replacing it with memory and processor redundancy, diskless checkpointing removes the main source of overhead in checkpointing on distributed systems. It is assumed that disjoint pairs of processors can communicate with each other without interference from each other. To tolerate multiple simultaneous process failures of arbitrary patterns with minimum process redundancy, a weighted checksum scheme can be used. A weighted checksum scheme can be viewed as a version of the Reed-Solomon erasure coding scheme in the real number field. Several relevant experiments demonstrate that the recovery time increases approximately linearly as the number of failed processors increases.
It is shown that both the checkpoint overhead and the recovery overhead are very stable as the total number of computing processes increases from 60 to 480. These experimental results are consistent with the theoretical results.
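The weighted checksum idea mentioned above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the Vandermonde-style weights, the function names, and the in-memory lists standing in for per-process checkpoint data are all assumptions. Each checksum process stores a weighted sum of the data vectors, and lost vectors are recovered by solving a small linear system over the real numbers.

```python
def weights(num_checksums, num_data):
    # Assumed weight choice: row j, column i holds (i + 1) ** j.
    return [[float((i + 1) ** j) for i in range(num_data)]
            for j in range(num_checksums)]


def encode(data, w):
    # data: one equal-length vector per data process.
    length = len(data[0])
    return [[sum(w[j][i] * data[i][e] for i in range(len(data)))
             for e in range(length)] for j in range(len(w))]


def solve(a, b):
    # Gaussian elimination with partial pivoting for a small system a x = b.
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def recover(data, checksums, w, failed):
    # data holds None at the failed indices; returns {index: restored vector}.
    if len(failed) > len(w):
        raise ValueError("more failures than checksum processes")
    survivors = [i for i in range(len(data)) if i not in failed]
    a = [[w[j][i] for i in failed] for j in range(len(failed))]
    restored = {i: [] for i in failed}
    for e in range(len(checksums[0])):
        b = [checksums[j][e] - sum(w[j][i] * data[i][e] for i in survivors)
             for j in range(len(failed))]
        for i, value in zip(failed, solve(a, b)):
            restored[i].append(value)
    return restored
```

With two checksum processes, for example, any two of the data vectors can be lost and rebuilt from the survivors. Because the arithmetic is done in floating point, recovery is subject to round-off error, so the choice of weights matters in practice.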
The paper "Using Virtualization to Improve Software Rejuvenation" by Silva et al. utilizes virtualization for tolerating failures due to software aging in web servers. To avoid server downtime, a cluster configuration is used: when one server is restarted, there is always a backup server, and thus no requests are lost.
The virtualization middleware XEN has been adopted for implementing the proposed approach. There are three virtual machines per application server: VM-LB, which runs a software load balancer; a second VM for the main application server; and a third VM that operates as a hot standby. The VM-LB is also responsible for detecting software aging or transient failures. When a failure is detected, a rejuvenation action is triggered. The Aging and Anomaly-Detectors are core modules in the system. All software modules have been implemented using open-source tools such as LVS, ldirectord, and Ganglia. If the system starts getting slower, the web server is rejuvenated. It is demonstrated that application crashes due to internal memory leaks can be avoided. It has been shown that a server restart using the proposed rejuvenation scheme results in zero failed requests. It is concluded that the rejuvenation mechanism was able to maintain a sustained level of server performance in the presence of software aging and transient failures in web servers.
D.R. Avresky
Harald Prokop
Dinesh C. Verma
Guest Editors
• D.R. Avresky is with the International Research Institute on Autonomic Network Computing IRIANC, 1776 Commonwealth Ave., 2, Brighton, MA, 02135, and Munich, Germany.
• H. Prokop is with Akamai Technologies Inc., 8 Cambridge, MA, 02142.
• D.C. Verma is with the IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY, 10532.
D.R. Avresky
has published more than 116 papers, including IEEE journal papers, in the areas of networks (control, routing, fault tolerance, dynamic reconfiguration, adaptive routing, self-healing, performance analysis, virtual/overlay networks, and middleware), wireless sensor networks, software and protocol verification, parallel computers, functional programming, testing, and diagnostics. He has supervised 13 PhD students on the above-mentioned topics. He has been funded by the US National Science Foundation, Hewlett Packard, Compaq, Tandem, NASA, Motorola Research Labs, Bell Labs, Akamai Tech. Inc., and other institutions in the USA. He is currently president of the International Research Institute on Autonomic Network Computing (IRIANC), Boston, MA, USA/Munich, Germany. He has held academic positions at different universities in the USA and Europe and has created research labs in the areas of "Network Computing" and "Fault-Tolerant and High-Performance Parallel Computers" at these academic institutions. Dr. Avresky has been a guest coeditor of five IEEE journal issues: IEEE Transactions on Computers, special section on "autonomic network computing," November 2009; IEEE Transactions on Computers, special issue on "embedded fault-tolerant systems," February 2002; IEEE Transactions on Parallel and Distributed Systems, special issue on "dependable network computing," February 2001; IEEE Micro, IEEE Computer Society, special issue on "Embedded Fault-Tolerant Systems," September/October 2001; and IEEE Micro, IEEE Computer Society, USA, special issue on "embedded fault-tolerant systems," September/October 1998. In addition, he has published six books and five book chapters, including Dependable Network Computing (Kluwer Academic Publishers, 2000); Fault-Tolerant Parallel and Distributed Systems (Kluwer Academic Publishers, 1998); Fault-Tolerant Parallel and Distributed Systems (Computer Society Press, 1995); Hardware and Software Fault-Tolerance in Parallel Computing Systems (Simon & Schuster International Group, 1992); Fault-Tolerant Microprocessor Systems (Jusautor, 1984); and Diagnostics and Reliability of Computers (Jusautor, 1979). He is a founder and a program/steering committee chair of the IEEE International Symposium on Network Computing and Applications (NCA*), Cambridge, MA, during 2002-2009; the Annual IEEE International Workshop FTPDS (now DPDNS), held in conjunction with IEEE IPDPS, during 1996-2009; and the IEEE Workshop on Embedded Fault-Tolerant Systems (EFTS) (Dallas, 1996; Boston, 1998; and Washington, DC, 2000). He has served as a reviewer for IEEE journals and as a program committee member and reviewer for numerous IEEE conferences. He is a senior member of the IEEE.
Harald Prokop
is Senior Vice President of Engineering at Akamai. He manages the company's engineering efforts and is responsible for building the services that run on Akamai's globally distributed computing platform. He also helps to define and is responsible for executing Akamai's technology strategy with a focus on innovative, market-leading services based on customer requirements. He joined Akamai in 1999 as one of Akamai's first engineers and has held various leadership roles within Akamai's Engineering department. He is one of the architects of Akamai's proprietary system for Internet performance measurement and traffic management, a key component of the company's services. He designed the development and testing process that brings Akamai services to market. Prior to Akamai, he worked on high-performance and parallel computing at the MIT Laboratory for Computer Science. His master's thesis, "Cache-Oblivious Algorithms," has resulted in an active new research branch of algorithm theory that includes the efficient use of memory hierarchies. He has also received numerous awards during his career. In 2004, he received Akamai's highest employee recognition, the Daniel Lewin Award, for his outstanding technical contributions to the company. In 1996, he received a fellowship award from Cusanuswerk, Germany. In 1994, he was one of the winners of the Bundeswettbewerb Informatik, an annual computer science contest sponsored by the federal government in Germany that attracted thousands of student entrants. He received the MSc degree in electrical engineering and computer science from MIT, and the undergraduate degree in computer science from the Universität Karlsruhe in Germany. He has been published in the fields of Internet computing and computer science.
Dinesh C. Verma
is a researcher and senior manager in the networking technology area at IBM T.J. Watson Research Center, Hawthorne, New York. He received the doctoral degree in computer networking from the University of California, Berkeley, in 1992, the bachelor's degree in computer science from the Indian Institute of Technology, Kanpur, India, in 1987, and the master's degree in management of technology from Polytechnic University, Brooklyn, New York, in 1998. He holds more than 28 US patents related to computer networks and has authored more than 50 papers and five books in the area. He is the program manager for the US/UK International Technology Alliance in Network Sciences. He is a fellow of the IEEE and has served on various program committees and technical committees. His research interests include topics in wireless networks, network management, distributed computing, and autonomic systems.