IEEE Computer Society
Pages: pp. 1441-1443
In the paper "Slicing Distributed Systems," Gramoli et al. consider the problem of slicing, that is, partitioning nodes into subsets according to a one-dimensional attribute, in distributed systems. This problem arises in peer-to-peer networks and other popular distributed systems used in today's large data centers. Slicing can help autonomic systems shift work dynamically among nodes and react dynamically to nodes dropping from the system or new nodes joining. The paper reviews existing approaches such as sorting and ranking and shows their convergence problems, which result in slow adaptation to changed conditions or inaccurate slice assignments in nonstatic networks. The authors introduce Sliver, a simple distributed slicing protocol that samples attribute values from the network and estimates the slice index from the sample. The theoretical results show that Sliver converges provably quickly, is robust under stress, and is simple to implement; simulation results for the new protocol support these findings.
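As a rough illustration of the sampling idea, and not the authors' Sliver protocol itself, a node can estimate its slice by counting how many sampled attribute values fall at or below its own. The function name `estimate_slice` and its parameters are assumptions made for this sketch.

```python
def estimate_slice(own_value, sampled_values, num_slices):
    """Estimate which of num_slices equal-sized slices own_value belongs to.

    sampled_values: attribute values observed from other nodes (hypothetical input).
    Returns a slice index in the range 0 .. num_slices - 1.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    if not sampled_values:
        return 0  # no information yet; arbitrary default chosen for this sketch
    # The fraction of sampled values that are not larger than our own value
    # approximates our normalized rank in the whole network.
    lower_or_equal = sum(1 for v in sampled_values if v <= own_value)
    rank_fraction = lower_or_equal / len(sampled_values)
    # Map the rank fraction onto a slice index, clamping the top edge.
    return min(int(rank_fraction * num_slices), num_slices - 1)
```

A larger sample gives a better rank estimate, which is why the accuracy of the slice assignment depends on how many values a node has observed.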
The paper "Cholla: A Framework for Composing and Coordinating Adaptations in Networked Systems" by Bridges et al. describes Cholla, a software architecture that separates the policy decisions of how and when adaptive components in networked systems react to their environment into separate centralized controllers that are constructed from composable rule sets.
A major goal of this work is to develop a controller architecture that can be used for any collection of adaptive networking components, not a specific architecture that has to be implemented for each application-specific configuration or reimplemented each time that configuration changes. To achieve this, controllers are designed to be composable, with each adaptive component in the system contributing logic that together implements the overall functionality of the controller. This logic is implemented as rule sets that describe particular adaptation policies, and the architecture allows rule sets that implement linear, heuristic, and fuzzy control policies. Using composable control logic allows the controller to be specialized based not only on the set of adaptive components in the system, but also on the target network architecture and application requirements.
This architecture consists of adaptable components, composable adaptation controllers, and a runtime system. In addition to describing the architecture of Cholla, this paper also presents a Linux-based prototype implementation of this architecture that controls and coordinates adaptation policy decisions inside network protocols and multimedia applications. An experimental evaluation of this prototype demonstrates that Cholla's controller architecture enables component-based construction and customization of adaptation policies in networked systems, and that these policies can effectively control and coordinate adaptation.
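The idea of a controller assembled from rule sets contributed by individual components can be pictured with a minimal sketch. This is not Cholla's actual interface: the functions `compose` and `run_controller`, the rule representation as (condition, action) pairs, and the observation keys are all hypothetical.

```python
def compose(*rule_sets):
    """Merge rule sets contributed by different components into one rule list."""
    return [rule for rule_set in rule_sets for rule in rule_set]


def run_controller(rules, observations):
    """Return the action of every rule whose condition holds for the observations."""
    return [action for condition, action in rules if condition(observations)]


# Hypothetical rule sets, one per adaptive component.
transport_rules = [
    (lambda obs: obs["loss_rate"] > 0.05, "transport: enable FEC"),
]
video_rules = [
    (lambda obs: obs["bandwidth_kbps"] < 500, "video: lower bitrate"),
]
```

Adding or removing a component then only changes which rule sets are passed to `compose`; the controller loop itself stays the same.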
The paper "An Online Mechanism for BGP Instability Detection and Analysis" by Deshpande et al. proposes an online instability detection architecture for the Border Gateway Protocol (BGP). BGP is one of the core building blocks of the Internet and is used to direct traffic between different subnetworks of the Internet, known as Autonomous Systems (ASes). The stability of BGP is critical to the stability of the Internet, and under stressful conditions such as worm attacks or router failures, BGP's basic update protocol can result in severe route fluctuations and loss of connectivity for impacted networks for extended periods of time.
The proposed architecture can be implemented by individual routers autonomously without the need for central control or centralized data. This allows widespread deployment without modifying standards and also allows partial deployments without large-scale cooperation among networks. The proposed system uses statistical techniques to detect instabilities from the time-domain characteristics of BGP update message data and does not need any additional probing. The experimental evaluation, which uses data collected from routers in Europe, including data from several anomaly periods, shows that the system performs well across various kinds of instabilities and across different topologies and policies.
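To make the notion of detecting instability from update-message timing concrete, the following sketch flags time bins whose update count is far above a recent baseline. It uses a plain sliding-window threshold, which is a stand-in chosen for brevity and not the statistical method of Deshpande et al.; the function name, the window length, and the threshold are assumptions.

```python
from statistics import mean, pstdev


def flag_unstable_intervals(update_counts, window=5, threshold=3.0):
    """Return indices of time bins whose update count is anomalously high.

    update_counts: number of BGP update messages seen in consecutive time bins.
    """
    flagged = []
    for i in range(window, len(update_counts)):
        history = update_counts[i - window:i]
        baseline = mean(history)
        spread = pstdev(history)
        # Use a minimum spread of 1.0 so a perfectly flat history
        # does not turn every small change into an alarm.
        limit = baseline + threshold * max(spread, 1.0)
        if update_counts[i] > limit:
            flagged.append(i)
    return flagged
```

Like the system described above, such a check needs only the update messages a router already receives, so it requires no extra probing.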
The paper "An Integrated Planning and Adaptive Resource Management Architecture for Distributed Real-Time Embedded Systems" by Shankaran et al. looks at autonomicity issues in an open distributed real-time embedded (DRE) environment. Such systems, which may arise in contexts such as space control missions, have fundamentally different characteristics than traditional embedded systems with controllable inputs. Autonomic operations require close coordination between the resource planning needs of the system and the planning of different activities within the system. The paper introduces a novel integrated planning, allocation, and control framework that combines aspects of decision-theoretic planning, dynamic resource allocation, and runtime system control. This framework enables open DRE systems to operate autonomously by dynamically assembling high expected utility applications, monitoring resource utilization and application QoS, performing fine-grained adaptations in response to fluctuations in resource utilization, and performing functional system adaptation in response to changing environmental conditions or significant variation in system resource availability. The design of the framework is geared toward improving system stability, reducing system downtime, and effectively coordinating system adaptations at various levels.
The paper "Data Security in Unattended Wireless Sensor Networks" by Di Pietro et al. looks at the challenges of security in the context of unattended wireless sensor networks. In such a setting, they introduce a novel model of an adversary who is capable of moving among compromised sensor nodes and attempts to observe, modify, or erase data being produced by uncompromised sensors. They examine various non-cryptographic counter-measure strategies, such as data migration and data replication, and analyze how effectively each strategy improves the probability of data survival.
The paper "Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing" by Z. Chen and J. Dongarra presents a self-healing approach that is based on FT-MPI (a fault-tolerant version of MPI) and diskless checkpointing. Due to the high frequency of failures and the large number of processors in next-generation computing systems, the classical checkpoint/restart fault tolerance approach may become a very inefficient way to handle failures. Even making generous assumptions about the reliability of a single processor or link, it is clear that as the processor count in high-end clusters grows into the tens of thousands, the mean-time-to-failure of these clusters will drop from a few years to a few days, or less. Applications built in this framework are self-healing and can survive multiple simultaneous node or link failures. When some of an application's processes fail due to node or link failure, the application in the proposed fault-tolerant framework will not be aborted, unlike in the classical checkpoint/restart fault tolerance paradigm. Instead, it will keep all of its surviving processes and adapt itself to failures.
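The mean-time-to-failure claim above can be illustrated with a back-of-envelope calculation. The sketch assumes independent, exponentially distributed node lifetimes, so that node failure rates add up; this model and the example numbers are assumptions for illustration and are not taken from the paper.

```python
def system_mttf_days(node_mttf_years, num_nodes):
    """Mean time to the first failure anywhere in the system, in days.

    Assumes independent, exponentially distributed node lifetimes, so the
    system failure rate is num_nodes times the rate of a single node.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mttf_years * 365.0 / num_nodes
```

Under this model, even a generous hypothetical node MTTF of 50 years gives `system_mttf_days(50, 10000)` of roughly 1.8 days for a 10,000-node cluster.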
By eliminating stable storage from checkpointing and replacing it with memory and processor redundancy, diskless checkpointing removes the main source of overhead in checkpointing on distributed systems. It is assumed that disjoint pairs of processors can communicate without interfering with one another. To tolerate multiple simultaneous process failures of arbitrary patterns with minimum process redundancy, a weighted checksum scheme can be used. A weighted checksum scheme can be viewed as a version of the Reed-Solomon erasure coding scheme in the real number field. Several relevant experiments demonstrate that the recovery time increases approximately linearly as the number of failed processors increases.
It is shown that both the checkpoint overhead and the recovery overhead are very stable as the total number of computing processes increases from 60 to 480. These experimental results are consistent with the theoretical results.
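The weighted checksum scheme mentioned above can be sketched in a few lines. This is a simplified illustration and not the authors' implementation: each process holds a single number, the checksum processes are assumed to survive, Vandermonde-style weights are an assumed choice, and exact fractions stand in for floating-point arithmetic to keep the example free of round-off. All function names are hypothetical.

```python
from fractions import Fraction


def make_weights(num_checksums, num_data):
    """Vandermonde-style weights: row j, column i holds (i + 1) ** j."""
    return [[Fraction(i + 1) ** j for i in range(num_data)]
            for j in range(num_checksums)]


def encode(data, weights):
    """Each checksum is a weighted sum of all data values."""
    return [sum(w * x for w, x in zip(row, data)) for row in weights]


def solve(matrix, rhs):
    """Solve a small square linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def recover(surviving, failed, checksums, weights):
    """Rebuild lost values.

    surviving: dict mapping process index to its value.
    failed: list of indices whose values were lost.
    """
    k = len(failed)
    if k > len(checksums):
        raise ValueError("more failures than checksums")
    matrix, rhs = [], []
    for j in range(k):
        # Move the known contributions to the right-hand side.
        matrix.append([weights[j][i] for i in failed])
        known = sum(weights[j][i] * x for i, x in surviving.items())
        rhs.append(checksums[j] - known)
    return dict(zip(failed, solve(matrix, rhs)))
```

With m checksums, any pattern of up to m lost data values leads to a solvable system here because the weights of distinct processes are distinct, which mirrors the erasure-coding view described above.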
The paper "Using Virtualization to Improve Software Rejuvenation" by Silva et al. uses virtualization to tolerate failures due to software aging in web servers. To avoid server downtime, a cluster configuration is used; that is, when one server is restarted, there is always a backup server, and thus no requests are lost.
The virtualization middleware XEN has been adopted for implementing the proposed approach. There are three virtual machines per application server: VM-LB, which runs a software load balancer; a second VM for the main application server; and a third VM that operates as a hot standby. The VM-LB is also responsible for detecting software aging or transient failures. When a failure is detected, a rejuvenation action is triggered. The Aging and Anomaly-Detectors are core modules in the system. All software modules have been implemented using open-source tools such as LVS, ldirectord, and Ganglia. If the system starts getting slower, the web server is rejuvenated. It is demonstrated that application crashes due to internal memory leaks can be avoided, and that a server restart using the proposed rejuvenation scheme results in zero failed requests. It is concluded that the rejuvenation mechanism was able to maintain a sustained level of performance of the server in the presence of software aging and transient failures in web servers.
D.R. Avresky, Harald Prokop, and Dinesh C. Verma
Guest Editors