September 2004 (Vol. 5, No. 9) 1541-4922/04/$26.00 © 2004 IEEE. Published by the IEEE Computer Society

Service Continuity in Networked Control Using Etherware
Safety requirements and the ability to tolerate changes make service continuity crucial for networked control systems. The authors describe how their middleware for networked control, Etherware, addresses these issues.

The ability to tolerate change is a fundamental quality of a sustainable dynamic system. A system's ability to respond to change in its operating conditions determines, to a large extent, its useful lifetime. However, it's impractical to anticipate all changes before you deploy a system. Instead, systems are often designed to be able to evolve and adapt dynamically.

Software systems are particularly malleable because of the "program as data" concept that the von Neumann architecture introduced [1]. Indeed, developers have created systems software such as operating systems and middleware to manage application software. This is the basis of most dynamically evolvable software systems today.

Our focus is on networked control systems comprising sensors, actuators, and computers that coordinate over a network to manage distributed real-time systems. Because these systems directly interact with the real world, they're subject to drastic and often unpredictable changes. Hence, maintaining such systems' operational integrity requires containing the changes' effects. Further, developers usually build these systems as networks of components that interact by providing and consuming services. Therefore, service continuity is a basic requirement that must be addressed for the viability of such applications.

In this article, we first develop the notion of operational integrity for networked control and identify the main challenges of maintaining it. Second, we identify several middleware issues involved in operational integrity, consider their influence on the design of Etherware (our middleware for networked control), and describe how Etherware supports service continuity.
A key feature of Etherware is the ability to maintain communication channels during component restarts and upgrades.

Operational integrity in networked control

Networked control software interacts with a distributed real-time system, usually called a plant. Sensors provide feedback about the plant behavior. Computers running control programs use this sensor feedback to generate actuator commands that accomplish specified goals. Actuators implement these commands to control the plant. Sensors and actuators constitute the interface between the software and the real world (see Figure 1). Also, the controller is typically implemented as a set of software components operating over a network of computers.

Figure 1. Schematic of a networked control system.

The goals that the controller receives specify the objectives to be achieved during plant operation. An important part of these goals is a set of safety criteria that ensures that the plant operates in a "safe" region. For example, in a traffic control system, a safety criterion would be to maintain a given distance between any two cars, ensuring against collisions. In general, safety criteria depend on specific applications and are usually part of the system specification. In this context, operational integrity is the property that the system always meets its safety criteria while still maintaining its operational capability, which is its ability to meet specified goals.

Operational networked control systems are subject to many changes, classified as voluntary or involuntary. Voluntary changes, such as component upgrades and configuration changes, are intentionally introduced by a system operator. Involuntary changes are caused by environmental events beyond operator control. To maintain operational integrity, you must handle these changes dynamically and minimize their impact on the system's operational capability.
Notably, active and Byzantine failures require semantic information and must be handled by application-specific mechanisms.

Maintaining service continuity

A system designer typically creates middleware-based systems as a set of coordinating components that interact by providing and consuming services. For networked control systems in particular, some of these services might be critical for a component's operation. For example, the controllers in Figure 1 depend on getting feedback from the sensors. Controls are usually discrete and calibration is imperfect. Therefore, any changes in this feedback service could result in serious faults in controller operation. Hence, the feedback service's continuity is imperative to the controller's operation. Similarly, the actuators depend on receiving controls from the controller.

Networked control systems have fairly strict safety requirements, so components must respond to changes as soon as possible. For example, on detecting a safety violation, a controller component might not be able to wait for an acknowledgment from another remote component before it decides to take a safe action. On a wireless channel in particular, delays can be fairly large because of interference and fluctuating channel conditions. To maintain operational integrity, components must be able to operate asynchronously.

Another important consideration is the presence of dependencies in push-based communication channels. For example, a controller can't wait for updates from the sensor before sending controls to the actuator. In particular, updates might be delayed or lost because of communication failures in a wireless link. Synchronous communication would require such components to be multithreaded or to use a fairly complex design involving poller objects for each sensor. Asynchronous operation, on the other hand, eliminates this source of complexity. Based on these considerations, we've developed Etherware as an asynchronous message-based middleware.
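To make the asynchronous style concrete, here is a minimal Java sketch of a controller that never blocks on sensor feedback: incoming updates are queued, and each control step drains whatever has arrived and acts on the last known state. The class names, the `deliver` and `step` methods, and the proportional gain of 0.5 are assumptions made for illustration; they are not part of Etherware's actual interface.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sensor message; not Etherware's Message class. */
final class SensorUpdate {
    final double position;
    SensorUpdate(double position) { this.position = position; }
}

/** Illustrative controller that never waits for sensor feedback. */
final class AsyncController {
    private final ConcurrentLinkedQueue<SensorUpdate> inbox = new ConcurrentLinkedQueue<>();
    private double lastKnownPosition = 0.0;

    /** Called when an update arrives (assumed middleware callback); returns immediately. */
    void deliver(SensorUpdate update) {
        inbox.add(update);
    }

    /** One control step: drain pending updates without blocking, then compute a command. */
    double step(double target) {
        SensorUpdate next;
        while ((next = inbox.poll()) != null) {   // poll() returns null when the queue is empty
            lastKnownPosition = next.position;
        }
        // Proportional control on the last known position (the gain 0.5 is arbitrary).
        return 0.5 * (target - lastKnownPosition);
    }
}
```

If updates are delayed or lost, `step` still returns a command based on the most recent feedback it has, which is the behavior the paragraph above argues for.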
Key components might terminate owing to voluntary upgrades or involuntary failures. To maintain operational integrity, however, components must be restarted and operational within application-specified deadlines. For example, if a controller is restarted, it should know the plant's current state, as this information might not be entirely available from sensor feedback. Checkpointing is a common technique for addressing this requirement. Necessary state is checkpointed so components can be restarted appropriately. To support this, we've closely integrated checkpointing with Etherware's design and provided it as a basic service.

Components typically maintain several communication channels with other components. For service continuity, restarting or updating such components shouldn't require channels to be reestablished. Consequently, Etherware also supports maintaining communication channels across these changes. In particular, you can save identifiers for communication channels as part of checkpointed state. This lets restarted or upgraded components continue using previously established channels. It also provides communication continuity to other components during these changes.

The need to support efficient component restarts has motivated another basic design choice in Etherware. A single kernel process manages all components on a given node. Components can have separate threads if necessary. Furthermore, services the middleware provides also must be easily restartable and upgradable. Also, invariant aspects of the middleware that can't be changed dynamically must be minimized for maximum flexibility. This motivated us to adopt a microkernel-based design for Etherware [3]. This philosophy of flexibility has also resulted in the development of a bare-minimum functional interface for components to interact with the middleware. For flexibility and uniformity, all interaction with middleware services is message based.
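The idea of saving channel identifiers as part of checkpointed state can be sketched as follows. This is an illustration only: `Checkpoint`, `CheckpointStore`, and `RestartableController` are hypothetical names, and the in-memory map stands in for whatever checkpointing service the middleware provides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical checkpoint: component state plus identifiers of its open channels. */
final class Checkpoint {
    final double plantEstimate;
    final List<String> channelIds;

    Checkpoint(double plantEstimate, List<String> channelIds) {
        this.plantEstimate = plantEstimate;
        this.channelIds = Collections.unmodifiableList(new ArrayList<>(channelIds));
    }
}

/** In-memory stand-in for a checkpoint service; keeps the latest checkpoint per component. */
final class CheckpointStore {
    private final Map<String, Checkpoint> latest = new HashMap<>();

    void save(String componentName, Checkpoint cp) { latest.put(componentName, cp); }
    Checkpoint load(String componentName) { return latest.get(componentName); }
}

/** Illustrative controller that can be rebuilt from a checkpoint. */
final class RestartableController {
    private double plantEstimate;
    private final List<String> channelIds = new ArrayList<>();

    /** Fresh start: no state, no channels. */
    RestartableController() {}

    /** Restart: recover the plant estimate and reuse previously established channels. */
    RestartableController(Checkpoint cp) {
        this.plantEstimate = cp.plantEstimate;
        this.channelIds.addAll(cp.channelIds);
    }

    void openChannel(String id) { channelIds.add(id); }
    void updateEstimate(double value) { plantEstimate = value; }
    Checkpoint checkpoint() { return new Checkpoint(plantEstimate, channelIds); }
    double estimate() { return plantEstimate; }
    List<String> channels() { return Collections.unmodifiableList(channelIds); }
}
```

A restarted instance built from the stored checkpoint sees the same channel identifiers as its predecessor, so in this sketch neither it nor its peers would need to reestablish connections.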
Etherware

Etherware is messaging middleware implemented in Java for portability. In an Etherware-based application, components communicate by exchanging messages, which are XML documents [4]. Etherware provides a hierarchy of Java classes that application components can use to manipulate these documents. The hierarchy's root is the Message class, which provides various primitives to manipulate the underlying XML document. Application-defined messages are required to be Message subclasses.

Figure 2a shows a generic component's interactions in an Etherware-based networked control application. Components participate in control hierarchies, such as between a supervisor and a controller (see Figure 2a), and in data flows, such as from sensors to controllers. We based the design in Figure 2a on several design patterns [5].

Figure 2. An Etherware programming model: (a) programming model of an Etherware component and (b) filters for MessageStreams.

Etherware components can be active or passive. Passive components don't have active threads of control. They only respond to incoming messages by processing them appropriately and generating resulting messages, if any. Active components, on the other hand, can have one or more active threads of control. They can generate messages based on activities in their individual threads of control.

Message delivery in distributed systems faces two basic problems: discovering and identifying destination components. Associating profiles with addressable components solves the discovery problem. Each component that must be addressed registers a profile with the middleware. We solve the identification problem in Etherware by associating a globally unique ID, called a binding, with each component.

A profile is an XML-based description of a service a component provides. For example, a vision sensor profile could specify that the sensor is a grayscale camera covering the region between the points (0,0) and (100,50) in an appropriate coordinate space.
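As a rough illustration of what such a profile might look like as an XML document, the following Java sketch builds one with the standard DOM API. The element and attribute names (`profile`, `service`, `mode`, `region`, `x1`, and so on) are invented for this example; the article doesn't give Etherware's actual profile schema.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Illustrative builder for a vision-sensor profile; the schema is hypothetical. */
final class ProfileSketch {
    static Document visionSensorProfile(String mode, int x1, int y1, int x2, int y2) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no XML parser available", e);
        }
        // Root element describes the service the component provides.
        Element profile = doc.createElement("profile");
        profile.setAttribute("service", "vision-sensor");
        profile.setAttribute("mode", mode);

        // Child element describes the region the camera covers.
        Element region = doc.createElement("region");
        region.setAttribute("x1", Integer.toString(x1));
        region.setAttribute("y1", Integer.toString(y1));
        region.setAttribute("x2", Integer.toString(x2));
        region.setAttribute("y2", Integer.toString(y2));

        profile.appendChild(region);
        doc.appendChild(profile);
        return doc;
    }
}
```

Calling `ProfileSketch.visionSensorProfile("grayscale", 0, 0, 100, 50)` yields a document describing the camera from the example above.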
Suppose a controller is only interested in the location between coordinates (25,37) and (45,52). It could use this information to connect to a relevant vision sensor, using an appropriately defined profile request. Etherware would then match this profile with the sensor's profile and forward the connection request to it. In general, we've defined suitable matching rules to match profiles.

Incorporating the features we've mentioned, all messages in Etherware are XML documents and have three constituent XML tags. Profile identifies the message's recipient and can be a service description (as mentioned) or a component's globally unique binding. Content represents the message's contents and contains all application-specific information. A time stamp is associated with each message. As a message is transferred from one node to another, the time stamp is automatically translated to the destination's local time.

By default, messages are delivered reliably and in order. However, there might be streams of messages that must be delivered using other specifications. For example, a controller might tolerate a few lost sensor updates, trading some reliability for lower delays. To identify and manipulate a stream of messages as a separate entity, Etherware introduces the notion of a MessageStream. A component can open a MessageStream to another component and send messages through it. MessageStreams have settings that you can use to specify how messages are delivered through them.

Changes in operating conditions might also require modifying messages in a MessageStream. For example, updates from the vision sensor could get noisy because of bad lighting conditions. We should be able to filter out this noise without having to modify the sensor or the controller. Etherware supports this by adding Filters to MessageStreams dynamically.
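The following Java sketch shows the general shape of adding a filter to a stream at run time, so that neither the sender nor the receiver changes. It is a simplification under stated assumptions: readings are plain `double` values rather than XML messages, `ReadingFilter` and `ReadingStream` are hypothetical names, and simple exponential smoothing stands in for a real noise filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleConsumer;

/** Hypothetical filter interface: transforms one reading into another. */
interface ReadingFilter {
    double apply(double reading);
}

/** Exponential smoothing, used here as a simple stand-in for a noise filter. */
final class SmoothingFilter implements ReadingFilter {
    private final double alpha;   // weight given to the newest reading, in (0, 1]
    private Double state;         // null until the first reading arrives

    SmoothingFilter(double alpha) { this.alpha = alpha; }

    @Override
    public double apply(double reading) {
        state = (state == null) ? reading : alpha * reading + (1 - alpha) * state;
        return state;
    }
}

/** Stand-in for a message stream: delivers readings through a mutable filter chain. */
final class ReadingStream {
    private final List<ReadingFilter> filters = new ArrayList<>();
    private final DoubleConsumer receiver;

    ReadingStream(DoubleConsumer receiver) { this.receiver = receiver; }

    /** Filters can be added while the stream is in use. */
    void addFilter(ReadingFilter filter) { filters.add(filter); }

    void send(double reading) {
        double value = reading;
        for (ReadingFilter f : filters) {
            value = f.apply(value);   // apply filters in the order they were added
        }
        receiver.accept(value);
    }
}
```

The sender keeps calling `send` and the receiver keeps consuming values; only the stream's configuration changes when a filter is added.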
Figure 2b shows the effective configuration after adding a Kalman filter to the MessageStream between a sensor and a controller. You can also add filters to intercept all messages sent to or received by a component.

Etherware's architecture is based on the microkernel concept (see Figure 3). The kernel manages all components in a single process and represents the minimum invariant in Etherware. In the current implementation, we have one Etherware process per node. The kernel's basic function is to deliver messages between its (local) components. It also exposes a service interface, which can be used to manipulate the components it manages. The kernel has a scheduler that's responsible for scheduling all messages and threads and can be replaced dynamically.

Figure 3. An Etherware process configuration.

As Figure 3 illustrates, each component is encapsulated in its own shell. A shell presents a facade to the component and provides a uniform interface for it to interact with the rest of the system. Shells also encapsulate component-specific information, such as MessageStream configuration, and handle activities involved in component restart, upgrade, and migration.

Service components provide all other functionality in Etherware. Several basic services are used during Etherware's normal operation. Because Etherware implements these services as components, it can also restart or update them dynamically.

Service continuity using Etherware

Etherware supports service continuity during most of the changes we've listed. We've tested the performance of these mechanisms through experiments on a prototype traffic-control application [7].

Related research

Fault-tolerant CORBA (FT-CORBA) is the primary Object Management Group specification that addresses fault tolerance in distributed systems [8]. The key mechanism in FT-CORBA is supporting fault tolerance through redundancy of entities.
However, a chief problem with this model is that, because replicas execute the same algorithms and have the same inputs, they'll have similar failures owing to application errors. Thus, safe component restarts are also necessary to handle such failures. Researchers have also considered other problems that future work must address before we can use FT-CORBA for distributed real-time systems [9].

Software frameworks and middleware for networked control in general are areas of active research. Open Control Platform (OCP) [10] is a Real-Time CORBA-based middleware [11] for reconfigurable control systems. Although OCP supports service continuity during component and service reconfiguration, it doesn't provide mechanisms to tolerate faults in application software. Detailed surveys address related efforts [12, 13].

Conclusion

Etherware is the basis for much ongoing and future research. For example, Simplex is an elegant architecture that supports safe, dynamic upgrades of control software [14]. However, component restarts or upgrades still require the application to reestablish communication channels, and this might affect operational integrity. Because Etherware can maintain channels across such changes, we're working to incorporate the Simplex architecture into it. We're also developing an interface description language to support message type specification and component interaction semantics in Etherware.

We've also implemented a traffic control testbed using Etherware [7]. On the basis of state estimation and buffering techniques [2], we were able to develop the system with soft real-time control. However, Etherware doesn't yet support hard real-time deadlines, so incorporating this is another aspect of our current research.

References