William Everett, AT&T Bell Laboratories
Pages: pp. 13-16
Abstract—Not a day goes by that the general public does not come into contact with a real-time system. As their numbers and importance grow, so do the implications for software developers.
Applications that call for real-time systems are particularly susceptible to failures. Our challenge is to design, analyze, and build such systems to prevent failures or (at least) to mitigate their effect on operations. The theme articles in this issue, summarized in the box ("Article Summaries: Safety and Reliability"), explore different ways to meet this challenge.
What differentiates the development of real-time systems software from other applications? How do reliability and safety figure into their development?
A real-time system must respond to externally generated stimuli within a finite, specifiable time delay. Real-time systems are typically embedded systems that interface directly to the physical equipment they operate, monitor, and control. Examples of such systems are flight- and weapons-control systems, air-traffic control systems, telecommunications switching and network equipment, manufacturing process control, and even speech-recognition systems, which are beginning to appear in telecommunications networks and personal computers.
Not a day passes that we don't come into contact with a real-time system. And now they are becoming more prevalent in critical applications. A failure in a critical application such as a telecommunications system may result in great financial loss; in a flight-control system it may result in loss of life.
More effort must be expended to analyze the reliability and safety of such systems. Analysis of hardware components in critical applications has matured over the years, and commonly followed techniques have emerged. However, methods and techniques for analyzing the reliability and safety of the software part of critical applications are relatively new and still maturing. Yet the system's vulnerability to software failures is on the rise and may (and in some cases does) exceed its vulnerability to hardware failures.
Software is not only becoming more prevalent in real-time systems, it is becoming a larger part of them. By larger, we mean the proportion of total development effort that is expended in designing and implementing the software.
What differentiates real-time software development from other software?
♦ First, its design is resource-constrained. The primary resource that is constrained is time. Depending on processor speed, this time constraint equates to the number of processing cycles required to complete the task. However, time is not the only resource that can be constrained. The use of main memory may also be constrained. In fact, designers may trade off reducing processing cycles at the expense of using more memory.
♦ Second, real-time software is compact yet complex. Even though the entire system may have millions of lines of code, the time-critical part is but a small fraction of the total software. Yet these time-critical parts are highly complex, with much of the complexity introduced to conserve constrained resources.
♦ Third, unlike other software systems, real-time systems do not have the luxury of failure-recovery mechanisms. There may not be a human around to help the software recover from failure. Such software must detect when a failure occurs, continue to operate despite the failure, contain the damage to surrounding data and processes, and recover quickly so as to minimize operating problems.
In developing real-time software, more time is spent in design, especially in analyzing ways to improve performance and enhance reliability and safety. Development of most other software focuses on how to handle the normal situation, but real-time, critical-application development also focuses on how to handle the abnormal situation. Also, more time is spent in testing real-time systems.
Testing not only removes faults (the underlying source of software failure) and hence improves reliability, it also demonstrates reliability and safety by running the software under field conditions. For safety-critical software, additional time is spent analyzing failure modes that contribute to safety hazards.
♦ Software Testability: The New Verification, pp. 17-28
Jeffrey M. Voas and Keith W. Miller
As software begins to replace human decision makers, a fundamental concern is whether a machine will be able to perform the tasks with the same level of precision as a skilled person. The reliability of an automated system must be high enough to avoid a catastrophe. But how do you determine that critical automated systems are acceptably safe and reliable? In this article, we present a new view of verification and offer techniques that will help developers make this assessment. Our view, which we label software testability, looks at dynamic behavior, not just syntax. This differs from traditional verification and testability views.
Our research suggests that software testability clarifies a characteristic of programs that has largely been ignored. We think that testability offers significant insights that are useful during design, testing, and reliability assessment. In conjunction with existing testing and formal verification methods, testability holds promise for quantitative improvement in statistically verified software quality. The methods described in this article are applicable to any software system, not just real-time systems.
♦ Reliability Through Consistency, pp. 29-41
Kenneth P. Birman and Bradford B. Glade
Distributed computing systems are increasingly common in critical settings, in which unavailability can result in loss of revenue or incapacitate information-dependent infrastructure components. In many distributed computing environments, failures are reported in a way that violates even the simplest notions of consistency. We think such practices are at the root of reliability problems.
Failure notification is key to effective failure management. When processes are not aware that another process has failed, they may not operate reliably. Achieving such awareness requires some mechanism for consistent failure reporting among processes. Unfortunately, many modern computing systems treat failure reporting as an application-specific problem, which puts the burden of failure recovery on the application developer. Thus, failure-reporting mechanisms will vary from system to system, even though the reliability requirements of applications span these systems. In turn, this puts the onus on the application developer to overcome unreliability in the component systems.
Standards bodies have also overlooked this issue: no communications standard today requires consistency in failure reporting. Indeed, no standard even provides for the addition of consistency-preserving mechanisms.
We believe inconsistent failure reporting is one of the major barriers to progress in developing highly reliable, self-managed, distributed software systems and applications. Such inconsistency is dismaying, especially when it seems to be unnecessary: implementing consistent reporting is neither particularly costly nor overwhelmingly difficult. If more developers came to appreciate its value, we believe that a major barrier to distributed application development could be eliminated.
♦ Analyzing Safety Requirements for Process-Control Systems, pp. 42-53
Rogério de Lemos, Amer Saeed, and Tom Anderson
Demands for dependability are rising faster than what can currently be achieved, especially for complex systems. As the trend continues, researchers are looking more closely at the requirements phase as a potential solution for managing errors. Experience shows that mistakes made during this phase can easily introduce faults that lead to accidents, so preventing faults during this phase should produce more dependable systems.
However, in safety-critical systems, you cannot assess the adequacy of safety requirements except with respect to overall system risk. This means that you must also conduct a safety analysis of the resulting safety specifications to ensure that the software's contribution to system risk is acceptable.
We have developed an approach to identify and analyze the safety requirements of safety-critical process-control systems. Our approach increases the visibility of the requirements process by partitioning the analysis into distinct phases that are based on the domains of analysis established from the system structure. Freedom to tailor a technique to a specific domain of analysis and perspective ensures that the most appropriate techniques are applied. Formally recording the safety specifications helps in building and comparing safety strategies. Finally, using both qualitative and quantitative techniques gives a precise picture of the specification's contribution to overall system risk.
♦ Scheduling in Hard Real-Time Applications, pp. 54-64
Jiang Zhu and Ted G. Lewis
A major problem with hard real-time systems is how to confirm that they really work. In these systems, the computer periodically gets information from the environment through sensors, updates its internal system states based on those inputs and its current internal states, and generates control commands to change the environment through actuators.
The success of such a system depends not only on its logical correctness but also on its timing correctness. A timing-dependent, life-critical system is called a hard real-time system. It must make correct responses to environmental changes within specified time intervals, or deadlines.
Our work involves proving theorems that guarantee deadlines in a uniprocessor environment will be met and developing new graphical languages for the design of hard real-time systems. Our research has shown that we can effectively combine graphical design languages and deadline-scheduling algorithms. Our graphical design language integrates dataflow and control flow into one diagram, which easily reveals the entire picture of a hard real-time application. This picture is much more difficult to grasp when the dataflow and control-flow diagrams are viewed separately.
We have developed a CASE tool that implements the design diagram and our schedulability-checking methods. The tool automatically transforms a graphical design into a task set and then schedules it. The tool can be used by control-application engineers to graphically design an application and automatically analyze the design for schedulability on a given processor. If the design is feasible, it can then be automatically transformed into Ada.
Because there are contributions in this issue in three important areas — real-time systems, reliability, and safety — we will outline where you can obtain more information in each area separately.
Two good places to start are the "Real-Time Realities" (September 1992) and "High Performance" (September 1991) special issues of IEEE Software. In addition, the survey chapter in the Software Engineer's Reference Book (John A. McDermid, ed., Butterworth/Heinemann, 1991) and the section entitled "Real-Time Resource Management Techniques" (pp. 1011-1020) in the Encyclopedia of Software Engineering (John J. Marciniak, ed., John Wiley & Sons, 1994) are useful references. Finally, Foundations of Real-Time Computing, edited by A. van Tilborg and G.M. Koob (Kluwer Academic Publishers, 1991), and the two Lecture Notes in Computer Science volumes edited by J.W. de Bakker (Springer-Verlag, 1992) include the definitive readings in real-time systems.
Again, the IEEE Software special issues "Steps to Practical Reliability Measurement" (July 1991) and "Testing the Limits of Test Technology" (March 1991) are good places to start. Two short articles — "Software Reliability Engineering" and "Software Reliability Theory" — in the Encyclopedia of Software Engineering and the article "Software Reliability Modelling" in the Software Engineer's Reference Book provide nice reviews. Also, the article "Testing" in the Software Engineer's Reference Book is a good survey of testing technology. For the serious reader, I suggest Software Reliability: Measurement, Prediction, Application (J.D. Musa, A. Iannino, and K. Okumoto, McGraw-Hill, 1987).
Here we recommend the IEEE Software theme issue on the "Critical Task of Writing Dependable Software" (January 1994), the articles on safety in the Encyclopedia of Software Engineering and the Software Engineer's Reference Book, and the article "Software Safety in Embedded Computer Systems," by Nancy Leveson (Communications of the ACM, Feb. 1995).
— William Everett