
Reliability and Safety of RealTimeSystems

William W. Everett, AT&T Bell Laboratories
Shinichi Honiden, Toshiba

Pages: pp. 1316

Abstract—Not a day goes by that the general public does not come into contact with a real-time system. As their numbers and importance grow, so do the implications for software developers.


Applications that call for real-time systems are particularly susceptible to failures. Our challenge is to design, analyze, and build such systems to prevent failures or (at least) to mitigate their effect on operations. The theme articles in this issue, summarized in the box ("Article Summaries: Safety and Reliability"), explore different ways to meet this challenge.

What differentiates the development of real-time systems software from other applications? How do reliability and safety figure into their development?

A realtime system must respond to externally generated stimuliwithin a finite, specifiable time delay. Realtime systems aretypically embedded systems that interface directly to the physicalequipment they operate, monitor, and control. Examples of suchsystems are flight and weaponscontrol systems, airtraffic controlsystems, telecommunicationsswitching and network equipment,manufacturing processcontrol, and even speechrecognitionsystems, which are beginning to appear in telecommunicationsnetworks and personal computers.

GROWING PREVALENCE

Not a day passes that we don't come into contact with a real-time system. And now they are becoming more prevalent in critical applications. A failure in a critical application such as a telecommunications system may result in great financial loss; in a flight-control system it may result in loss of life.

More effort must be expended to analyze the reliability and safety of such systems. Analysis of hardware components in critical applications has matured over the years, and commonly followed techniques have emerged. However, methods and techniques for analyzing the reliability and safety of the software part of critical applications are relatively new and still maturing. Yet the system's vulnerability to software failures is on the rise and may (and in some cases does) exceed its vulnerability to hardware failures.

Software is not only becoming more prevalent in real-time systems, it is becoming a larger part of real-time systems. By larger, we mean the effort expended in designing and implementing the software as a fraction of the total development effort.

DIFFERENTIATING FACTORS

What differentiates realtime software development from othersoftware?

♦ First, its design is resource-constrained. The primary resource that is constrained is time. Depending on processor speed, this time constraint equates to the number of processing cycles required to complete the task. However, time is not the only resource that can be constrained. The use of main memory may also be constrained. In fact, designers may trade off reducing processing cycles at the expense of using more memory (a sketch of such a tradeoff follows this list).

♦ Second, realtime software is compact yet complex. Eventhough the entire system may have millions of lines of code,the timecritical part is but a small fraction of the totalsoftware. Yet these timecritical parts are highly complex, withmuch of the complexity introduced to conserve constrainedresources.

♦ Third, unlike other software systems, real-time systems do not have the luxury of failure-recovery mechanisms. There may not be a human around to help the software recover from failure. Such software must detect when failure occurs, continue to operate despite the failure, contain the damage to surrounding data and processes, and recover quickly so as to minimize operating problems (see the second sketch after this list).
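
To make the first point concrete, the following C fragment is a minimal sketch (our illustration, not taken from any article in this issue) of the cycles-for-memory tradeoff: computing a sine value on demand versus reading it from a precomputed table. The table size and function names are hypothetical.

#include <math.h>

/* Option 1: compute sin() on every call -- minimal memory, more cycles. */
static double sine_computed(double radians)
{
    return sin(radians);
}

/* Option 2: precompute a lookup table once -- more memory, fewer cycles
 * per call, at the cost of some quantization error. */
#define TABLE_SIZE 1024
static double sine_table[TABLE_SIZE];

static void sine_table_init(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        sine_table[i] = sin(2.0 * M_PI * i / TABLE_SIZE);
}

static double sine_lookup(double radians)
{
    /* Map the angle onto a table index; accuracy is bounded by TABLE_SIZE. */
    int index = (int)(radians / (2.0 * M_PI) * TABLE_SIZE) % TABLE_SIZE;
    if (index < 0)
        index += TABLE_SIZE;
    return sine_table[index];
}

Whether the table's eight kilobytes are worth spending depends on the cycle budget of the time-critical path it serves.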

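The third point (detect, contain, and recover without human help) is commonly addressed with defensive patterns such as sanity checks on shared data and fallback to a known-safe state. The C fragment below is a minimal, hypothetical sketch of that idea; the data structure, the toy checksum, and the safe defaults are assumptions for illustration, not a technique prescribed by the theme articles.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control-state record protected by a simple checksum so that
 * corruption can be detected before the damage spreads. */
struct control_state {
    double   setpoint;
    double   last_output;
    uint32_t checksum;          /* recomputed on every update */
};

static uint32_t compute_checksum(const struct control_state *s)
{
    /* Toy checksum over the payload fields; a real system might use a CRC. */
    const uint8_t *p = (const uint8_t *)s;
    uint32_t sum = 0;
    for (size_t i = 0; i < offsetof(struct control_state, checksum); i++)
        sum = sum * 31u + p[i];
    return sum;
}

static bool state_is_sane(const struct control_state *s)
{
    return compute_checksum(s) == s->checksum;
}

static void control_step(struct control_state *s)
{
    if (!state_is_sane(s)) {
        /* Contain the damage: fall back to a known-safe state and keep
         * operating rather than waiting for an operator. */
        s->setpoint    = 0.0;   /* assumed safe default */
        s->last_output = 0.0;
        s->checksum    = compute_checksum(s);
        return;
    }
    /* ... normal control computation would go here ... */
    s->checksum = compute_checksum(s);
}
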
In developing realtime software, more time is spent in design,especially analyzing ways to improve performance and enhancereliability and safety. Development of most other software focuseson how to handle a normal situation, but realtime,criticalapplication development also focuses on how to handlethe abnormal situation. Also, more time is spent in testingrealtime systems.

Testing not only removes faults (the underlying source of software failure) and hence improves the software's reliability, it also demonstrates reliability and safety by running software under field conditions. For safety-critical software, additional time is spent in analyzing failure modes that contribute to safety hazards.

ARTICLE SUMMARIES: SAFETY AND RELIABILITY

♦ Software Testability: The New Verification, pp. 17-28

Jeffrey M. Voas and Keith W. Miller

As software begins to replace human decision makers, a fundamental concern is whether a machine will be able to perform the tasks with the same level of precision as a skilled person. The reliability of an automated system must be high enough to avoid a catastrophe. But how do you determine that critical automated systems are acceptably safe and reliable? In this article, we present a new view of verification and offer techniques that will help developers make this assessment. Our view, which we label software testability, looks at dynamic behavior, not just syntax. This differs from traditional verification and testability views.

Our research suggests that software testability clarifies a characteristic of programs that has largely been ignored. We think that testability offers significant insights that are useful during design, testing, and reliability assessment. In conjunction with existing testing and formal verification methods, testability holds promise for quantitative improvement in statistically verified software quality. The methods described in this article are applicable to any software system, not just real-time systems.

♦ Reliability Through Consistency, pp. 29-41

Kenneth P. Birman and Bradford B. Glade

Distributed computing systems are increasingly common in critical settings, in which unavailability can result in loss of revenue or incapacitate information-dependent infrastructure components. In many distributed computing environments, failures are reported in a way that violates even the simplest notions of consistency. We think such practices are at the root of reliability problems.

Failure notification is key to effective failure management. When processes are not aware that another process has failed, they may not operate reliably. Achieving such awareness requires some mechanism for consistent failure reporting among processes. Unfortunately, many modern computing systems treat failure reporting as an application-specific problem, which puts the burden of failure recovery on the application developer. Thus, failure-reporting mechanisms will vary from system to system, even though the reliability requirements of applications span these systems. In turn, this puts the onus on the applications developer to overcome unreliability in the component systems.

Standards bodies have also overlooked this issue: no communications standard today requires consistency in failure reporting. Indeed, no standard even provides for the addition of consistency-preserving mechanisms.

We believe inconsistent failure reporting is one of the major barriers to progress in developing highly reliable, self-managed, distributed software systems and applications. Such inconsistency is dismaying, especially when it seems to be unnecessary: implementing consistent reporting is neither particularly costly nor overwhelmingly difficult. If more developers came to appreciate its value, we believe that a major barrier to distributed application development could be eliminated.

♦ Analyzing Safety Requirements for Process-Control Systems, pp. 42-53

Rogério de Lemos, Amer Saeed, and Tom Anderson

Demands for dependability are rising faster than what can currently be achieved, especially for complex systems. As the trend continues, researchers are looking more closely at the requirements phase as the potential solution for managing errors. Experience shows that mistakes made during this phase can easily introduce faults that lead to accidents, so preventing faults during this phase should produce more dependable systems.

However, in safetycritical systems, you cannot assess theadequacy of safety requirements except with respect to overallsystem risk. This means that you must also conduct a safetyanalysis of the resulting safety specifications to ensure that thesoftware's contribution to system risk is acceptable.

We have developed an approach to identify and analyze the safety requirements of safety-critical process-control systems. Our approach increases the visibility of the requirements process by partitioning the analysis into distinct phases that are based on the domains of analysis established from the system structure. Freedom to tailor a technique to a specific domain of analysis and perspective ensures that the most appropriate techniques are applied. Formally recording the safety specifications helps in building and comparing safety strategies. Finally, using both qualitative and quantitative techniques gives a precise picture of the specification's contribution to overall system risk.

♦ Scheduling in Hard Real-Time Applications, pp. 54-64

Jiang Zhu and Ted G. Lewis

A major problem with hard real-time systems is how to confirm that they really work. In these systems, the computer periodically gets information from the environment through sensors, updates its internal system states based on those inputs and its current internal states, and generates control commands to change the environment through actuators.

The success of such a system depends not only on its logical correctness but also on its timing correctness. A timing-dependent, life-critical system is called a hard real-time system. It must make correct responses to environmental changes within specified time intervals, or deadlines.
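
The sense-compute-actuate cycle the authors describe is typically realized as a periodic task that must complete each iteration before its deadline. The following C sketch, which uses POSIX timing calls and plant hooks of our own invention, shows the general shape of such a loop; it is an illustration, not code from the article.

#define _POSIX_C_SOURCE 199309L   /* for clock_gettime/clock_nanosleep */
#include <stdio.h>
#include <time.h>

#define PERIOD_NS 10000000L       /* 10-ms period, i.e., a 100-Hz loop */

/* Hypothetical plant interface; a real system would talk to device drivers. */
static double read_sensor(void)              { return 0.0; }
static double compute_command(double sensed) { return -sensed; }
static void   write_actuator(double command) { (void)command; }

static void add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec  += 1;
    }
}

static void control_loop(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        /* Sense, compute, actuate within this period. */
        double input   = read_sensor();
        double command = compute_command(input);
        write_actuator(command);

        /* Check the deadline: if we are already past the next release time,
         * this iteration overran its budget. */
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        add_ns(&next, PERIOD_NS);
        if (now.tv_sec > next.tv_sec ||
            (now.tv_sec == next.tv_sec && now.tv_nsec > next.tv_nsec))
            fprintf(stderr, "deadline overrun\n");

        /* Sleep until the next release time (absolute, to avoid drift). */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}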

Our work involves proving theorems that guarantee deadlines in a uniprocessor environment will be met and developing new graphical languages for the design of hard real-time systems. Our research has shown that we can effectively combine graphical design languages and deadline-scheduling algorithms. Our graphical design language integrates dataflow and control flow into one diagram, which easily reveals the entire picture of a hard real-time application. This picture is much more difficult to grasp when the dataflow and control-flow diagrams are viewed separately.

We have developed a CASE tool that implements the design diagram and our schedulability-checking methods. The tool automatically transforms a graphical design into a task set and then schedules it. The tool can be used by control-application engineers to graphically design an application and automatically analyze the design for schedulability on a given processor. If the design is feasible, it can then be automatically transformed into Ada.
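
The article develops its own scheduling theory and tool support, which are not reproduced here. As a simpler, related illustration of a schedulability check, the C sketch below applies the classic Liu and Layland utilization bound for rate-monotonic scheduling on a uniprocessor: a set of n periodic tasks meets all deadlines if the total utilization, the sum of Ci/Ti, is at most n(2^(1/n) - 1). The task set shown is hypothetical.

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

struct task {
    double compute_time;   /* worst-case execution time C_i */
    double period;         /* period (and deadline) T_i     */
};

/* Sufficient (not necessary) test: if total utilization is at or below the
 * Liu-Layland bound n(2^(1/n) - 1), rate-monotonic scheduling meets all
 * deadlines on one processor. */
static bool rm_schedulable(const struct task *tasks, int n)
{
    double utilization = 0.0;
    for (int i = 0; i < n; i++)
        utilization += tasks[i].compute_time / tasks[i].period;

    double bound = n * (pow(2.0, 1.0 / (double)n) - 1.0);
    return utilization <= bound;
}

int main(void)
{
    /* Hypothetical task set: (C, T) pairs in milliseconds. */
    struct task set[] = { {1.0, 10.0}, {2.0, 20.0}, {6.0, 50.0} };
    int n = sizeof set / sizeof set[0];

    printf("schedulable by the utilization bound: %s\n",
           rm_schedulable(set, n) ? "yes" : "no");
    return 0;
}

Because the bound is sufficient but not necessary, a task set that fails this test may still be schedulable; an exact response-time analysis would be needed to decide.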

SIGNPOSTS AND LANDMARKS: RELIABILITY AND SAFETY OF REAL-TIME SYSTEMS

Because there are contributions in this issue in three important areas — real-time systems, reliability, and safety — we will outline where you can obtain more information in each area separately.

REALTIME SYSTEMS.

Two good places to start are the "Real-Time Realities" (September 1992) and "High Performance" (September 1991) special issues of IEEE Software. In addition, the survey chapter in the Software Engineer's Reference Book (John A. McDermid, ed., Butterworth-Heinemann, 1991) and the section entitled "Real-Time Resource Management Techniques" (pp. 1011-1020) in the Encyclopedia of Software Engineering (John J. Marciniak, ed., John Wiley & Sons, 1994) are useful references. Finally, Foundations of Real-Time Computing, edited by A. van Tilborg and G.M. Koob (Kluwer Academic Publishers, 1991), and the two volumes of Lecture Notes in Computer Science edited by J.W. de Bakker (Springer-Verlag, 1992) include the definitive readings in real-time systems.

RELIABILITY.

Again, the IEEE Software special issues "Steps to Practical Reliability Measurement" (July 1991) and "Testing the Limits of Test Technology" (March 1991) are good places to start. Two short articles — "Software Reliability Engineering" and "Software Reliability Theory" — in the Encyclopedia of Software Engineering and the article "Software Reliability Modelling" in the Software Engineer's Reference Book provide nice reviews. Also, the article "Testing" in the Software Engineer's Reference Book is a good survey article on testing technology. For the serious reader, I suggest Software Reliability: Measurement, Prediction, Application (J.D. Musa, A. Iannino, and K. Okumoto, McGraw-Hill, 1987).

SAFETY.

Here we recommend the IEEE Software theme issue on the "Critical Task of Writing Dependable Software" (January 1994), the articles on safety in the Encyclopedia of Software Engineering and the Software Engineer's Reference Book, and the article "Software Safety in Embedded Computer Systems" by Nancy Leveson (Communications of the ACM, Feb. 1991).

— William Everett

About the Authors

Address questions about this issue to Everett at AT&T Bell Labs, Room 3D-416, PO Box 638, Murray Hill, NJ 07974-0636; w.w.everett@att.com; or to Honiden at Systems and Software Engineering Laboratory, Toshiba Corp., 70 Yanagi-cho, Saiwai-ku, Kawasaki 210, Japan; honiden@ssel.toshiba.co.jp.