Handling Timing Errors in Distributed Programs
October 1988 (vol. 14 no. 10)
pp. 1525-1535

The authors describe a tool called TAP, which is designed to aid the programmer in discovering the causes of timing errors in running programs. TAP is similar to a postmortem debugger: it uses the history of interprocess communication to construct a timing graph, a directed graph in which an edge joins node x to node y if event x directly precedes event y in time. The programmer can then use TAP to examine the graph and find the events that occurred in an unacceptable order. Because distributed programs are nondeterministic, the authors argue that a history-keeping mechanism must always be active so that bugs can be dealt with as they occur. The goal is to collect enough information at run time to construct the timing graph if it is needed. Since it is always active, this mechanism must be efficient. The authors also describe experiments run using TAP and report the impact that TAP's history-keeping mechanism has on the running time of various distributed programs.
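
The timing graph described above can be illustrated with a minimal sketch. This is not TAP's implementation, which the abstract does not detail. The record format `(process, event, kind, msg)` and the function names `build_timing_graph` and `precedes` are assumptions made for illustration. The sketch also assumes that "directly precedes" means either two consecutive events in the same process or a send and its matching receive, in the sense of Lamport's happened-before relation.

```python
from collections import defaultdict

def build_timing_graph(history):
    """Build a directed graph from a communication history.

    `history` is a list of hypothetical records (process, event, kind, msg).
    Records of the same process must appear in that process's local order;
    records of different processes may be interleaved arbitrarily.
    """
    graph = defaultdict(set)
    last_event = {}  # process -> most recent event seen in that process
    sends = {}       # message id -> send event
    receives = {}    # message id -> receive event

    for proc, event, kind, msg in history:
        graph[event]  # make sure the node exists even if it has no successors
        if proc in last_event:
            # consecutive events in one process: earlier directly precedes later
            graph[last_event[proc]].add(event)
        last_event[proc] = event
        if kind == "send":
            sends[msg] = event
        elif kind == "recv":
            receives[msg] = event

    # a send directly precedes the receive of the same message
    for msg, send_event in sends.items():
        if msg in receives:
            graph[send_event].add(receives[msg])
    return graph

def precedes(graph, x, y):
    """Return True if a path of one or more edges leads from x to y."""
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt == y:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# Hypothetical history: A sends m1 to B, B replies with m2, C works alone.
history = [
    ("A", "a1", "send", "m1"),
    ("B", "b1", "recv", "m1"),
    ("B", "b2", "send", "m2"),
    ("A", "a2", "recv", "m2"),
    ("C", "c1", "local", None),
]
graph = build_timing_graph(history)
print(precedes(graph, "a1", "b2"))  # True: a1 -> b1 -> b2
print(precedes(graph, "b2", "a1"))  # False
print(precedes(graph, "a1", "c1"))  # False: the events are unordered
```

Two events with no path between them in either direction, such as `a1` and `c1` here, are unordered in time. A timing error in this model would show up as a path that the programmer expected to be absent, or as a missing path between events that were expected to be ordered.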

Index Terms:
distributed programs; TAP; timing errors; postmortem debugger; interprocess communication; timing graph; directed graph; history-keeping mechanism; directed graphs; distributed processing; program testing; software tools
Citation:
"Handling Timing Errors in Distributed Programs," IEEE Transactions on Software Engineering, vol. 14, no. 10, pp. 1525-1535, Oct. 1988, doi:10.1109/32.6197