Recording Process Documentation for Provenance
September 2009 (vol. 20 no. 9)
pp. 1246-1259
Paul Groth, University of Southern California, Marina del Rey
Luc Moreau, University of Southampton, Southampton
Scientific and business communities are adopting large-scale distributed systems as a means to solve a wide range of resource-intensive tasks. These communities also have requirements in terms of provenance. We define the provenance of a result produced by a distributed system as the process that led to that result. This paper describes a protocol for recording documentation of a distributed system's execution. The distributed protocol guarantees that documentation with characteristics suitable for accurately determining the provenance of results is recorded. These characteristics are confirmed through a number of proofs based on an abstract state machine formalization.

Index Terms:
Provenance, lineage, grids, distributed systems, data protocols.
Citation:
Paul Groth, Luc Moreau, "Recording Process Documentation for Provenance," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 9, pp. 1246-1259, Sept. 2009, doi:10.1109/TPDS.2008.215