, IBM T.J. Watson Research Center
, IBM T.J. Watson Research Center
, University of Utah
, Wright State University
Pages: pp. 14-21
The Web has completely changed the way in which we share data, rapidly shifting us from a world of paper documents to a world of digital objects that include online documents, videos, photos, artwork, and databases. This shift has also made data management an increasingly complex problem as applications take advantage of loosely coupled resources brought together by distributed computing systems and abundant storage capacity. It's now easier than ever to modify documents, particularly with the help of general-purpose specifications such as XML, and extract data from documents or databases through the use of technologies such as query languages, REST interfaces, and Web service interconnectivity.
It's likewise easier to modify and update digital objects, and to do so collaboratively, via social collaboration platforms such as YouTube, Flickr, Facebook, Second Life, and Many Eyes (http://manyeyes.alphaworks.ibm.com). But as increasing volumes of data are shared and modified, it's crucial to track their provenance. Stemming from the French word provenir ("to come from"), provenance means the origin, or the source, of something, or the history of an object's ownership or location. A digital object's provenance (also referred to as audit trail and lineage) contains information about both the process and data used to derive the object. Provenance also provides documentation that's vital to preserving data, determining the data's quality and authorship, and reproducing as well as validating results. From the ability to reproduce digital objects to assessing data quality to enabling the enforcement of intellectual property rights and composite licensing, the provenance of digital items published and exchanged over the Web is exceedingly important. 1,2
Provenance has many different and compelling applications.
Geographically dispersed businesses have to manage data aggregated from different parts of the enterprise into a data warehouse. Business provenance gives the flexibility to selectively capture information required to address a specific compliance or performance goal. 3 Additionally, correlation mechanisms built on top of provenance stores can yield a representation of end-to-end operations that puts each business artifact into the right context. Execution traces of end-to-end business operations generated by provenance can capture an enterprise's operational aspects, enable modeling and predictive analytics for the business process represented by the traces, 4 and measure compliance with business rules and regulations. 3
Provenance is essential in science. Because reproducibility is the cornerstone of the scientific process, detailed provenance must be captured so that researchers can reproduce and validate results. Provenance is particularly important when computationally intensive science is carried out in highly distributed network environments using Internet-based collaboration tools. 5,6 More recently, with the emergence of open science in which data is widely shared and social tools are available that allow scientists to collaboratively explore data and solve problems (see myexperiment.org, www.crowdlabs.org, and www.nanohub.org), provenance is needed for tracking how experimental data is exchanged and contributed to by many different people over potentially long periods of time. Lately, the issue of publishing reproducible research has started to receive attention in the scientific community. 7
Provenance is vital from a social networking and Web 2.0 perspective as well. 8 Relationship discovery and community detection can be achieved on the basis of information aggregated from blogs, social bookmarking tools (such as IBM's Dogear and Delicious.com), and social networking sites. While tracking the provenance of a user's tagging behavior can give insight into his or her relationships, tracking how social networks evolve can potentially shed light on how people interact in the digital world.
In sensor networks, we can combine raw data from heterogeneous sensors with background knowledge, along with a variety of analytical and reasoning support, to deliver improved situational awareness to end users. 9 In the process, original data is transformed, merged, and processed in myriad ways, so provenance can be a key tool in addressing challenges such as the trustworthiness of both data and decisions.
A provenance management solution must deal with three main problems: how to capture provenance, which information to capture and how to model it, and how to store and efficiently access the information.
Different provenance capture mechanisms are available, depending on the tools and environment in which digital objects are created. 10 For computational tasks specified as a workflow, the workflow engine can capture the tasks' steps, parameters, and data used; execution information; and user-specified annotations. 10,11 Workflow systems such as Taverna (http://taverna.sourceforge.net), Kepler (http://kepler-project.org), and VisTrails (www.vistrails.org) support provenance capture.
Process-based provenance capture mechanisms require each service or process involved in a computational task to document itself, with any information derived from autonomous processes pieced together to provide documentation for composite tasks. Operating system- (OS-) based mechanisms require no modification to existing scripts or programs. Instead, they rely on the OS environment's ability to transparently capture data and data process dependencies at the kernel (via the file system interface) or user levels (via the system call tracer). Because there's no formal specification associated with a task, in OS-based approaches, the provenance information is obtained by extracting relationships between system calls and tasks. When we consider social and sensor data, or citizen sensing reported via mobile devices (for example, a tweet report using a smartphone), we discover a large variety of interesting forms of metadata potentially relevant to provenance, such as user profile, device-collected metadata (location and GPS information), and time and sensor-related metadata (accelerometer information and the user's cultural background).
Different models support different kinds of provenance, including retrospective provenance, which represents the steps executed as well as information about the environment used to derive a specific data product (a detailed log of a computational task's execution), and prospective provenance, which captures the steps that must be followed to derive a particular type of data product. In essence, provenance is a graph that models data and process dependencies. For example, in scientific workflow systems, the provenance graph mirrors the workflow graph.
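To make the graph view concrete, here is a minimal Python sketch of retrospective provenance stored as a dependency graph. It is an illustration only, not the model used by any of the systems discussed here; the class name, method names, and the file and step names are all hypothetical.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy retrospective provenance graph (hypothetical, for illustration).

    Both data items and process runs are nodes; an edge points from a
    node to the node it was directly derived from.
    """

    def __init__(self):
        # Maps each node to the set of nodes it directly depends on.
        self.derived_from = defaultdict(set)

    def record(self, process, inputs, outputs):
        """Record that `process` consumed `inputs` and produced `outputs`."""
        for item in inputs:
            self.derived_from[process].add(item)
        for item in outputs:
            self.derived_from[item].add(process)

    def lineage(self, node):
        """Return every node that `node` transitively depends on."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in self.derived_from.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical two-step workflow: an alignment step feeds a variant-calling step.
graph = ProvenanceGraph()
graph.record("align", inputs=["reads.fastq", "genome.fa"], outputs=["aligned.bam"])
graph.record("call_variants", inputs=["aligned.bam"], outputs=["variants.vcf"])

# Everything the final data product depends on, both processes and data.
lineage = graph.lineage("variants.vcf")
```

Here the recorded edges follow the executed workflow steps, which is the sense in which the provenance graph mirrors the workflow graph. A prospective model would instead store the workflow specification itself.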
Despite a base commonality, provenance models tend to vary according to domain and user needs. Taverna, for instance, was developed to support the creation and management of workflows in the bioinformatics domain, so it provides an infrastructure that includes support for ontologies available in this domain. 12 VisTrails was designed to support exploratory tasks, such as simulations, data exploration, and visualization in which workflows are iteratively refined, and thus uses a model that treats workflow specifications as first-class data products and captures the provenance of workflow evolution. 13 Recently, there has been an effort to create an open model that allows provenance information to be freely exchanged across systems. 14 Provenir, a provenance ontology, advocates and supports the capture of semantic provenance — that is, domain-specific semantics (such as those specified using ontologies or domain models) — in addition to data and workflow provenance. 15
A wide variety of provenance storage and retrieval systems have been proposed, ranging from specialized Semantic Web languages and XML dialects stored as files to tuples stored in relational database tables. One of the advantages of file system storage is that users don't need additional infrastructure to store provenance information. However, a relational database does provide centralized, efficient storage that a group of users can share. Recently, researchers have attempted to explore the utility of a cloud architecture for storing data provenance; 16 those supporting semantic provenance prefer to use RDF, 15 which is now a broadly adopted Semantic Web language.
The Linked Open Data (LOD) initiative has also made a massive number of datasets available on the Semantic Web. 1 In particular, it promotes the publication of data in machine-accessible formats and linking among heterogeneous data items. Linked data is represented in RDF and can be queried using SPARQL. This large-scale initiative already consists of billions of interlinked data items, including scientific datasets that now form a large graph for easy result navigation.
A common feature across many approaches to querying provenance is that their solutions are closely tied to the storage models used. Hence, they require users to write queries in languages such as SQL, Prolog, and SPARQL. Although such general languages are useful to those already familiar with their syntax, they weren't designed specifically for provenance, which means simple queries can be awkward and complex to write. The VisTrails system uses a language specifically designed to query workflows and their provenance and includes a visual interface that lets users specify queries in the same environment they use to construct workflows. 10 Some provenance models use Semantic Web technology both to represent and query provenance information. Semantic Web languages such as RDF and OWL combined with SPARQL provide a natural way to model provenance graphs and the ability to represent complex knowledge, such as annotations and metadata. Recent work that demonstrates the scalability of Semantic Web infrastructures in handling large provenance stores is now emerging. 15
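As a small illustration of why a general-purpose language can feel awkward here, the following Python sketch stores dependencies as tuples in a relational table and answers a basic lineage question in SQL. The table name, column names, and file names are assumptions made for this example, and SQLite stands in for whatever relational store a real system would use.

```python
import sqlite3

# In-memory database with one hypothetical table of direct dependencies.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE derived_from (child TEXT, parent TEXT)")
conn.executemany(
    "INSERT INTO derived_from VALUES (?, ?)",
    [
        ("variants.vcf", "aligned.bam"),
        ("aligned.bam", "reads.fastq"),
        ("aligned.bam", "genome.fa"),
    ],
)

# "What did this item ultimately come from?" needs a recursive common
# table expression, because lineage is a transitive closure.
LINEAGE_QUERY = """
WITH RECURSIVE ancestors(item) AS (
    SELECT parent FROM derived_from WHERE child = ?
    UNION
    SELECT d.parent FROM derived_from AS d
    JOIN ancestors AS a ON d.child = a.item
)
SELECT item FROM ancestors ORDER BY item
"""

ancestors = [row[0] for row in conn.execute(LINEAGE_QUERY, ("variants.vcf",))]
```

The question being asked is simple, yet the query has to spell out the graph traversal by hand. That gap is what provenance-specific query languages and visual interfaces try to close.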
The four articles in this special issue address some of the challenges involved in constructing and using provenance today.
In the article "From Business Processes to Process Spaces," Hamid Reza Motahari-Nezhad, Boualem Benatallah, Fabio Casati, and Regis Saint-Paul propose a novel system architecture to capture business provenance by enabling the discovery and understanding of relationships between business or scientific process artifacts. They propose the "process space" as a new abstraction for process management in modern-day, dynamic, and distributed business process environments. Process-space management systems (PSMSs) will enable definition, analysis, and management of process spaces over process artifacts. Furthermore, they offer the notion of process views in a process space to represent the process execution from various perspectives (different systems, business functions, or users) and at various levels of abstractions (detailed or abstract).
Yannis Theoharis, Irini Fundulaki, Grigoris Karvounarakis, and Vassilis Christophides, in their article "On Provenance of Queries on Semantic Web Data," introduce abstract provenance models to capture the relationship between query results and source data by taking into account the query operators. This information can be recorded in the repository when the data is imported to compute appropriate annotations for different applications and users at a later time. They argue for the benefits of this approach in settings where data is materialized in repositories from various sources and there's a need to assess its quality afterward. Queries can combine data from different sources, some of which are trusted; multiple sources can be involved in alternative derivations of an item in the query result. To make trust judgments, more detailed provenance expressions are required that, in addition to provenance tokens, also record query operators involved in the derivation of a data item, thereby storing information on how input data items were combined to produce the resulting data item.
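The idea of recording operators alongside provenance tokens can be sketched as a small expression tree. The Python below is a generic illustration under our own assumptions, not the model defined in the article: the class names, the two operators, and the trust rule (a joint derivation needs all of its inputs trusted, an alternative derivation needs only one) are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A provenance token naming one source of an input data item."""
    token: str


@dataclass(frozen=True)
class Join:
    """Both operands were combined to produce the result."""
    left: object
    right: object


@dataclass(frozen=True)
class Alt:
    """Either operand alone is an alternative derivation of the result."""
    left: object
    right: object


def is_trusted(expr, trusted_tokens):
    """Evaluate one possible trust policy over a provenance expression."""
    if isinstance(expr, Source):
        return expr.token in trusted_tokens
    if isinstance(expr, Join):
        return is_trusted(expr.left, trusted_tokens) and is_trusted(expr.right, trusted_tokens)
    if isinstance(expr, Alt):
        return is_trusted(expr.left, trusted_tokens) or is_trusted(expr.right, trusted_tokens)
    raise TypeError(f"unknown provenance expression: {expr!r}")


# A result derived either by joining items from s1 and s2, or directly from s3.
expr = Alt(Join(Source("s1"), Source("s2")), Source("s3"))

trusted_via_join = is_trusted(expr, {"s1", "s2"})   # both join inputs trusted
untrusted_partial = is_trusted(expr, {"s1"})        # join incomplete, s3 untrusted
trusted_via_alt = is_trusted(expr, {"s3"})          # alternative derivation suffices
```

Because the expression is stored rather than a precomputed verdict, the same record can be re-evaluated later under a different set of trusted sources or a different policy.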
"Extending Semantic Provenance into the Web of Data," by Jun Zhao, Satya S. Sahoo, Paolo Missier, Amit Sheth, and Carole Goble, describes a single metadata architecture based on the Provenir upper-level provenance ontology that combines workflow provenance, semantics, domain-specific annotations, and LOD conventions to answer complex user queries in the context of a bioinformatics workflow. This article also describes Janus, a semantic and linked data-aware provenance infrastructure that operates on metadata produced by the Taverna workflow system. Janus demonstrates the use of semantic provenance to answer domain-specific user questions, the use of provenance query operators to implement those questions, and the use of semantics to expose provenance collected during workflow execution as part of the LOD cloud. It also demonstrates how LOD-aware provenance queries, not supported earlier in scientific workflows, can be answered.
Finally, "Papel: Provenance-Aware Policy Definition and Execution" by Christoph Ringelstein and Steffen Staab introduces a formal language that specifies the relationship between policy conditions and provenance information, based on the Open Provenance Model. Existing policy languages can't express policies that make statements about the properties of data (and the flow of data), and this article seeks to fill this gap by enabling policy conditions to relate to provenance information.
The four articles in this special issue address only a handful of topics in provenance for Web applications. We anticipate that the growth of the Web, the increased sharing of scientific, social, and sensor data, and the broad adoption of data sharing on the Web, such as through the LOD initiative, will fuel an explosion in the demand for provenance systems.
We express our gratitude to the authors, whose submissions made this special issue possible and showed how active and exciting this field is; the reviewers, who volunteered their time to read and provide constructive comments to help the authors improve the quality of their articles; the IEEE Internet Computing editorial staff members who helped us manage the submissions; and the magazine's editorial board for its guidance in the process.