2017 IEEE International Conference on Services Computing (SCC) (2017)
Honolulu, Hawaii, United States
June 25, 2017 to June 30, 2017
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/SCC.2017.65
Provenance is information about the derivation history of a data product. It is important for evaluating the quality and trustworthiness of a data product and for ensuring the reproducibility of scientific discoveries. Much research has been done on storing and querying scientific workflow provenance, that is, provenance produced by executing data-centric scientific workflows. To address the big data challenges of increasing volume, velocity, and variety, a new generation of scientific workflows, called big data workflows, is under active research. As both data and workflows grow in scale, the scale of provenance grows with them, calling for a new scalable storage and querying infrastructure. This paper leverages Pig Latin, the high-level language of Apache Pig for creating programs that run on Apache Hadoop, and OPQL, a graph-level provenance query language, to build a scalable provenance storage and querying system for big data workflows. Our main contributions are: (i) we propose algorithms to translate OPQL constructs to equivalent Pig Latin programs, (ii) we extend OPQL to support the W3C PROV-DM standard provenance model, (iii) we develop and evaluate our system on provenance datasets from the UTPB benchmark, and (iv) we create visual OPQL constructs in the DATAVIEW big data workflow system so that complex OPQL queries can be composed easily in a visual workflow style. Our preliminary experimental study shows the feasibility of our framework for big-data-scale provenance storage and querying.
Data models, Big Data, Database languages, Engines, Xenon, Visualization, Standards
F. A. Bhuyan, S. Lu, D. Ruan and J. Zhang, "Scalable Provenance Storage and Querying Using Pig Latin for Big Data Workflows," 2017 IEEE International Conference on Services Computing (SCC), Honolulu, Hawaii, United States, 2017, pp. 459-466.