2011 IEEE 27th International Conference on Data Engineering (2011)
Apr. 11, 2011 to Apr. 16, 2011
Lei Dou, UC Davis Genome Center, University of California, Davis, 95616, USA
Daniel Zinn, UC Davis Genome Center, University of California, Davis, 95616, USA
Timothy McPhillips, UC Davis Genome Center, University of California, Davis, 95616, USA
Sven Kohler, UC Davis Genome Center, University of California, Davis, 95616, USA
Sean Riddle, UC Davis Genome Center, University of California, Davis, 95616, USA
Shawn Bowers, Department of Computer Science, Gonzaga University, Spokane, WA 99258, USA
Bertram Ludascher, UC Davis Genome Center, University of California, Davis, 95616, USA
Scientific workflow systems are used to integrate existing software components (actors) into larger analysis pipelines to perform in silico experiments. Current approaches for handling data in nested-collection structures, as required in many scientific domains, lead to many record-management actors (shims) that make the workflow structure overly complex, and as a consequence hard to construct, evolve and maintain. By constructing and executing workflows from bioinformatics and geosciences in the Kepler system, we will demonstrate how COMAD (Collection-Oriented Modeling and Design), an extension of conventional workflow design, addresses these shortcomings. In particular, COMAD provides a hierarchical data stream model (as in XML) and a novel declarative configuration language for actors that functions as a middleware layer between the workflow's data model (streaming nested collections) and the actor's data model (base data and lists thereof). Our approach allows actor developers to focus on the internal actor processing logic oblivious to the workflow structure. Actors can then be re-used in various workflows simply by adapting actor configurations. Due to streaming nested collections and declarative configurations, COMAD workflows can usually be realized as linear data processing pipelines, which often reflect the scientific data analysis intention better than conventional designs. This linear structure not only simplifies actor insertions and deletions (workflow evolution), but also decreases the overall complexity of the workflow, reducing future effort in maintenance.
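To make the idea of a configuration layer between a nested-collection stream and a list-processing actor more concrete, the following is a minimal sketch in Python. It is not COMAD's configuration language and does not use any Kepler API; the token encoding (`open`, `close`, `data`), the function names `scoped_actor` and `pipeline`, and the matching of a scope by collection name are all assumptions made for illustration. The wrapped actor only ever sees a flat list of base data, while the wrapper handles the surrounding collection structure and passes everything else through unchanged.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical token encoding of a streamed nested collection:
# ("open", name), ("close", name), or ("data", value).
Token = Tuple[str, object]


def scoped_actor(scope: str,
                 func: Callable[[List[object]], object],
                 stream: Iterable[Token]) -> Iterator[Token]:
    """Run a plain list-processing function on every collection named `scope`.

    All incoming tokens are forwarded unchanged. For each matching collection,
    the result of `func` over its direct data items is added as a new data
    token just before the collection closes.
    """
    depth = 0                   # 0 means we are outside any matching collection
    buffer: List[object] = []   # direct data items of the current match
    for kind, value in stream:
        if kind == "open" and depth == 0 and value == scope:
            depth = 1
            buffer = []
        elif depth > 0 and kind == "open":
            depth += 1          # a sub-collection inside the match
        elif depth > 0 and kind == "close":
            depth -= 1
            if depth == 0:
                yield ("data", func(buffer))
        elif depth == 1 and kind == "data":
            buffer.append(value)
        yield (kind, value)


def pipeline(stream: Iterable[Token], *stages) -> Iterable[Token]:
    """Chain (scope, func) stages into one linear pipeline."""
    for scope, func in stages:
        stream = scoped_actor(scope, func, stream)
    return stream


demo_stream = [
    ("open", "Project"),
    ("open", "Sample"), ("data", 3), ("data", 5), ("close", "Sample"),
    ("open", "Sample"), ("data", 10), ("close", "Sample"),
    ("close", "Project"),
]

# One stage: sum the direct data items of every "Sample" collection.
summed = list(pipeline(demo_stream, ("Sample", sum)))
```

In this sketch, reusing `sum` in a different workflow means changing only the scope string, and adding or removing a stage does not require rewiring the others, which is the property the abstract attributes to linear COMAD pipelines.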
B. Ludascher et al., "Scientific workflow design 2.0: Demonstrating streaming data collections in Kepler," 2011 IEEE 27th International Conference on Data Engineering (ICDE), Hannover, Germany, 2011, pp. 1296-1299.