BioExtract Server—An Integrated Workflow-Enabling System to Access and Analyze Heterogeneous, Distributed Biomolecular Data
January-March 2010 (vol. 7 no. 1)
pp. 12-24
Carol Lushbough, University of South Dakota, Vermillion
Michael K. Bergman, VisualMetrics Corporation, Coralville
Carolyn J. Lawrence, USDA-ARS Iowa State University, Ames
Doug Jennewein, University of South Dakota, Vermillion
Volker Brendel, Iowa State University, Ames
Many in silico investigations in bioinformatics require access to multiple, distributed data sources and analytic tools. The requisite data sources may include large public data repositories, community databases, and project databases for use in domain-specific research. Different data sources frequently use distinct query languages and return results in unique formats, so researchers must either rely on a small number of primary data sources or become familiar with multiple query languages and formats. Similarly, the associated analytic tools often require specific input formats and produce unique outputs, which makes it difficult to use the output from one tool as input to another. The BioExtract Server is a Web-based data integration application designed to consolidate, analyze, and serve data from heterogeneous biomolecular databases in the form of a mash-up. The basic operations of the BioExtract Server allow researchers, via their Web browsers, to specify data sources, flexibly query data sources, apply analytic tools, download result sets, and store query results for later reuse. As a researcher works with the system, their "steps" are saved in the background. At any time, these steps can be preserved long-term as a workflow simply by providing a workflow name and description.
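
The step-recording idea described in the abstract can be illustrated with a minimal Python sketch. It is not the BioExtract Server's implementation: the class names (`Step`, `Workflow`, `Session`), the action labels, and the parameter values are all hypothetical and chosen only to show how background-recorded steps might be frozen into a named workflow.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded user action (hypothetical structure)."""
    action: str
    params: dict


@dataclass
class Workflow:
    """A named, described snapshot of recorded steps (hypothetical)."""
    name: str
    description: str
    steps: list


@dataclass
class Session:
    """Records each operation in the background as the user works."""
    steps: list = field(default_factory=list)

    def record(self, action, **params):
        # Append the action and its parameters to the running history.
        self.steps.append(Step(action, params))

    def save_as_workflow(self, name, description):
        # Copy the list so later actions do not alter the saved workflow.
        return Workflow(name, description, list(self.steps))


# Hypothetical session: choose sources, query, run a tool, then save.
session = Session()
session.record("select_sources", sources=["source_a", "source_b"])
session.record("query", text="example query")
session.record("run_tool", tool="example_alignment_tool")
workflow = session.save_as_workflow("demo", "Query two sources, then align")

# Work continues after saving; the saved workflow is unaffected.
session.record("download", fmt="fasta")
```

Copying the step list at save time is one possible design choice: it keeps a saved workflow stable while the researcher keeps working in the same session.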


Index Terms:
Bioinformatics (genome or protein) databases, data integration, distributed architectures, heterogeneous databases, mash-up, scientific workflow automation.
Carol Lushbough, Michael K. Bergman, Carolyn J. Lawrence, Doug Jennewein, Volker Brendel, "BioExtract Server—An Integrated Workflow-Enabling System to Access and Analyze Heterogeneous, Distributed Biomolecular Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 12-24, Jan.-March 2010, doi:10.1109/TCBB.2008.98