EvolvingSpace: A Data Centric Framework for Integrating Bioinformatics Applications
June 2010 (vol. 59 no. 6)
pp. 721-734
Chen Wang, CSIRO ICT Center, Australia
Bing Bing Zhou, The University of Sydney, Sydney
Albert Y. Zomaya, The University of Sydney, Sydney
This paper presents EvolvingSpace, a data-centric distributed system intended to address the data and application integration problem in bioinformatics data centers. The system uses commodity PCs for both data storage and computation. EvolvingSpace manages data in a decentralized manner, which makes it convenient to store data annotations and can eliminate potential data-access bottlenecks. It indexes distributed data at multiple levels to facilitate the construction of complex workflows made up of applications that run on different types of data. In addition, the paper proposes a data-locality-aware and workflow-aware scheduling algorithm (ES-Scheduling) to balance data distribution with computing performance, and throughput with workflow response time. We ran extensive experiments on the system with real bioinformatics applications. The results show that the system runs integrated bioinformatics applications efficiently and scales well.

Index Terms:
Distributed systems, data sharing, workflow management, data models, scheduling, bioinformatics.
Chen Wang, Bing Bing Zhou, Albert Y. Zomaya, "EvolvingSpace: A Data Centric Framework for Integrating Bioinformatics Applications," IEEE Transactions on Computers, vol. 59, no. 6, pp. 721-734, June 2010, doi:10.1109/TC.2010.39