All-Pairs: An Abstraction for Data-Intensive Computing on Campus Grids
January 2010 (vol. 21, no. 1), pp. 33-46
Christopher Moretti, University of Notre Dame
Hoang Bui, University of Notre Dame
Karen Hollingsworth, University of Notre Dame
Brandon Rich, University of Notre Dame
Patrick Flynn, University of Notre Dame
Douglas Thain, University of Notre Dame
Today, campus grids provide users with easy access to thousands of CPUs. However, it is not always easy for nonexpert users to harness these systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very poor performance. To address this problem, we argue that campus grids should provide end users with high-level abstractions that allow for the easy expression and efficient execution of data-intensive workloads. We present one example of an abstraction, All-Pairs, that fits the needs of several applications in biometrics, bioinformatics, and data mining. We demonstrate that an optimized All-Pairs abstraction is easier to use than the underlying system, achieves performance orders of magnitude better than the obvious but naive approach, and is both faster and more efficient than a tuned conventional approach. This abstraction has been in production use for one year on a 500-CPU campus grid at the University of Notre Dame, where it has been used to carry out a groundbreaking analysis of biometric data.
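
The abstraction itself can be stated compactly: given a set A of m items, a set B of n items, and a function F, All-Pairs produces the m-by-n matrix M in which M[i][j] = F(A[i], B[j]). The Python sketch below illustrates only these serial semantics; it is not the authors' implementation or command-line interface, and the all_pairs and similarity functions and the sample data are hypothetical stand-ins. The production system described in the paper executes this same pattern in parallel across the campus grid.

from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def all_pairs(a: Sequence[T], b: Sequence[U],
              f: Callable[[T, U], float]) -> list[list[float]]:
    """Serial reference semantics of All-Pairs: M[i][j] = f(a[i], b[j])."""
    return [[f(x, y) for y in b] for x in a]

# Hypothetical comparison function, standing in for something like the
# iris-template similarity measures used in the biometrics workloads
# mentioned in the abstract.
def similarity(x: str, y: str) -> float:
    matches = sum(1 for cx, cy in zip(x, y) if cx == cy)
    return matches / max(len(x), len(y))

if __name__ == "__main__":
    A = ["GATTACA", "GATTZCA"]
    B = ["GATTACA", "AATTACA", "CCCCCCC"]
    for row in all_pairs(A, B, similarity):
        print(["%.2f" % v for v in row])

Replacing the nested loop above with batches of F-evaluations dispatched to grid nodes is, in essence, the design space the paper explores: how to choose data placement, task granularity, and dispatch strategy so that the same simple semantics run efficiently on shared resources.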

Index Terms:
All-Pairs, biometrics, cloud computing, data-intensive computing, grid computing.
Citation:
Christopher Moretti, Hoang Bui, Karen Hollingsworth, Brandon Rich, Patrick Flynn, Douglas Thain, "All-Pairs: An Abstraction for Data-Intensive Computing on Campus Grids," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 1, pp. 33-46, Jan. 2010, doi:10.1109/TPDS.2009.49