
Giuseppe Di Fatta, Michael R. Berthold, "Dynamic Load Balancing for the Distributed Mining of Molecular Structures," IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 8, pp. 773-785, Aug. 2006.
Publisher: IEEE Computer Society, Los Alamitos, CA, USA. ISSN: 1045-9219. DOI: http://doi.ieeecomputersociety.org/10.1109/TPDS.2006.101
Index Terms: Distributed computing, peer-to-peer computing, dynamic load balancing, subgraph mining, frequent patterns, biochemical databases, molecular compounds.
Abstract: In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multidomain, heterogeneous environments, such as computational grids.
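The abstract names receiver-initiated load balancing over an irregular search tree but gives no details of the scheme. As a rough illustration of the general idea only, and not the authors' algorithm, the following single-process simulation lets idle workers poll a random peer and take the older half of that peer's pending nodes. Everything in it is an assumption made for the sketch: the function names (`expand`, `mine_sequential`, `mine_balanced`), the synthetic tree with random fan-out standing in for support-based pruning, the round-based timing, and the "donate half" splitting rule.

```python
import random
from collections import deque

MAX_DEPTH = 10  # hypothetical depth limit of the synthetic search tree


def expand(node):
    """Return the children of a node in a synthetic, irregular search tree.

    Fan-out is fixed at 3 near the root and pseudo-random (0..3) below it,
    so subtree sizes cannot be predicted from the node alone. A fan-out of 0
    plays the role of a pruned branch.
    """
    depth, key = node
    if depth >= MAX_DEPTH:
        return []
    fanout = 3 if depth < 3 else random.Random(key).randint(0, 3)
    return [(depth + 1, 4 * key + i + 1) for i in range(fanout)]


def mine_sequential(root):
    """Depth-first traversal on one worker; returns the number of nodes visited."""
    stack, visited = [root], 0
    while stack:
        stack.extend(expand(stack.pop()))
        visited += 1
    return visited


def mine_balanced(root, n_workers, seed=0):
    """Round-based simulation of receiver-initiated load balancing.

    Returns (nodes visited per worker, number of rounds).
    """
    rng = random.Random(seed)
    stacks = [deque() for _ in range(n_workers)]
    stacks[0].append(root)  # all work starts on worker 0
    visited = [0] * n_workers
    rounds = 0
    while any(stacks):
        rounds += 1
        # Receiver-initiated step: each idle worker polls one random peer.
        for w in range(n_workers):
            if stacks[w]:
                continue
            donor = rng.randrange(n_workers)
            if donor != w and len(stacks[donor]) > 1:
                # The donor hands over the older half of its pending nodes,
                # which sit closer to the root of the tree.
                for _ in range(len(stacks[donor]) // 2):
                    stacks[w].append(stacks[donor].popleft())
        # Work step: every worker that has work expands one node.
        for w in range(n_workers):
            if stacks[w]:
                stacks[w].extend(expand(stacks[w].pop()))
                visited[w] += 1
    return visited, rounds


if __name__ == "__main__":
    root = (0, 0)
    total = mine_sequential(root)
    per_worker, rounds = mine_balanced(root, n_workers=4)
    print("nodes:", total, "per worker:", per_worker, "rounds:", rounds)
```

Because work moves only when an idle worker asks for it, busy workers pay no balancing cost while every worker is loaded. The ratio of sequential node count to simulated rounds gives a crude speedup estimate for this toy model; it says nothing about the speedups reported in the paper.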