Dynamic Load Balancing for the Distributed Mining of Molecular Structures
August 2006 (vol. 17 no. 8)
pp. 773-785

Abstract: In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a nondedicated computational environment. These features make it suitable for large-scale, multidomain, heterogeneous environments, such as computational grids.
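
As a rough illustration of the receiver-initiated idea named in the abstract, the Python sketch below simulates a few workers exploring an irregular search tree, where an idle worker (the receiver) asks a randomly chosen peer for work. It is not the authors' algorithm: the synthetic tree, the random donor selection, the rule of donating the oldest stack entry, and all function names are assumptions made for illustration only.

```python
import random
from collections import deque

def children(node, max_depth=8):
    """Hypothetical irregular search tree: the fan-out is pseudo-random per node."""
    depth, key = node
    if depth >= max_depth:
        return []
    # Fixed fan-out near the root, unpredictable fan-out below (assumption).
    fanout = 3 if depth < 2 else random.Random(key).choice([0, 0, 1, 2, 3])
    return [(depth + 1, key * 31 + i + 1) for i in range(fanout)]

def count_sequential(root):
    """Reference depth-first traversal on a single worker."""
    stack, visited = [root], 0
    while stack:
        node = stack.pop()
        visited += 1
        stack.extend(children(node))
    return visited

def simulate(root, n_workers=4, seed=0):
    """Round-based simulation; returns the number of nodes each worker expanded."""
    rng = random.Random(seed)
    stacks = [deque() for _ in range(n_workers)]
    stacks[0].append(root)  # all work starts on one worker
    processed = [0] * n_workers
    while any(stacks):
        for w in range(n_workers):
            if stacks[w]:
                node = stacks[w].pop()  # depth-first: expand the newest node
                processed[w] += 1
                stacks[w].extend(children(node))
            else:
                # Receiver-initiated: the idle worker asks a random peer for work.
                donor = rng.randrange(n_workers)
                if donor != w and len(stacks[donor]) > 1:
                    # The donor gives away its oldest node, which sits closest
                    # to the root and so tends to head a larger subtree.
                    stacks[w].append(stacks[donor].popleft())
    return processed

if __name__ == "__main__":
    root = (0, 1)
    print("nodes per worker:", simulate(root))
    print("sequential total:", count_sequential(root))
```

Because no workload prediction is assumed, work moves only when a worker runs dry; the per-worker counts printed by the sketch give a feel for how evenly such a scheme spreads an irregular tree.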

Index Terms:
Distributed computing, peer-to-peer computing, dynamic load balancing, subgraph mining, frequent patterns, biochemical databases, molecular compounds.
Giuseppe Di Fatta, Michael R. Berthold, "Dynamic Load Balancing for the Distributed Mining of Molecular Structures," IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 8, pp. 773-785, Aug. 2006, doi:10.1109/TPDS.2006.101