Issue No.02 - April-June (2009 vol.6)
pp: 344-352
Victor Olman, University of Georgia, Athens
Fenglou Mao, University of Georgia, Athens
Hongwei Wu, University of Georgia, Athens
Ying Xu, University of Georgia, Athens
ABSTRACT
Large bioinformatics data sets make cluster identification computationally expensive, which motivates a parallel algorithm for identifying dense clusters in a noisy background. Our algorithm works on a graph representation of the data set to be analyzed and identifies clusters as densely intraconnected subgraphs. We employ a minimum spanning tree (MST) representation of the graph and solve the cluster identification problem using this representation. The computational bottleneck of our algorithm is the construction of an MST of the graph, for which a parallel algorithm is employed. Our high-level strategy for the parallel MST construction is to first partition the graph, then construct MSTs for the partitioned subgraphs and for auxiliary bipartite graphs based on the subgraphs, and finally merge these MSTs to derive an MST of the original graph. Computational results indicate that, when running on 150 CPUs, our algorithm can solve a cluster identification problem on a data set with 1,000,000 data points almost 100 times faster than on a single CPU, indicating that the program can handle very large data clustering problems efficiently. We have implemented the clustering algorithm as the software CLUMP.
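The partition-then-merge strategy described above can be illustrated with a minimal sequential sketch. It is not the CLUMP implementation: the function names (`kruskal`, `partitioned_mst`, `direct_mst`), the round-robin partitioning, the use of Euclidean distance on a complete graph, and the choice of Kruskal's algorithm for every step are all assumptions made for illustration. The sketch relies on a standard property of MSTs: an edge discarded by the MST of any subgraph can also be discarded from the full graph, so the union of the subgraph MSTs and the bipartite MSTs still contains an MST of the original graph.

```python
import itertools
import math
import random


def kruskal(num_nodes, edges):
    """Return the minimum spanning forest of (weight, u, v) edges."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def dist(points, u, v):
    # Assumption: edge weight is the Euclidean distance between data points.
    return math.dist(points[u], points[v])


def partitioned_mst(points, num_parts):
    """MST of the complete graph on `points`, built by partition and merge."""
    n = len(points)
    # Hypothetical partitioning scheme: round-robin assignment of vertices.
    parts = [list(range(i, n, num_parts)) for i in range(num_parts)]

    candidates = []
    # Step 1: MST of the complete subgraph on each part.
    for part in parts:
        edges = [(dist(points, u, v), u, v)
                 for u, v in itertools.combinations(part, 2)]
        candidates.extend(kruskal(n, edges))
    # Step 2: MST of the bipartite graph between each pair of parts.
    for a, b in itertools.combinations(parts, 2):
        edges = [(dist(points, u, v), u, v) for u in a for v in b]
        candidates.extend(kruskal(n, edges))
    # Step 3: merge by running Kruskal on the much smaller candidate set.
    return kruskal(n, candidates)


def direct_mst(points):
    """Reference MST computed on the full edge list in one pass."""
    n = len(points)
    edges = [(dist(points, u, v), u, v)
             for u, v in itertools.combinations(range(n), 2)]
    return kruskal(n, edges)


if __name__ == "__main__":
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(200)]
    merged = partitioned_mst(pts, 4)
    reference = direct_mst(pts)
    # Both trees should have n - 1 edges and the same total weight.
    print(len(merged), len(reference))
    print(math.isclose(sum(w for w, _, _ in merged),
                       sum(w for w, _, _ in reference)))
```

Each subgraph and bipartite MST in steps 1 and 2 depends only on its own vertices, so in a parallel setting these computations could be assigned to separate workers, with only the final merge run on the combined candidate edges.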
INDEX TERMS
Pattern recognition, clustering algorithm, genome application, parallel processing.
CITATION
Victor Olman, Fenglou Mao, Hongwei Wu, Ying Xu, "Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.6, no. 2, pp. 344-352, April-June 2009, doi:10.1109/TCBB.2007.70272