Issue No.02 - March/April (2012 vol.9)
pp: 408-420
S. Mimaroglu, Dept. of Comput. Eng., Bahcesehir Univ., Istanbul, Turkey
E. Aksehirli, Dept. of Comput. Eng., Bahcesehir Univ., Istanbul, Turkey
ABSTRACT
Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard task, as attested by the hundreds of clustering algorithms in the literature. Each clustering technique makes some assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, and in some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods to the same data set, or the same method with varying input parameters, or both. We propose a novel method, DICLENS, which combines a set of clusterings into a final clustering having better overall quality. Our method produces the final clustering automatically and does not take any input parameters, a feature missing in many existing algorithms. Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DICLENS produces very good quality clusterings in a short amount of time. The DICLENS implementation is scalable and runs on standard personal computers, consuming very little memory and CPU.
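The idea of combining several clusterings of the same data set can be illustrated with a small evidence-accumulation sketch. This is not the DICLENS algorithm, whose details are not given in the abstract. It is a minimal illustration under stated assumptions: the function names `co_association` and `consensus_by_threshold` are hypothetical, and the fixed `threshold` parameter is a simplification that DICLENS itself avoids, since it takes no input parameters.

```python
def co_association(clusterings):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    input clusterings that place objects i and j in the same cluster."""
    n = len(clusterings[0])
    m = len(clusterings)
    counts = [[0] * n for _ in range(n)]
    for labels in clusterings:
        if len(labels) != n:
            raise ValueError("all clusterings must label the same objects")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_by_threshold(matrix, threshold=0.5):
    """Illustrative consensus step (an assumption, not DICLENS): link two
    objects when they co-occur in more than `threshold` of the clusterings,
    then return the connected components as the final clustering."""
    n = len(matrix)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three clusterings of five objects; label values are arbitrary per clustering.
ensemble = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 2],
    [0, 0, 0, 1, 1],
]
matrix = co_association(ensemble)
print(consensus_by_threshold(matrix))  # [0, 0, 1, 1, 1]
```

In this toy ensemble, objects 0 and 1 are grouped together by all three clusterings, while objects 2, 3, and 4 are linked through pairs that agree in two of the three, so the consensus has two clusters.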
INDEX TERMS
Clustering algorithms, Gene expression, Clustering methods, Partitioning algorithms, Bioinformatics, Computational biology, Software algorithms, Gene expressions, Clustering, Combining multiple clusterings, Cluster ensembles, Consensus clustering, Evidence accumulation, Minimum spanning tree
CITATION
S. Mimaroglu and E. Aksehirli, "DICLENS: Divisive Clustering Ensemble with Automatic Cluster Number," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 408-420, March/April 2012, doi:10.1109/TCBB.2011.129