Issue No.02 - April-June (2010 vol.7)
pp: 223-237
Gunjan Gupta, University of Texas at Austin
Alexander Liu, University of Texas at Austin
Joydeep Ghosh, University of Texas at Austin
ABSTRACT
A key application of clustering data obtained from sources such as microarrays, protein mass spectroscopy, and phylogenetic profiles is the detection of functionally related genes. Typically, only a small number of functionally related genes cluster into one or more groups, and the rest need to be ignored. For such situations, we present Automated Hierarchical Density Shaving (Auto-HDS), a framework that consists of a fast hierarchical density-based clustering algorithm and an unsupervised model selection strategy. Auto-HDS can automatically select clusters of different densities, present them in a compact hierarchy, and rank individual clusters using an innovative stability criterion. Our framework also provides a simple yet powerful 2D visualization of the hierarchy of clusters that is useful for further interactive exploration. We present results on the Gasch and Lee microarray data sets to show the effectiveness of our methods. Additional results on other biological data are included in the supplemental material.
INDEX TERMS
Mining methods and algorithms, data and knowledge visualization, clustering, bioinformatics.
CITATION
Gunjan Gupta, Alexander Liu, Joydeep Ghosh, "Automated Hierarchical Density Shaving: A Robust Automated Clustering and Visualization Framework for Large Biological Data Sets", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 2, pp. 223-237, April-June 2010, doi:10.1109/TCBB.2008.32