Strategies for Identifying Statistically Significant Dense Regions in Microarray Data
July-September 2007 (vol. 4 no. 3)
pp. 415-429
We propose and study the notion of dense regions for the analysis of categorized gene expression data and present search algorithms for discovering them. The algorithms can be applied to any categorical data matrix derived from a gene expression level matrix. We demonstrate that dense regions are simple but useful and statistically significant patterns that can be used to 1) identify genes and/or samples of interest and 2) eliminate genes and/or samples corresponding to outliers, noise, or abnormalities. Theoretical studies of the properties of dense regions are presented, which allow us to characterize dense regions into several classes and to derive tailor-made algorithms for different classes of regions. Moreover, an empirical simulation study of the distribution of the size of dense regions is carried out; its results are used to assess the significance of dense regions and to derive effective pruning methods that speed up the search algorithms. Real microarray data sets are employed to test our methods. Comparisons with six other well-known clustering algorithms on synthetic and real data are also conducted, and they confirm the superiority of our methods in discovering dense regions. The DRIFT code and a tutorial are available as supplemental material, which can be found on the Computer Society Digital Library at

Index Terms:
Dense region, clustering, categorical data, bicluster, microarray, gene expression, coexpressed genes.
Andy M. Yip, Michael K. Ng, Edmond H. Wu, Tony F. Chan, "Strategies for Identifying Statistically Significant Dense Regions in Microarray Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 415-429, July-Sept. 2007, doi:10.1109/TCBB.2007.1022