Issue No.01 - Jan.-Feb. (2013 vol.10)
pp: 161-172
Itziar Irigoien, Dept. of Comput. Sci. & Artificial Intell., Univ. of the Basque Country, Donostia, Spain
Francesc Mestres, Dept. of Genetics, Univ. of Barcelona, Barcelona, Spain
Concepcion Arenas, Dept. of Stat., Univ. of Barcelona, Barcelona, Spain
This paper presents a solution to the problem of identifying the units in a group or cluster that have the greatest degree of centrality and best characterize that group. This problem frequently arises in the classification of data such as tumor types, gene expression profiles, or general biomedical data. It is particularly important in the common situation where many units do not properly belong to any cluster. Furthermore, in gene expression data classification, reliable identification of the most central units in a cluster makes it possible to recognize the most important samples in a particular pathological process. We propose a new depth function that allows us to identify central units. Because our approach is based on a measure of distance or dissimilarity between any pair of units, it can be applied to any kind of multivariate data (continuous, binary, or multiattribute data). It is therefore valuable in many biomedical applications, which usually involve noncontinuous data, such as clinical, pathological, or biological data sources. We validate the approach using artificial examples and apply it to empirical data. The results indicate that the approach performs well.
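To make the idea of distance-based centrality concrete, the following is a minimal sketch and not the depth function proposed in the paper, which the abstract does not define. It assumes the standard definitions of two quantities named in the keywords: the geometric variability V = (1/(2n^2)) * sum of squared pairwise dissimilarities, and the proximity of a unit to a group, taken as its mean squared dissimilarity to the group members minus V. The function names, the toy data, and the choice to rank units by smallest proximity are illustrative assumptions.

```python
import numpy as np

def geometric_variability(d2):
    """V = (1 / (2 n^2)) * sum of all squared pairwise dissimilarities."""
    n = d2.shape[0]
    return d2.sum() / (2.0 * n * n)

def proximity(d2):
    """Mean squared dissimilarity from each unit to the group, minus V."""
    return d2.mean(axis=1) - geometric_variability(d2)

def rank_by_centrality(d2):
    """Unit indices ordered from most to least central (smallest proximity first)."""
    return np.argsort(proximity(d2), kind="stable")

# Toy group (hypothetical data): four points on a unit square plus one distant unit.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [6.0, 6.0]])

# Squared Euclidean dissimilarities; any other dissimilarity matrix
# (for binary or mixed data, for example) could be supplied instead.
diff = points[:, None, :] - points[None, :, :]
d2 = (diff ** 2).sum(axis=2)

order = rank_by_centrality(d2)
print("units from most to least central:", order)
```

Only the matrix of squared dissimilarities enters the computation, which is why a distance-based approach of this kind is not restricted to continuous data. With Euclidean distances, the proximity computed here reduces to the squared distance from each unit to the group centroid.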
Kernel, Gaussian distribution, Gene expression, Bioinformatics, Computational biology, Covariance matrix, Context, Gene expression data, Cluster analysis, Data depth, Depth function, Central unit, Geometric variability, Proximity function
Itziar Irigoien, Francesc Mestres, Concepcion Arenas, "The Depth Problem: Identifying the Most Representative Units in a Data Group", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, no. 1, pp. 161-172, Jan.-Feb. 2013, doi:10.1109/TCBB.2012.147