Issue No.02 - April-June (2010 vol.7)
pp: 375-381
Giora Unger, Tel-Aviv University, Tel-Aviv
Benny Chor, Tel-Aviv University, Tel-Aviv
ABSTRACT
We study simple geometric properties of gene expression data sets in which samples are taken from two distinct classes (e.g., two types of cancer). Specifically, we investigate linear separability for pairs of genes. If a pair of genes exhibits linear separation with respect to the two classes, then the joint expression level of the two genes is strongly correlated with whether a sample was taken from one class or the other. This may indicate an underlying molecular mechanism relating the two genes and the phenomenon (e.g., a specific cancer). We developed and implemented novel, efficient algorithmic tools for finding all pairs of genes that induce a linear separation of the two sample classes. These tools are based on computational-geometric properties and were applied to 10 publicly available cancer data sets. For each data set, we computed the actual number of separating pairs and compared it both to an upper bound on the number expected by chance and to the numbers obtained empirically by randomly shuffling the sample labels. Seven of the 10 data sets are highly separable. This phenomenon is statistically highly significant, that is, very unlikely to occur at random. It is therefore reasonable to expect that it reflects a functional association between the separating genes and the underlying phenotypic classes.
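To make the notion of a separating gene pair concrete, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not describe in detail. It relies on a standard fact: two finite planar point sets are strictly separable by a line exactly when all difference vectors b - a (a from one class, b from the other) fit in an open half-plane. The function names, the 0/1 label encoding, and the genes-by-samples list layout are assumptions made for illustration. The test uses floating-point angles and is brute force over all gene pairs, so it suits only small examples.

```python
import math
import random
from itertools import combinations


def is_linearly_separable(points_a, points_b):
    """Return True if some line strictly separates the two 2D point sets."""
    angles = []
    for ax, ay in points_a:
        for bx, by in points_b:
            dx, dy = bx - ax, by - ay
            if dx == 0 and dy == 0:
                return False  # a shared point rules out strict separation
            angles.append(math.atan2(dy, dx))
    if not angles:
        return True  # an empty class is trivially separable
    angles.sort()
    # Circular gaps between consecutive difference-vector directions.
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(angles[0] + 2 * math.pi - angles[-1])
    # All directions fit in an open half-plane iff some gap exceeds pi.
    return max(gaps) > math.pi


def separating_pairs(expr, labels):
    """List gene pairs (g, h) whose joint expression separates the classes.

    expr[g][i] is the expression of gene g in sample i (assumed layout);
    labels[i] is 0 or 1 (assumed encoding of the two classes).
    """
    idx_a = [i for i, lab in enumerate(labels) if lab == 0]
    idx_b = [i for i, lab in enumerate(labels) if lab == 1]
    pairs = []
    for g, h in combinations(range(len(expr)), 2):
        pts_a = [(expr[g][i], expr[h][i]) for i in idx_a]
        pts_b = [(expr[g][i], expr[h][i]) for i in idx_b]
        if is_linearly_separable(pts_a, pts_b):
            pairs.append((g, h))
    return pairs


def shuffled_counts(expr, labels, n_shuffles=100, seed=0):
    """Count separating pairs under random relabelings of the samples."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_shuffles):
        perm = list(labels)
        rng.shuffle(perm)  # keeps class sizes, breaks the label/sample link
        counts.append(len(separating_pairs(expr, perm)))
    return counts
```

Comparing `len(separating_pairs(expr, labels))` with the distribution returned by `shuffled_counts(expr, labels)` mirrors the empirical comparison mentioned in the abstract: if the true labels yield far more separating pairs than shuffled labels do, chance is an unlikely explanation.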
INDEX TERMS
Gene expression analysis, DNA microarrays, diagnosis, linear separation.
CITATION
Giora Unger, Benny Chor, "Linear Separability of Gene Expression Data Sets", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 2, pp. 375-381, April-June 2010, doi:10.1109/TCBB.2008.90