A Mathematical Model for the Validation of Gene Selection Methods
September/October 2011 (vol. 8 no. 5)
pp. 1385-1392
Marco Muselli, Consiglio Nazionale delle Ricerche, Genova
Alberto Bertoni, Università degli Studi di Milano, Milano
Marco Frasca, Università degli Studi di Milano, Milano
Alessandro Beghini, Università degli Studi di Milano, Milano
Francesca Ruffino, Università degli Studi di Milano, Milano
Giorgio Valentini, Università degli Studi di Milano, Milano
Gene selection methods aim at determining biologically relevant subsets of genes in DNA microarray experiments. However, their assessment and validation represent a major difficulty, since the subset of biologically relevant genes is usually unknown. To solve this problem, a novel procedure for generating biologically plausible synthetic gene expression data is proposed. It is based on a proper mathematical model representing gene expression signatures and expression profiles through Boolean threshold functions. The results show that the proposed procedure can be successfully adopted to analyze the quality of statistical and machine learning-based gene selection algorithms.
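
As a rough illustration of what a Boolean threshold function is, the following Python sketch labels synthetic samples by checking whether enough genes in a chosen signature are active. It is not the authors' model or generation procedure: the function names, the unit weights, the uniform random activity pattern, and the Gaussian expression values (mean 2.0 for active genes, 0.0 otherwise, standard deviation 0.5) are all assumptions made for the example.

```python
import random


def threshold_function(bits, weights, theta):
    """Boolean threshold function: returns 1 when the weighted sum
    of the binary inputs reaches the threshold theta, else 0."""
    return int(sum(w * b for w, b in zip(weights, bits)) >= theta)


def make_synthetic_dataset(n_samples, n_genes, signature, theta, seed=0):
    """Generate toy expression data whose class label depends only on
    the genes listed in `signature` (hypothetical setup, see lead-in)."""
    rng = random.Random(seed)
    weights = [1] * len(signature)  # assumption: every signature gene counts equally
    data, labels = [], []
    for _ in range(n_samples):
        # Assumption: each gene is independently active with probability 0.5.
        active = [rng.randint(0, 1) for _ in range(n_genes)]
        # The label is a threshold function of the signature genes only.
        label = threshold_function([active[g] for g in signature], weights, theta)
        # Assumption: active genes are expressed around 2.0, inactive around 0.0.
        expression = [rng.gauss(2.0 if a else 0.0, 0.5) for a in active]
        data.append(expression)
        labels.append(label)
    return data, labels


if __name__ == "__main__":
    # 20 samples, 50 genes; only genes 0, 1 and 2 determine the class.
    X, y = make_synthetic_dataset(20, 50, signature=[0, 1, 2], theta=2)
    print(len(X), len(X[0]), sum(y))
```

Because the relevant genes are known by construction (here, the indices in `signature`), the genes returned by a selection method can be compared directly against them, which is the kind of validation the abstract describes.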


Index Terms:
Gene selection, feature selection, mathematical models, gene expression, Boolean functions.
Marco Muselli, Alberto Bertoni, Marco Frasca, Alessandro Beghini, Francesca Ruffino, Giorgio Valentini, "A Mathematical Model for the Validation of Gene Selection Methods," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 5, pp. 1385-1392, Sept.-Oct. 2011, doi:10.1109/TCBB.2010.83