Issue No.03 - March (2014 vol.26)
pp: 667-681
Yuhai Zhao , Northeastern University, Shenyang
Guoren Wang , Northeastern University, Shenyang
Xiang Zhang , Case Western Reserve University, Cleveland
Jeffrey Xu Yu , Chinese University of Hong Kong, Hong Kong
Zhanghui Wang , Northeastern University, Shenyang
ABSTRACT
Advanced microarray technologies have made it possible to monitor the expression levels of all genes simultaneously. An important problem in microarray data analysis is to discover phenotype structures. The goal is to 1) find groups of samples corresponding to different phenotypes (such as disease or normal), and 2) for each group of samples, find the representative expression pattern or signature that distinguishes this group from others. Some methods have been proposed for this problem; however, a common drawback is that the identified signatures often include a large number of genes yet have low discriminative power. In this paper, we propose a $(g^\ast)$-sequence model to address this limitation, which exploits the ordered expression values among genes. Compared with existing methods, the proposed sequence model is more robust to noise and discovers signatures with more discriminative power using fewer genes. This is important for subsequent analysis by biologists. We prove that the problem of phenotype structure discovery is NP-complete. An efficient algorithm, FINDER, is developed, which includes three steps: 1) identification of trivial $(g^\ast)$-sequences, 2) phenotype structure discovery, and 3) refinement. Effective pruning strategies are developed to further improve efficiency. We evaluate the performance of FINDER and existing methods using both synthetic and real gene expression data sets. Extensive experimental results show that FINDER dramatically improves the accuracy of the discovered phenotype structures (in terms of both statistical and biological significance) and detects signatures with high discriminative power. Moreover, it is orders of magnitude faster than the alternatives.
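The abstract does not define the $(g^\ast)$-sequence model, so the following is only a rough illustration of what using "ordered expression values among genes" can look like, not the authors' method. It assumes a hypothetical representation in which each sample maps gene names to expression values, and it collects the samples whose values strictly increase along a given gene order. The function names, gene names, and data are all invented for the example.

```python
from typing import Dict, List, Sequence


def follows_order(sample: Dict[str, float], genes: Sequence[str]) -> bool:
    """Return True if the sample's expression values strictly increase along `genes`."""
    values = [sample[g] for g in genes]
    return all(a < b for a, b in zip(values, values[1:]))


def supporting_samples(
    samples: Dict[str, Dict[str, float]], genes: Sequence[str]
) -> List[str]:
    """Names of the samples whose expression values respect the given gene order."""
    return [name for name, expr in samples.items() if follows_order(expr, genes)]


# Toy data (hypothetical): three samples, three genes.
samples = {
    "s1": {"gA": 0.2, "gB": 1.1, "gC": 3.0},
    "s2": {"gA": 0.5, "gB": 2.0, "gC": 2.4},
    "s3": {"gA": 2.9, "gB": 1.0, "gC": 0.3},
}

# Samples s1 and s2 share the order gA < gB < gC; s3 follows the reverse order.
group = supporting_samples(samples, ["gA", "gB", "gC"])
```

Because only the relative order of values matters in this sketch, multiplying one sample's measurements by a positive constant does not change which orders it supports. That is one intuition for why order-based patterns can tolerate some kinds of measurement noise.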
INDEX TERMS
Microarray data, data mining, bioinformatics
CITATION
Yuhai Zhao, Guoren Wang, Xiang Zhang, Jeffrey Xu Yu, Zhanghui Wang, "Learning Phenotype Structure Using Sequence Model", IEEE Transactions on Knowledge & Data Engineering, vol.26, no. 3, pp. 667-681, March 2014, doi:10.1109/TKDE.2013.31