Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes
January/February 2012 (vol. 9 no. 1)
pp. 294-304
Yang Xiang, Dept. of Biomed. Inf., Ohio State Univ., Columbus, OH, USA
P. R. O. Payne, Dept. of Biomed. Inf., Ohio State Univ., Columbus, OH, USA
Kun Huang, Dept. of Biomed. Inf., Ohio State Univ., Columbus, OH, USA
Binary (0,1) matrices, commonly known as transactional databases, can represent many kinds of application data, including gene-phenotype data, where "1" represents a confirmed gene-phenotype relation and "0" represents an unknown relation. It is natural to ask what information is hidden behind these "0"s and "1"s. Unfortunately, recent matrix completion methods, though very effective in many cases, are less likely to infer something interesting from these (0,1)-matrices. To address this challenge, we propose IndEvi, a succinct and effective algorithm that performs independent-evidence-based transactional database transformation. Each entry of a (0,1)-matrix is evaluated by the "independent evidence" (maximal supporting patterns) extracted for it from the whole matrix. The value of an entry, whether 0 or 1, has no effect on its independent evidence. An experiment on a gene-phenotype database shows that our method is highly promising for ranking candidate genes and predicting unknown disease genes.
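The abstract does not spell out how independent evidence is scored, so the following is only a toy sketch of the general idea and not the authors' algorithm. It assumes that a supporting pattern for entry (i, j) is an all-ones block formed by other rows that have a 1 in column j and other columns that have a 1 in row i, and that the score is the area of the largest such block. The names `evidence_score` and `transform`, the area-based score, and the brute-force search are all assumptions made for illustration.

```python
from itertools import combinations


def evidence_score(matrix, i, j):
    """Area of the largest all-ones block that supports entry (i, j).

    Assumption (not from the paper): a supporting block uses rows r != i
    with matrix[r][j] == 1 and columns c != j with matrix[i][c] == 1,
    and every cell of the block is 1. matrix[i][j] itself is never read.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    rows = [r for r in range(n_rows) if r != i and matrix[r][j] == 1]
    cols = [c for c in range(n_cols) if c != j and matrix[i][c] == 1]
    best = 0
    # Brute force over column subsets: exponential, suitable for toy data only.
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):
            support = [r for r in rows
                       if all(matrix[r][c] == 1 for c in subset)]
            best = max(best, len(support) * k)
    return best


def transform(matrix):
    """Replace every 0/1 entry with its evidence score."""
    return [[evidence_score(matrix, i, j) for j in range(len(matrix[0]))]
            for i in range(len(matrix))]


# Hypothetical gene-by-phenotype matrix (rows: genes, columns: phenotypes).
genes_by_phenotypes = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
]
scores = transform(genes_by_phenotypes)
```

Because the score for (i, j) is computed without reading `matrix[i][j]`, flipping that entry leaves its score unchanged, which mirrors the independence property stated in the abstract. Unknown ("0") entries with high scores would be the candidates to rank first under this sketch.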


Index Terms:
genetics, bioinformatics, diseases, maximal supporting patterns, transactional database transformation, human disease genes, binary (0,1) matrices, gene-phenotype database, matrix completion methods, independent evidence, Diseases, Bipartite graph, Itemsets, Bioinformatics, Feature extraction, Data mining, matrix completion, Transactional database, binary matrix, frequent item set mining, maximal biclique, phenotype, disease gene, prioritization
Yang Xiang, P. R. O. Payne, Kun Huang, "Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 1, pp. 294-304, Jan.-Feb. 2012, doi:10.1109/TCBB.2011.58