Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships: A Comparative Study of Algorithms
January-March 2005 (vol. 2 no. 1)
pp. 62-76
Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE abstracts for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as the basis for clustering in a new approach called BEA-PARTITION. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and the hierarchical clustering algorithm outperformed k-means clustering and the self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and the self-organizing map. Whereas BEA-PARTITION and hierarchical clustering produced clusters of similar quality, BEA-PARTITION provides clearer cluster boundaries than hierarchical clustering.
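The purity and entropy measures mentioned above can be illustrated with a minimal sketch. It uses common textbook definitions, which may differ from the paper's exact formulas (for example in how entropy is normalized). The gene identifiers, class labels, and function names are hypothetical and serve only as a toy example.

```python
from collections import Counter
from math import log2


def purity(clusters, labels):
    """Fraction of items that fall in the majority class of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(
        max(Counter(labels[g] for g in c).values()) for c in clusters if c
    )
    return majority / total


def entropy(clusters, labels):
    """Size-weighted average of the class entropy (in bits) of each cluster."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue
        counts = Counter(labels[g] for g in c)
        h = -sum((n / len(c)) * log2(n / len(c)) for n in counts.values())
        result += (len(c) / total) * h
    return result


# Hypothetical toy data: four genes with known classes A and B.
toy_labels = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
mixed = [["g1", "g2", "g3"], ["g4"]]      # one gene is misplaced
perfect = [["g1", "g2"], ["g3", "g4"]]    # matches the known classes

print(purity(mixed, toy_labels), entropy(mixed, toy_labels))
print(purity(perfect, toy_labels), entropy(perfect, toy_labels))
```

Under these definitions a clustering that matches the known classes has purity 1 and entropy 0, so higher purity and lower entropy both indicate better agreement with the reference grouping.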
BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.
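As a rough illustration of the starting point, the sketch below shows the ordering step of the classic Bond Energy Algorithm in its usual database-design formulation: columns of a symmetric affinity matrix are placed one at a time at the position that maximizes their bond with neighboring columns. It does not reproduce the modifications or the partitioning step of BEA-PARTITION. The function names and the toy gene-by-gene affinity matrix are assumptions made for this example.

```python
def bond(aff, x, y):
    """Bond between columns x and y: the inner product of the two columns.

    None stands for a virtual boundary column, whose bond is zero.
    """
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aff)


def contribution(aff, left, new, right):
    """Net gain from placing column `new` between `left` and `right`."""
    return (2 * bond(aff, left, new)
            + 2 * bond(aff, new, right)
            - 2 * bond(aff, left, right))


def bea_order(aff):
    """Greedy BEA ordering of the columns of a square affinity matrix."""
    n = len(aff)
    order = list(range(min(n, 2)))  # seed with the first two columns
    for new in range(2, n):
        best_pos, best_gain = 0, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = contribution(aff, left, new, right)
            if best_gain is None or gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, new)
    return order


# Hypothetical affinities: genes 0 and 2 share keywords, as do genes 1 and 3.
toy_affinity = [
    [5, 0, 4, 0],
    [0, 5, 0, 4],
    [4, 0, 5, 0],
    [0, 4, 0, 5],
]
toy_order = bea_order(toy_affinity)
print(toy_order)
```

On this toy matrix the ordering places genes 0 and 2 next to each other and genes 1 and 3 next to each other, so related genes form contiguous blocks that a later partitioning step could split into clusters.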

Index Terms:
Bond energy algorithm, microarray, MEDLINE, text analysis, cluster analysis, gene function.
Ying Liu, Shamkant B. Navathe, Jorge Civera, Venu Dasigi, Ashwin Ram, Brian J. Ciliax, Ray Dingledine, "Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships: A Comparative Study of Algorithms," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 1, pp. 62-76, Jan.-March 2005, doi:10.1109/TCBB.2005.14