Data Mining on DNA Sequences of Hepatitis B Virus
March/April 2011 (vol. 8 no. 2)
pp. 428-440
Kwong-Sak Leung, The Chinese University of Hong Kong, Hong Kong
Kin Hong Lee, The Chinese University of Hong Kong, Hong Kong
Jin-Feng Wang, The Chinese University of Hong Kong, Hong Kong
Eddie Y.T. Ng, The Chinese University of Hong Kong, Hong Kong
Henry L.Y. Chan, The Chinese University of Hong Kong, Hong Kong
Stephen K.W. Tsui, The Chinese University of Hong Kong, Hong Kong
Tony S.K. Mok, The Chinese University of Hong Kong, Hong Kong
Pete Chi-Hang Tse, The Chinese University of Hong Kong, Hong Kong
Joseph Jao-Yiu Sung, The Chinese University of Hong Kong, Hong Kong
Extracting meaningful information from large experimental data sets is a key element of bioinformatics research. One challenge is to identify genomic markers in Hepatitis B Virus (HBV) that are associated with the development of hepatocellular carcinoma (HCC, liver cancer) by comparing complete HBV genomic sequences from patients with and without HCC. This study introduces a data mining framework comprising molecular evolution analysis, clustering, feature selection, classifier learning, and classification. Our research group collected HBV DNA sequences, of either genotype B or C, from over 200 patients specifically for this project. In the molecular evolution analysis and clustering, three subgroups were identified in genotype C, and a clustering method was developed to separate them. In the feature selection process, potential markers are selected based on Information Gain for subsequent classifier learning. Meaningful rules are then learned by our algorithm, called Rule Learning, which is based on an Evolutionary Algorithm. A new classification method based on the Nonlinear Integral has also been developed. Its good performance comes from the use of the fuzzy measure and the associated nonlinear integral: the nonadditivity of the fuzzy measure reflects the importance of the feature attributes as well as their interactions. These two classifiers give explicit information on the importance of individual mutated sites and their interactions for the classification (potential causes of liver cancer in our case). A thorough comparison of these two methods with existing methods is detailed. Important mutation markers (sites) have been found for genotype B and for each of the genotype C subgroups C1, C2, and C3. Both classification methods have been applied to previously unseen examples for validation. The results show that the classification methods achieve more than 70 percent accuracy and 80 percent sensitivity on most data sets, which is considered high for an initial scanning method for liver cancer diagnosis.
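
The abstract states that candidate markers are selected by Information Gain before classifier learning. The following is a minimal sketch of that general idea, not the authors' implementation: it scores each column of a set of aligned sequences by how much knowing the base at that site reduces the entropy of the HCC/control label. The function names (`entropy`, `information_gain`, `rank_sites`) and the toy sequences are assumptions made purely for illustration and are not data from the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(column, labels):
    """Entropy reduction obtained by splitting the labels on one alignment column."""
    total = len(labels)
    remainder = 0.0
    for base in set(column):
        # Labels of the sequences that carry this base at the site
        subset = [lab for b, lab in zip(column, labels) if b == base]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def rank_sites(sequences, labels):
    """Return (site_index, gain) pairs sorted by decreasing information gain."""
    n_sites = len(sequences[0])  # assumes all sequences are aligned to equal length
    gains = [(i, information_gain([s[i] for s in sequences], labels))
             for i in range(n_sites)]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)

# Toy, made-up aligned sequences (hypothetical, NOT data from the study)
sequences = ["ACGT", "ACGA", "TCGT", "TCGA"]
labels = ["HCC", "HCC", "control", "control"]
ranking = rank_sites(sequences, labels)
print(ranking[0])  # the site whose base best separates the two groups
```

In this toy input only the first site differs consistently between the two groups, so it ranks first; a real pipeline would also need to handle gaps, ambiguous bases, and a cutoff for how many top-ranked sites to keep.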

Index Terms:
Data mining, DNA sequences of HBV, mutation sites, nonlinear integrals, rule learning, signed fuzzy measures.
Kwong-Sak Leung, Kin Hong Lee, Jin-Feng Wang, Eddie Y.T. Ng, Henry L.Y. Chan, Stephen K.W. Tsui, Tony S.K. Mok, Pete Chi-Hang Tse, Joseph Jao-Yiu Sung, "Data Mining on DNA Sequences of Hepatitis B Virus," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 2, pp. 428-440, March-April 2011, doi:10.1109/TCBB.2009.6