Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification
March/April 2011 (vol. 8 no. 2)
pp. 316-325
Sangyoon Oh, Ajou University, Suwon
Min Su Lee, Seoul National University, Seoul
Byoung-Tak Zhang, Seoul National University, Seoul
In biomedical data, class imbalance occurs frequently and leads to poor prediction performance on minority classes, because the trained classifiers are derived mostly from the majority class. In this paper, we describe an ensemble learning method combined with active example selection to address the imbalanced data problem. Our method consists of three key components: 1) an active example selection algorithm that chooses informative examples for training the classifier, 2) an ensemble learning method that combines the varied classifiers produced by active example selection, and 3) an incremental learning scheme that speeds up the iterative training procedure required by active example selection. We evaluate the method on six real-world imbalanced data sets from biomedical domains and show that it outperforms both random undersampling and an ensemble with undersampling. Compared with other approaches to the imbalanced data problem, our method improves AUC by 0.03 to 0.15 points.
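The three components listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the base classifier, the selection criterion, or the combination rule, so the sketch assumes a Gaussian naive Bayes base learner, selects the majority-class example that the current model scores as most minority-like, and averages the members' probabilities. All names (`IncrementalGaussianNB`, `train_member`, `train_ensemble`, `ensemble_score`) and parameters (`n_seed`, `n_rounds`, `n_members`) are hypothetical.

```python
import math
import random


class IncrementalGaussianNB:
    """Assumed base learner: Gaussian naive Bayes kept as running sums,
    so adding one example is a cheap update rather than a full retrain."""

    def __init__(self, n_features):
        # Per class: [count, per-feature sum, per-feature sum of squares].
        self.stats = {y: [0, [0.0] * n_features, [0.0] * n_features]
                      for y in (0, 1)}

    def update(self, x, y):
        s = self.stats[y]
        s[0] += 1
        for j, v in enumerate(x):
            s[1][j] += v
            s[2][j] += v * v

    def _log_joint(self, x, y):
        n, sums, sqs = self.stats[y]
        total = self.stats[0][0] + self.stats[1][0]
        logp = math.log((n + 1) / (total + 2))  # smoothed class prior
        for j, v in enumerate(x):
            mean = sums[j] / n
            var = max(sqs[j] / n - mean * mean, 1e-3)  # variance floor
            logp += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return logp

    def prob_positive(self, x):
        """Posterior probability of the minority class (label 1)."""
        a, b = self._log_joint(x, 1), self._log_joint(x, 0)
        m = max(a, b)
        ea, eb = math.exp(a - m), math.exp(b - m)
        return ea / (ea + eb)


def train_member(minority, majority, rng, n_seed=2, n_rounds=10):
    """Active example selection (assumed criterion): start from all minority
    examples plus a few random majority examples, then repeatedly add the
    majority example the current model finds most minority-like."""
    model = IncrementalGaussianNB(len(minority[0]))
    for x in minority:
        model.update(x, 1)
    pool = list(majority)
    rng.shuffle(pool)  # different seeds give different ensemble members
    for x in pool[:n_seed]:
        model.update(x, 0)
    pool = pool[n_seed:]
    for _ in range(min(n_rounds, len(pool))):
        hardest = max(range(len(pool)), key=lambda i: model.prob_positive(pool[i]))
        model.update(pool.pop(hardest), 0)  # incremental update
    return model


def train_ensemble(minority, majority, n_members=5, seed=0):
    rng = random.Random(seed)
    return [train_member(minority, majority, rng) for _ in range(n_members)]


def ensemble_score(models, x):
    """Assumed combination rule: average the members' probabilities."""
    return sum(m.prob_positive(x) for m in models) / len(models)
```

Calling `train_ensemble(minority, majority)` with two lists of equal-length feature vectors returns the member models, and `ensemble_score(models, x)` gives a minority-class score for a new example `x` that could be thresholded or ranked to compute AUC. Keeping the classifier as running sums is what makes each selection round cheap in this sketch; a learner without such an update would have to be retrained from scratch after every added example.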

Index Terms:
Bioinformatics, classification, interactive data exploration and discovery, mining methods and algorithms.
Sangyoon Oh, Min Su Lee, Byoung-Tak Zhang, "Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 2, pp. 316-325, March-April 2011, doi:10.1109/TCBB.2010.96