Issue No. 10 - Oct. 2012 (vol. 24)
pp. 1904-1916
Indranil Palit , University of Notre Dame, Notre Dame
Chandan K. Reddy , Wayne State University, Detroit
ABSTRACT
In this era of data abundance, it has become critical to process large volumes of data at much faster rates than ever before. Boosting is a powerful predictive modeling technique that has been used successfully in many real-world applications. However, because boosting is inherently sequential, making it scalable is nontrivial and demands new parallelized versions that can efficiently handle large-scale data. In this paper, we propose two parallel boosting algorithms, AdaBoost.PL and LogitBoost.PL, which allow multiple computing nodes to participate simultaneously in constructing a boosted ensemble classifier. The proposed algorithms are competitive with the corresponding serial versions in terms of generalization performance. We achieve a significant speedup because our approach does not require individual computing nodes to communicate with each other to share their data. In addition, the proposed approach allows the privacy of computations to be preserved in distributed environments. We used the MapReduce framework to implement our algorithms and evaluated their classification accuracy, speedup, and scaleup on a wide variety of synthetic and real-world data sets.
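The idea of workers that build boosted ensembles on their own data and never exchange examples can be illustrated with a small single-process sketch. It is not the authors' AdaBoost.PL or their MapReduce implementation: the abstract does not state how the per-node ensembles are combined, so the merge rule below (rank each worker's weak classifiers by weight, group classifiers of equal rank, average their weights, and let each group vote by majority) is an assumption made only for illustration. The decision-stump weak learner, the synthetic data, and all function names are likewise hypothetical.

```python
import math
import random

def train_stump(X, y, w):
    """Return the (feature, threshold, polarity) stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi[j] >= thr else -pol) != yi)
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best, best_err

def stump_predict(stump, x):
    j, thr, pol = stump
    return pol if x[j] >= thr else -pol

def adaboost(X, y, rounds):
    """Serial AdaBoost on one partition; returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def merge(ensembles):
    """Assumed reduce step: group stumps of equal weight rank and average their weights."""
    ranked = [sorted(e, key=lambda pair: pair[0]) for e in ensembles]
    merged = []
    for group in zip(*ranked):
        alpha = sum(a for a, _ in group) / len(group)
        merged.append((alpha, [s for _, s in group]))
    return merged

def predict(merged, x):
    score = 0.0
    for alpha, stumps in merged:
        vote = sum(stump_predict(s, x) for s in stumps)  # majority vote in the group
        score += alpha * (1 if vote > 0 else -1 if vote < 0 else 0)
    return 1 if score >= 0 else -1

# Synthetic two-feature data; the label depends only on the first feature.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(300)]
y = [1 if x[0] >= 0.5 else -1 for x in X]

# "Map": three simulated workers each see only their own partition.
workers, rounds = 3, 5
parts = [(X[i::workers], y[i::workers]) for i in range(workers)]
local_ensembles = [adaboost(Xp, yp, rounds) for Xp, yp in parts]

# "Reduce": only the trained classifiers are combined, never the examples.
merged = merge(local_ensembles)
accuracy = sum(predict(merged, xi) == yi for xi, yi in zip(X, y)) / len(X)
print(f"merged ensemble training accuracy: {accuracy:.3f}")
```

Because each simulated worker touches only its own partition and hands back nothing but trained classifiers, the sketch reflects the two properties the abstract emphasizes: no data is shared between nodes during training, and raw examples never leave the node that holds them.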
INDEX TERMS
Boosting, Prediction algorithms, Algorithm design and analysis, Convergence, Distributed databases, Training, Computational modeling, MapReduce, parallel algorithms, classification, distributed computing
CITATION
Indranil Palit, Chandan K. Reddy, "Scalable and Parallel Boosting with MapReduce", IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 10, pp. 1904-1916, Oct. 2012, doi:10.1109/TKDE.2011.208