2015 IEEE International Conference on Data Mining (ICDM) (2015)

Atlantic City, NJ, USA

Nov. 14, 2015 to Nov. 17, 2015

ISSN: 1550-4786

ISBN: 978-1-4673-9503-8

pp: 853-858

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2015.53

ABSTRACT

Large, sparse datasets with many missing values are common in the big data era. Naive Bayes is well suited to such datasets because its time and space complexity scale well with the number of non-missing values. However, several important questions about the behavior of naive Bayes remain unanswered. For example, how do different data missing mechanisms, data sparseness, and the number of attributes systematically affect the learning curves and convergence? Recent work on classifying large, sparse real-world datasets could not address these questions, mainly because it did not take the data missing mechanisms of these datasets into account. In this paper, we propose two novel data missing and expansion mechanisms to answer these questions. We use the data missing mechanisms to generate large, sparse data with various properties, and study the entire learning curve and convergence behavior of naive Bayes. We make several observations, which are verified through a detailed theoretical study. Our results are useful for learning from large sparse data in practice.
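
The claim that naive Bayes scales with the number of non-missing values can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each example is stored as a dictionary holding only its observed attribute values (absent keys are treated as missing), that attributes are categorical, and that Laplace smoothing is used. The names `train`, `predict`, and `alpha`, and the toy rating data, are hypothetical.

```python
import math
from collections import defaultdict

def train(rows, labels):
    """Count statistics for naive Bayes; absent keys in a row are missing values."""
    class_counts = defaultdict(int)   # label -> number of examples
    value_counts = defaultdict(int)   # (label, attr, value) -> count
    attr_counts = defaultdict(int)    # (label, attr) -> number of observed values
    values_seen = defaultdict(set)    # attr -> distinct observed values
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        # Only observed entries are visited, so the work done here is
        # proportional to the number of non-missing values.
        for attr, value in row.items():
            value_counts[(label, attr, value)] += 1
            attr_counts[(label, attr)] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, attr_counts, values_seen

def predict(model, row, alpha=1.0):
    """Return the label with the highest log posterior, skipping missing attributes."""
    class_counts, value_counts, attr_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for attr, value in row.items():
            # Number of distinct values seen for this attribute (at least 1).
            k = len(values_seen.get(attr, ())) or 1
            num = value_counts.get((label, attr, value), 0) + alpha
            den = attr_counts.get((label, attr), 0) + alpha * k
            score += math.log(num / den)  # Laplace-smoothed log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data (assumed for illustration): sparse ratings for two items.
rows = [{"m1": 5, "m2": 4}, {"m1": 5}, {"m1": 1, "m2": 2}, {"m2": 2}]
labels = ["a", "a", "b", "b"]
model = train(rows, labels)
```

Calling `predict(model, {"m1": 5})` scores each class using only the single observed attribute. Missing attributes contribute nothing to either training or prediction, which is why both the stored counts and the running time depend on the observed entries rather than on the full example-by-attribute matrix.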

INDEX TERMS

Prototypes, Convergence, Motion pictures, Upper bound, Big data, Training, Complexity theory

CITATION

Xiang Li,
Charles X. Ling,
Huaimin Wang,
"The Convergence Behavior of Naive Bayes on Large Sparse Datasets",

*2015 IEEE International Conference on Data Mining (ICDM)*, pp. 853-858, 2015, doi:10.1109/ICDM.2015.53