2012 IEEE 12th International Conference on Data Mining (2012)

Brussels, Belgium

Dec. 10, 2012 to Dec. 13, 2012

ISSN: 1550-4786

ISBN: 978-1-4673-4649-8

pp: 972-977

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2012.89

ABSTRACT

Automated text classification is one of the most important learning technologies to fight information overload. However, the information society is confronted not only with an information flood but also with an increase in "information volatility", by which we understand the fact that the kind and distribution of a data source's emissions can vary significantly. In this paper we show how to estimate the expected effectiveness of a classification solution when the underlying data source undergoes a shift in the distribution of its subclasses (modes). Subclass distribution shifts are observed, among others, in online media such as tweets, blogs, or news articles, where document emissions follow topic popularity. To estimate the expected effectiveness of a classification solution we partition a test sample by means of clustering. Then, using repetitive resampling with different margin distributions over the clustering, the effectiveness characteristics are studied. We show that the effectiveness is normally distributed and introduce a probabilistic lower bound that is used for model selection. We analyze the relation between our notion of expected effectiveness and the mean effectiveness over the clustering both theoretically and on standard text corpora. An important result is a heuristic for expected effectiveness estimation that is solely based on the initial test sample and that can be computed without resampling.
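The resampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flat Dirichlet prior over cluster proportions, the use of accuracy as the effectiveness measure, the toy data, and the one-sided bound `mean - z * std` are all assumptions made here for illustration. The sketch takes a test sample whose documents already carry a cluster label and a correct/incorrect flag, repeatedly redraws the sample under random cluster proportions, and summarizes the resulting accuracies.

```python
import random
import statistics


def sample_cluster_weights(k, rng):
    """Draw a random distribution over k clusters.

    Assumption: a flat Dirichlet, obtained by normalizing Gamma(1, 1) draws.
    """
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def resampled_accuracies(correct, cluster_ids, n_rounds=1000, sample_size=200, seed=0):
    """Resample the test set under shifted cluster proportions.

    correct[i] is 1 if test document i was classified correctly, else 0;
    cluster_ids[i] is the (hypothetical) cluster of document i.
    Returns one accuracy value per resampling round.
    """
    rng = random.Random(seed)
    by_cluster = {}
    for ok, c in zip(correct, cluster_ids):
        by_cluster.setdefault(c, []).append(ok)
    clusters = sorted(by_cluster)

    scores = []
    for _ in range(n_rounds):
        weights = sample_cluster_weights(len(clusters), rng)
        # Pick a cluster per drawn document, then a document from that cluster.
        picks = rng.choices(clusters, weights=weights, k=sample_size)
        hits = sum(rng.choice(by_cluster[c]) for c in picks)
        scores.append(hits / sample_size)
    return scores


def lower_bound(scores, z=1.645):
    """One-sided bound under a normal approximation (z = 1.645 is roughly 95%)."""
    return statistics.mean(scores) - z * statistics.stdev(scores)


# Toy data (made up): cluster 0 is classified at 90% accuracy, cluster 1 at 60%.
correct = [1] * 90 + [0] * 10 + [1] * 60 + [0] * 40
cluster_ids = [0] * 100 + [1] * 100

scores = resampled_accuracies(correct, cluster_ids)
print("mean accuracy:", statistics.mean(scores))
print("lower bound:  ", lower_bound(scores))
```

In this sketch every cluster has the same expected weight under the flat Dirichlet, so the mean of the resampled accuracies lies close to the unweighted mean of the per-cluster accuracies (0.75 for the toy data). For model selection one would, under these assumptions, compare the `lower_bound` values of competing classifiers rather than their accuracy on the original sample.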

INDEX TERMS

Clustering, Classification, Concept Drift, Unknown Distributions, Model Selection

CITATION

Nedim Lipka,
Benno Stein,
James G. Shanahan,
"Estimating the Expected Effectiveness of Text Classification Solutions under Subclass Distribution Shifts",

*2012 IEEE 12th International Conference on Data Mining*, pp. 972-977, 2012, doi:10.1109/ICDM.2012.89