Issue No.01 - Jan. (2014 vol.26)

pp: 43-54

Sicheng Xiong , Oregon State University, Corvallis

Javad Azimi , InsightsOne Inc., Santa Clara

Xiaoli Z. Fern , Oregon State University, Corvallis

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.22

ABSTRACT

Semi-supervised clustering aims to improve clustering performance by considering user supervision in the form of pairwise constraints. In this paper, we study the active learning problem of selecting pairwise must-link and cannot-link constraints for semi-supervised clustering. We consider active learning in an iterative manner where in each iteration queries are selected based on the current clustering solution and the existing constraint set. We apply a general framework that builds on the concept of neighborhood, where neighborhoods contain "labeled examples" of different clusters according to the pairwise constraints. Our active learning method expands the neighborhoods by selecting informative points and querying their relationship with the neighborhoods. Under this framework, we build on the classic uncertainty-based principle and present a novel approach for computing the uncertainty associated with each data point. We further introduce a selection criterion that trades off the amount of uncertainty of each data point with the expected number of queries (the cost) required to resolve this uncertainty. This allows us to select queries that have the highest information rate. We evaluate the proposed method on the benchmark data sets and the results demonstrate consistent and substantial improvements over the current state of the art.
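The selection criterion described above can be illustrated with a minimal sketch. The abstract does not give formulas, so everything here is an assumption made for illustration: uncertainty is taken to be the Shannon entropy of a point's neighborhood-membership probabilities, the expected number of queries assumes neighborhoods are queried in decreasing order of probability until a must-link is returned, and the "information rate" is their ratio. The function names and the input probabilities are hypothetical, and the possibility that a point belongs to no existing neighborhood is ignored.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution (assumed uncertainty measure)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_queries(probs):
    """Expected query count, assuming neighborhoods are tried in decreasing
    order of probability and querying stops at the first must-link answer."""
    ordered = sorted(probs, reverse=True)
    return sum(rank * p for rank, p in enumerate(ordered, start=1))


def information_rate(probs):
    """Uncertainty per expected query (an assumed form of the trade-off)."""
    return entropy(probs) / expected_queries(probs)


def select_point(candidates):
    """Pick the candidate with the highest information rate.

    candidates: dict mapping a point id to its list of (hypothetical)
    probabilities of belonging to each existing neighborhood.
    """
    return max(candidates, key=lambda pid: information_rate(candidates[pid]))


if __name__ == "__main__":
    # Hypothetical membership probabilities for three unlabeled points.
    candidates = {
        "a": [0.5, 0.5],
        "b": [0.9, 0.1],
        "c": [1 / 3, 1 / 3, 1 / 3],
    }
    for pid, probs in candidates.items():
        print(pid, round(information_rate(probs), 3))
    print("selected:", select_point(candidates))
```

Dividing by the expected query count penalizes points whose uncertainty is spread over many neighborhoods and would therefore take many pairwise queries to resolve; how the membership probabilities themselves are estimated is outside the scope of this sketch.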

INDEX TERMS

Uncertainty, Nickel, Measurement uncertainty, Current measurement, Clustering algorithms, Probabilistic logic, Supervised learning, Semi-supervised learning, Active learning, Clustering

CITATION

Sicheng Xiong, Javad Azimi, Xiaoli Z. Fern, "Active Learning of Constraints for Semi-Supervised Clustering",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 26, no. 1, pp. 43-54, Jan. 2014, doi:10.1109/TKDE.2013.22