Issue No.01 - January (2011 vol.23)

pp: 64-78

Smith Tsang, The University of Hong Kong, Hong Kong

Ben Kao, The University of Hong Kong, Hong Kong

Kevin Y. Yip, Yale University, New Haven

Wai-Shing Ho, The University of Hong Kong, Hong Kong

Sau Dan Lee, The University of Hong Kong, Hong Kong

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.175

ABSTRACT

Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by one single value, but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median), we discover that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account the probability density function (pdf)) is utilized. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments have been conducted which show that the resulting classifiers are more accurate than those using value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU demanding than that for certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency.
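To make the contrast between averaging and using the full pdf concrete, the following Python sketch evaluates one candidate split point both ways. It is an illustration under stated assumptions, not the algorithm from the paper: the abstract does not spell out the construction procedure, so the sketch assumes each uncertain value is given as a discrete list of (value, probability) pairs and that a tuple is divided fractionally between the two sides of a split according to its probability mass. The names `entropy`, `left_mass`, `split_entropy`, and `to_mean`, and the toy data, are hypothetical.

```python
import math
from collections import defaultdict


def entropy(class_weights):
    """Shannon entropy (in bits) of a dict mapping class label -> fractional count."""
    total = sum(class_weights.values())
    if total == 0:
        return 0.0
    h = 0.0
    for w in class_weights.values():
        if w > 0:
            p = w / total
            h -= p * math.log2(p)
    return h


def left_mass(pdf, z):
    """Probability that the uncertain value is <= z.

    Assumption: pdf is a discrete list of (value, probability) pairs summing to 1.
    """
    return sum(p for v, p in pdf if v <= z)


def split_entropy(tuples, z):
    """Weighted entropy after splitting at z, dividing each tuple fractionally."""
    left, right = defaultdict(float), defaultdict(float)
    for pdf, label in tuples:
        pl = left_mass(pdf, z)
        left[label] += pl          # fraction of the tuple falling at or below z
        right[label] += 1.0 - pl   # remaining fraction falls above z
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left / n) * entropy(left) + (n_right / n) * entropy(right)


def to_mean(pdf):
    """Collapse a pdf to a point value at its mean (the averaging approach)."""
    mean = sum(v * p for v, p in pdf)
    return [(mean, 1.0)]


# Hypothetical toy data: two tuples, one numeric attribute, two classes.
data = [
    ([(1.0, 0.5), (3.0, 0.5)], "A"),
    ([(2.0, 0.5), (6.0, 0.5)], "B"),
]
averaged = [(to_mean(pdf), label) for pdf, label in data]

h_pdf = split_entropy(data, 2.5)       # uses the full distribution
h_avg = split_entropy(averaged, 2.5)   # uses only the means (2.0 and 4.0)
print(h_pdf, h_avg)
```

On this toy data the averaged values make the split at 2.5 look perfectly clean, whereas the distribution-based evaluation shows that each tuple straddles the split point. The sketch also hints at the cost noted in the abstract: every candidate split requires summing over pdf sample points rather than comparing single values.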

INDEX TERMS

Uncertain data, decision tree, classification, data mining.

CITATION

Smith Tsang, Ben Kao, Kevin Y. Yip, Wai-Shing Ho, Sau Dan Lee, "Decision Trees for Uncertain Data",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 23, no. 1, pp. 64-78, January 2011, doi:10.1109/TKDE.2009.175