Fifth IEEE International Conference on Data Mining (ICDM'05)
Finding Representative Set from Massive Data
Houston, Texas
November 27-November 30
ISBN: 0-7695-2278-5
Feng Pan, University of North Carolina at Chapel Hill
Wei Wang, University of North Carolina at Chapel Hill
Anthony K. H. Tung, National University of Singapore
Jiong Yang, Case Western Reserve University
In the information age, data is pervasive, and in some applications the volume of data grows explosively. This massive volume poses challenges to both human users and computers. In this project, we propose a new model for identifying a representative set from a large database. A representative set is a special subset of the original dataset with three main characteristics: (1) it is significantly smaller than the original dataset; (2) it captures more information from the original dataset than any other subset of the same size; (3) it has low redundancy among the representatives it contains. We use information-theoretic measures such as mutual information and relative entropy to measure the representativeness of the representative set. We first design a greedy algorithm and then present a heuristic algorithm that delivers much better performance. We run experiments on two real datasets and evaluate the effectiveness of our representative set in terms of coverage and accuracy. The experiments show that our representative set attains the expected characteristics and captures information more efficiently.
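The abstract does not spell out the greedy algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each data item is a discrete probability distribution, scores a candidate set by the total relative entropy from every item to its closest representative, and greedily adds the item that lowers that score the most. The function names `relative_entropy`, `total_divergence`, and `greedy_representatives`, and the scoring rule itself, are illustrative assumptions.

```python
import math


def relative_entropy(p, q, eps=1e-12):
    """Relative entropy (KL divergence) D(p || q) in bits for discrete distributions."""
    # eps guards against division by zero when q assigns zero probability.
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def total_divergence(data, reps):
    """Assumed score: sum over items of the divergence to the closest representative."""
    return sum(min(relative_entropy(x, data[r]) for r in reps) for x in data)


def greedy_representatives(data, k):
    """Greedily pick up to k item indices; each step adds the item that lowers the score most."""
    chosen = []
    for _ in range(min(k, len(data))):
        candidates = [i for i in range(len(data)) if i not in chosen]
        best = min(candidates, key=lambda i: total_divergence(data, chosen + [i]))
        chosen.append(best)
    return chosen


# Hypothetical toy data: two groups of similar distributions.
data = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
reps = greedy_representatives(data, 2)
```

On this toy input the sketch selects one item from each group, which reflects the low-redundancy property described above: a second representative from an already covered group reduces the score far less than one from the uncovered group.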
Citation:
Feng Pan, Wei Wang, Anthony K. H. Tung, Jiong Yang, "Finding Representative Set from Massive Data," Fifth IEEE International Conference on Data Mining (ICDM'05), pp. 338-345, 2005