21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07)
Hybrid Clustering of Large Text Data
Niagara Falls, Ontario, Canada
May 21-23, 2007
ISBN: 0-7695-2847-3
Jacob Kogan, UMBC, USA
Clustering algorithms often require that the entire dataset be kept in computer memory. When the dataset is too large to fit into available memory, it has to be compressed before clustering algorithms can be applied. The Balanced Iterative Reducing and Clustering algorithm (BIRCH) is a clustering algorithm designed to operate under the assumption that "the amount of memory available is limited, whereas the dataset can be arbitrary large" [17]. The algorithm generates "a compact dataset summary" while minimizing the I/O cost involved. The "summaries" contain enough information to apply the well-known k-means clustering algorithm to the set of summaries and thereby generate partitions of the original dataset. An application of k-means requires an initial partition as input. To generate a "good" initial partition of the "summaries", this paper suggests a clustering algorithm, PDsDP, motivated by PDDP [3]. We report preliminary numerical experiments involving sequential applications of BIRCH, PDsDP, and k-means/Deterministic Annealing to the Enron email dataset.
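The idea of running k-means on summaries rather than on raw points can be illustrated with a minimal sketch. It is not the paper's implementation: the `Summary` class, the `summarize` and `weighted_kmeans` functions, and the toy data are all hypothetical names introduced here, the summary keeps only a point count and a linear sum (a simplified form of a BIRCH-style cluster feature), and the initial partition is supplied by hand rather than produced by PDsDP.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


@dataclass
class Summary:
    """Simplified cluster feature (assumption): point count and linear sum."""
    n: int
    linear_sum: Vector

    def centroid(self) -> Vector:
        return [s / self.n for s in self.linear_sum]


def summarize(batch: Sequence[Vector]) -> Summary:
    """Compress one batch of points into a single summary."""
    dim = len(batch[0])
    total = [0.0] * dim
    for point in batch:
        for j in range(dim):
            total[j] += point[j]
    return Summary(n=len(batch), linear_sum=total)


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(
    summaries: Sequence[Summary],
    labels: Sequence[int],
    k: int,
    max_iter: int = 100,
) -> Tuple[List[int], List[Optional[Vector]]]:
    """Lloyd-style iterations over summaries, starting from an initial partition."""
    labels = list(labels)
    dim = len(summaries[0].linear_sum)
    centers: List[Optional[Vector]] = [None] * k
    for _ in range(max_iter):
        # Centroid step: merging summaries only adds counts and linear sums,
        # so each summary is implicitly weighted by the points it represents.
        counts = [0] * k
        sums = [[0.0] * dim for _ in range(k)]
        for s, c in zip(summaries, labels):
            counts[c] += s.n
            for j in range(dim):
                sums[c][j] += s.linear_sum[j]
        centers = [
            [v / counts[c] for v in sums[c]] if counts[c] else None
            for c in range(k)
        ]
        # Assignment step: move each summary to the nearest non-empty center.
        live = [c for c in range(k) if centers[c] is not None]
        new_labels = [
            min(live, key=lambda c: sq_dist(s.centroid(), centers[c]))
            for s in summaries
        ]
        if new_labels == labels:
            break  # partition is stable; centers match the returned labels
        labels = new_labels
    return labels, centers


# Toy data (hypothetical): four batches, two near the origin, two near (5, 5).
batches = [
    [[0.0, 0.0], [0.2, 0.0]],
    [[0.0, 0.2], [0.2, 0.2]],
    [[5.0, 5.0], [5.2, 5.0]],
    [[5.0, 5.2], [5.2, 5.2]],
]
summaries = [summarize(b) for b in batches]
# Deliberately poor initial partition; k-means still needs one as input.
labels, centers = weighted_kmeans(summaries, labels=[0, 1, 0, 1], k=2)
print(labels, centers)
```

Only the summaries are held in memory during clustering, which is the point of the compression step. Each summary's label can then be propagated to every original point it represents to obtain a partition of the full dataset.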
Citation:
Jacob Kogan, "Hybrid Clustering of Large Text Data," 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07), vol. 1, pp. 367-372, 2007