Utility and Cloud Computing, IEEE International Conference on (2011)
Melbourne, Victoria, Australia
Dec. 5, 2011 to Dec. 8, 2011
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/UCC.2011.15
There is often a need to cluster large volumes of data. Such clustering has applications in fields such as pattern recognition, data mining, bioinformatics, and recommendation systems. Here we evaluate the performance of four clustering algorithms, namely K-means, fuzzy K-means, Dirichlet, and Latent Dirichlet Allocation, within two different cloud runtimes: Hadoop and Granules. Our benchmarks use identical clustering code with both Hadoop and Granules. The differences between these implementations stem from how the Hadoop and Granules runtimes (1) support and manage the lifecycle of individual computations and (2) orchestrate the exchange of data between different stages of the computational pipeline during successive iterations of the clustering algorithm. We also include an analysis of our results for each of these clustering algorithms in a distributed setting.
Machine Learning, Distributed Stream Processing, Hadoop, Mahout, Clustering, Granules
S. Pallickara and K. Ericson, "On the Performance of Distributed Clustering Algorithms in File and Streaming Processing Systems," 2011 IEEE 4th International Conference on Utility and Cloud Computing (UCC 2011), Melbourne, Victoria, Australia, 2011, pp. 33-40.