loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
International Parallel and Distributed Processing Symposium (IPDPS'03)
A Compilation Framework for Distributed Memory Parallelization of Data Mining Algorithms
Nice, France
April 22-April 26
ISBN: 0-7695-1926-1
Xiaogang Li, Ohio State University
Ruoming Jin, Ohio State University
Gagan Agrawal, Ohio State University

With the availability of large datasets in a variety of scientific and commercial domains, data mining has emerged as an important area within the last decade. Data mining techniques focus on finding novel and useful patterns or models from large datasets. Because of the volume of the data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines.

We believe that parallel compilation technology can be used for providing high-level language support for carrying out data mining implementations. Our study of a variety of popular data mining techniques has shown that they can be parallelized in a similar fashion. In our previous work, we have developed a middleware system that exploits this similarity to support distributed memory parallelization and execution on disk-resident datasets.

This paper focuses on developing a data parallel language interface for using our middleware?s functionality. We use a data parallel dialect of Java and show that it is well suited for data mining algorithms. Compiler techniques for translating this dialect to a middleware specification are presented. The most significant of these is a new technique for extracting a global reduction function from a data parallel loop.

We present a detailed experimental evaluation of our compiler using apriori association mining, k-means clustering, and k-nearest neighbor classifiers. Our experimental results show that: 1) compiler generated parallel data mining codes achieve high speedups in a cluster environment, 2) the performance of compiler generated codes is quite close to the performance of manually written codes, and 3) simple additional optimizations like inlining can further reduce the gap between compiled and manual codes.

Citation:
Xiaogang Li, Ruoming Jin, Gagan Agrawal, "A Compilation Framework for Distributed Memory Parallelization of Data Mining Algorithms," ipdps, pp.7a, International Parallel and Distributed Processing Symposium (IPDPS'03), 2003
Usage of this product signifies your acceptance of the Terms of Use.