2010 IEEE International Conference on Cluster Computing
Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases
Heraklion, Greece
September 20-September 24
ISBN: 978-0-7695-4220-1
BibTeX:
@article{ 10.1109/CLUSTER.2010.43,
  author = {Philippe Pébay and David Thompson and Janine Bennett},
  title = {Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases},
  journal = {2010 IEEE International Conference on Cluster Computing},
  volume = {0},
  year = {2010},
  isbn = {978-0-7695-4220-1},
  pages = {156-165},
  doi = {http://doi.ieeecomputersociety.org/10.1109/CLUSTER.2010.43},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}
Statistical analysis is typically used to reduce the dimensionality of, and infer meaning from, data. A key challenge for any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, from which numerous derived statistics such as joint and marginal probabilities, point-wise mutual information, information entropy, and χ² independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference from moment-based statistics (which we discussed in [1]), where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs that we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speedup and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.
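As an illustration only, the map-reduce pattern described in the abstract can be sketched in a few lines of Python. This is not the authors' implementation: the function names (`map_counts`, `reduce_counts`, `derived_statistics`) and the sample data are hypothetical, and the "processors" are simulated by plain lists. Each chunk is turned into a local contingency table, the tables are merged by adding counts, and the derived statistics named above are read off the merged table.

```python
from collections import Counter, defaultdict
from functools import reduce
from math import log


def map_counts(chunk):
    """Map step (hypothetical name): local contingency table of one chunk of (x, y) pairs."""
    return Counter(chunk)


def reduce_counts(a, b):
    """Reduce step (hypothetical name): merge two local tables by adding counts per key."""
    return a + b


def derived_statistics(table):
    """Derive joint/marginal probabilities, PMI, entropy and chi-squared from a table."""
    n = sum(table.values())
    joint = {xy: c / n for xy, c in table.items()}

    # Marginal probabilities are sums of the joint probabilities over the other variable.
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p

    # Point-wise mutual information, defined only for observed (x, y) pairs.
    pmi = {(x, y): log(p / (px[x] * py[y])) for (x, y), p in joint.items()}

    # Information entropy of the joint distribution (natural logarithm).
    entropy = -sum(p * log(p) for p in joint.values())

    # Pearson chi-squared statistic over all cells, including unobserved ones.
    chi2 = 0.0
    for x in px:
        for y in py:
            expected = n * px[x] * py[y]
            chi2 += (table.get((x, y), 0) - expected) ** 2 / expected

    return {
        "joint": joint,
        "marginal_x": dict(px),
        "marginal_y": dict(py),
        "pmi": pmi,
        "entropy": entropy,
        "chi2": chi2,
    }


# Made-up data: two chunks standing in for the data held by two processors.
chunks = [
    [("a", 0), ("a", 1), ("b", 0)],
    [("a", 0), ("b", 1), ("b", 1)],
]
merged = reduce(reduce_counts, map(map_counts, chunks))
stats = derived_statistics(merged)
print(len(merged), "distinct pairs in the merged table")
print(stats["chi2"])
```

In this sketch, what has to be exchanged in the reduce step is the local table itself, whose size is the number of distinct (x, y) pairs seen locally. That is the contrast with moment-based statistics drawn in the abstract, where the exchanged quantities have a fixed size regardless of how much data each processor holds.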
Citation:
Philippe Pébay, David Thompson, Janine Bennett, "Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases," 2010 IEEE International Conference on Cluster Computing, pp. 156-165, 2010.
