Compression and Aggregation for Logistic Regression Analysis in Data Cubes
April 2009 (vol. 21 no. 4)
pp. 479-492
Ruibin Xi, Washington University in St. Louis, St. Louis
Nan Lin, Washington University in St. Louis, St. Louis
Yixin Chen, Washington University in St. Louis, St. Louis
Logistic regression is an important technique for analyzing and predicting data with categorical attributes. In this paper, we consider supporting online analytical processing (OLAP) of logistic regression analysis for multi-dimensional data in a data cube, where building a logistic regression model for each cell from the raw data is expensive in both time and space. We propose a novel scheme that compresses the data so that logistic regression models can be reconstructed to answer any OLAP query without accessing the raw data. Based on a first-order approximation to the maximum likelihood estimating equations, we develop a compression scheme that reduces each base cell to a small compressed data block holding the essential information needed to aggregate logistic regression models. We give aggregation formulae for deriving high-level logistic regression models from lower-level component cells. We prove that the compression is nearly lossless, in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed compression and aggregation scheme can make OLAP of logistic regression in a data cube feasible.
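
The abstract does not spell out the compressed block or the aggregation formulae, so the following is only a minimal sketch of one way a first-order approximation to the estimating equations can lead to such a scheme. It assumes each cell k is compressed to its local maximum likelihood estimate b_k together with the information matrix A_k = X^T W X evaluated at b_k. Linearizing each cell's score around b_k and setting the summed score to zero then gives the aggregate (sum A_k)^{-1} (sum A_k b_k). The function names `fit_logistic`, `compress`, and `aggregate`, and the simulated data, are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_logistic(X, y, iters=50, tol=1e-10):
    """Newton-Raphson MLE for logistic regression.
    Returns (beta, A), where A = X^T W X is the information matrix at beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        A = X.T @ (X * (p * (1 - p))[:, None])
        step = np.linalg.solve(A, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute A at the final estimate so the block is self-consistent.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    A = X.T @ (X * (p * (1 - p))[:, None])
    return beta, A

def compress(X, y):
    """Assumed compressed block for one base cell: (local MLE, information)."""
    return fit_logistic(X, y)

def aggregate(blocks):
    """Combine blocks via the linearized score equations.
    The result is itself a block, so higher levels can be built from lower ones."""
    A_sum = sum(A for _, A in blocks)
    rhs = sum(A @ b for b, A in blocks)
    return np.linalg.solve(A_sum, rhs), A_sum

# Simulated base cells (illustrative data only).
rng = np.random.default_rng(0)
true_beta = np.array([-0.5, 1.0, -2.0])
cells = []
for _ in range(4):
    X = np.column_stack([np.ones(5000), rng.normal(size=(5000, 2))])
    y = (rng.random(5000) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
    cells.append((X, y))

blocks = [compress(X, y) for X, y in cells]
beta_agg, _ = aggregate(blocks)

# Reference: fit directly on the pooled raw data.
X_all = np.vstack([X for X, _ in cells])
y_all = np.concatenate([y for _, y in cells])
beta_full, _ = fit_logistic(X_all, y_all)

gap = np.max(np.abs(beta_agg - beta_full))
print("aggregated:", beta_agg)
print("pooled fit:", beta_full)
print("max abs gap:", gap)
```

Under these assumptions, each block stores only a coefficient vector and a small square matrix regardless of how many rows the cell holds, and because `aggregate` returns a block of the same shape, a higher-level cell can be assembled from already aggregated children rather than from the base cells.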

Index Terms:
Data mining, Statistical databases
Citation:
Ruibin Xi, Nan Lin, Yixin Chen, "Compression and Aggregation for Logistic Regression Analysis in Data Cubes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 4, pp. 479-492, April 2009, doi:10.1109/TKDE.2008.186