Regression Cubes with Lossless Compression and Aggregation
December 2006 (vol. 18 no. 12)
pp. 1585-1599
As OLAP engines are widely used to support multidimensional data analysis, it is desirable to support in data cubes advanced statistical measures, such as regression and filtering, in addition to the traditional simple measures such as count and average. Such new measures will allow users to model, smooth, and predict the trends and patterns of data. Existing algorithms for simple distributive and algebraic measures are inadequate for efficient computation of statistical measures in a multidimensional space. In this paper, we propose a fundamentally new class of measures, compressible measures, in order to support efficient computation of the statistical models. For compressible measures, we compress each cell into an auxiliary matrix with a size independent of the number of tuples. We can then compute the statistical measures for any data cell from the compressed data of the lower-level cells without accessing the raw data. Time- and space-efficient lossless aggregation formulae are derived for regression and filtering measures. Our analytical and experimental studies show that the resulting system, regression cube, substantially reduces the memory usage and the overall response time for statistical analysis of multidimensional data.
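The idea of compressing a cell into a fixed-size summary and aggregating without the raw data can be illustrated with ordinary least-squares regression on a single predictor. The sketch below is an assumption-laden illustration, not the paper's formulae: it uses the textbook sufficient statistics (count, sums, and sums of products) as the per-cell summary, and the names `CellSummary`, `from_tuples`, `merge`, and `fit` are hypothetical. Merging two cells that hold disjoint sets of tuples is then a component-wise addition, and the fitted line for the merged cell matches a fit on the combined raw tuples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellSummary:
    """Fixed-size summary of a cell's (x, y) tuples for fitting y = a + b*x.

    Hypothetical illustration: five numbers, regardless of how many tuples
    the cell contains.
    """
    n: int = 0
    sx: float = 0.0   # sum of x
    sy: float = 0.0   # sum of y
    sxx: float = 0.0  # sum of x*x
    sxy: float = 0.0  # sum of x*y

    @staticmethod
    def from_tuples(points):
        """Compress a list of raw (x, y) tuples into a summary."""
        points = list(points)
        return CellSummary(
            n=len(points),
            sx=sum(x for x, _ in points),
            sy=sum(y for _, y in points),
            sxx=sum(x * x for x, _ in points),
            sxy=sum(x * y for x, y in points),
        )

    def merge(self, other):
        """Aggregate two cells over disjoint tuple sets; no raw data needed."""
        return CellSummary(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )

    def fit(self):
        """Return (intercept, slope) of the least-squares line."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


# Two lower-level cells, each compressed independently.
cell_a = [(1, 3), (2, 5)]
cell_b = [(3, 7), (4, 9)]
merged = CellSummary.from_tuples(cell_a).merge(CellSummary.from_tuples(cell_b))

# The merged summary is identical to one built from the combined raw tuples.
direct = CellSummary.from_tuples(cell_a + cell_b)
intercept, slope = merged.fit()
```

In this sketch the summary size stays at five numbers however many tuples a cell holds, which is the property the abstract attributes to compressible measures. The paper's own auxiliary matrices and aggregation formulae for regression and filtering may differ from this simplified form.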


Index Terms:
Aggregation, compression, data cubes, OLAP.
Citation:
Yixin Chen, Guozhu Dong, Jiawei Han, Jian Pei, Benjamin W. Wah, Jianyong Wang, "Regression Cubes with Lossless Compression and Aggregation," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 12, pp. 1585-1599, Dec. 2006, doi:10.1109/TKDE.2006.196