Exploring Correlated Subspaces for Efficient Query Processing in Sparse Databases
February 2010 (vol. 22 no. 2)
pp. 219-233
Bin Cui, Peking University, Beijing
Jiakui Zhao, Peking University, Beijing
Dongqing Yang, Peking University, Beijing
Sparse data are becoming increasingly common in many real-life applications. However, relatively little attention has been paid to modeling sparse data effectively, and existing approaches such as the conventional "horizontal" and "vertical" representations fail to provide satisfactory performance for both storage and query processing, because they are too rigid and generally do not consider dimension correlations. In this paper, we propose a new approach, named HoVer, to store and query sparse data sets in an unmodified RDBMS, where HoVer stands for Horizontal representation over Vertically partitioned subspaces. Based on the dimension correlations of sparse data sets, a novel mechanism has been developed to vertically partition a high-dimensional sparse data set into multiple lower-dimensional subspaces, such that dimensions are highly correlated within each subspace and largely unrelated across subspaces. The original data objects can therefore be represented in the horizontal format within their respective subspaces. With the HoVer representation, users can write SQL queries over the original horizontal view, and these queries can be easily rewritten into queries over the subspace tables. Experiments over synthetic and real-life data sets show that our approach is effective in finding correlated subspaces and yields superior performance for the storage and querying of sparse data.
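The storage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sample objects, the subspace assignment, and the names `OBJECTS`, `SUBSPACES`, `build`, `tables_for`, `all_objects`, and `horizontal` are all hypothetical, and the partitioning is hard-coded here rather than derived from dimension correlations as in the paper. The sketch uses Python's standard `sqlite3` module to create one horizontal table per subspace and a view that restores the original wide schema.

```python
import sqlite3

# Hypothetical sparse objects: each one populates only a few dimensions.
OBJECTS = {
    1: {"brand": "Acme", "model": "X1"},
    2: {"isbn": "12345", "author": "Doe"},
    3: {"brand": "Bolt", "model": "Z9", "author": "Roe"},
}

# Assumed partitioning. The paper derives it from dimension correlations;
# here it is hard-coded for illustration only.
SUBSPACES = {
    "s_product": ["brand", "model"],
    "s_book": ["isbn", "author"],
}


def build(conn, objects, subspaces):
    """Create one horizontal table per subspace plus a view over all of them."""
    conn.execute("CREATE TABLE all_objects (oid INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO all_objects VALUES (?)", [(oid,) for oid in objects])
    select_cols, joins = ["o.oid AS oid"], []
    for table, dims in subspaces.items():
        cols = ", ".join(f"{d} TEXT" for d in dims)
        conn.execute(f"CREATE TABLE {table} (oid INTEGER PRIMARY KEY, {cols})")
        marks = ", ".join("?" * (len(dims) + 1))
        for oid, attrs in objects.items():
            values = [attrs.get(d) for d in dims]
            # Skip objects that are entirely NULL in this subspace.
            if any(v is not None for v in values):
                conn.execute(f"INSERT INTO {table} VALUES ({marks})", [oid, *values])
        select_cols += [f"{table}.{d} AS {d}" for d in dims]
        joins.append(f"LEFT JOIN {table} ON {table}.oid = o.oid")
    # The view exposes the original horizontal schema to users.
    conn.execute(
        f"CREATE VIEW horizontal AS SELECT {', '.join(select_cols)} "
        f"FROM all_objects o {' '.join(joins)}"
    )


def tables_for(dims, subspaces):
    """Return the subspace tables that a query over `dims` needs to touch."""
    return [t for t, cols in subspaces.items() if set(dims) & set(cols)]


conn = sqlite3.connect(":memory:")
build(conn, OBJECTS, SUBSPACES)
rows = conn.execute("SELECT oid, brand, author FROM horizontal ORDER BY oid").fetchall()
print(rows)
print(tables_for(["brand"], SUBSPACES))
```

In this sketch a query that mentions only `brand` needs just the `s_product` table, which is the kind of rewriting over subspace tables the abstract refers to. How the paper actually chooses subspaces and rewrites queries is not specified in this record.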

Index Terms:
Sparse database, query processing, correlation, subspace, HoVer.
Bin Cui, Jiakui Zhao, Dongqing Yang, "Exploring Correlated Subspaces for Efficient Query Processing in Sparse Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 2, pp. 219-233, Feb. 2010, doi:10.1109/TKDE.2009.66