TAPER: A Two-Step Approach for All-Strong-Pairs Correlation Query in Large Databases
April 2006 (vol. 18 no. 4)
pp. 493-508
Given a user-specified minimum correlation threshold \theta and a market-basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs whose correlation is above the threshold \theta. However, when the numbers of items and transactions are large, the computation cost of this query can be very high. The goal of this paper is to provide computationally efficient algorithms to answer the all-strong-pairs correlation query. To this end, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient, but also exhibits special monotone properties that allow many item pairs to be pruned without even computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that, in data sets with Zipf-like or linear rank-support distributions, the computational savings from pruning either are independent of the number of items or improve as that number increases. Experimental results from synthetic and real-world data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives. Finally, we demonstrate that the algorithmic ideas developed in the TAPER algorithm can be extended to efficiently compute negative correlations and the uncentered Pearson's correlation coefficient.
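
The filter-and-refine idea can be illustrated with a minimal sketch. It is not the paper's implementation, and the abstract does not state the bound itself. The sketch assumes the support-only bound that follows from supp(A,B) <= min(supp(A), supp(B)): for supp(A) >= supp(B), phi(A,B) <= sqrt(supp(B)/supp(A)) * sqrt((1 - supp(A))/(1 - supp(B))). For a fixed A this bound shrinks as supp(B) shrinks, so once it falls below \theta every remaining lower-support partner can be skipped. The function names (`phi`, `phi_upper_bound`, `strong_pairs`), the in-memory transaction-ID sets, and the use of `>=` for the threshold test are all choices made for this illustration.

```python
from math import sqrt


def phi(supp_a, supp_b, supp_ab):
    """Pearson's correlation of two binary items, computed from supports."""
    denom = sqrt(supp_a * supp_b * (1 - supp_a) * (1 - supp_b))
    return (supp_ab - supp_a * supp_b) / denom


def phi_upper_bound(supp_a, supp_b):
    """Support-only upper bound on phi; the caller ensures supp_a >= supp_b."""
    return sqrt(supp_b / supp_a) * sqrt((1 - supp_a) / (1 - supp_b))


def strong_pairs(transactions, theta):
    """Return item pairs with phi >= theta using a filter-and-refine scan.

    Illustrative only: `transactions` is assumed to be a list of sets of
    hashable items that fits in memory.
    """
    n = len(transactions)

    # Map each item to the set of transaction ids that contain it.
    tids = {}
    for tid, basket in enumerate(transactions):
        for item in basket:
            tids.setdefault(item, set()).add(tid)

    # phi is undefined for an item present in every transaction, so skip those.
    # Sort the rest by non-increasing support.
    items = sorted((i for i in tids if len(tids[i]) < n),
                   key=lambda i: -len(tids[i]))

    result = []
    for idx, a in enumerate(items):
        supp_a = len(tids[a]) / n
        for b in items[idx + 1:]:
            supp_b = len(tids[b]) / n
            # Filter step: the bound is monotone in supp_b, so once it drops
            # below theta, no later (lower-support) partner of a can qualify.
            if phi_upper_bound(supp_a, supp_b) < theta:
                break
            # Refine step: compute the exact correlation for surviving pairs.
            supp_ab = len(tids[a] & tids[b]) / n
            if phi(supp_a, supp_b, supp_ab) >= theta:
                result.append((a, b))
    return result
```

Calling `strong_pairs(baskets, 0.4)` on a list of item sets returns the qualifying pairs, each with the higher-support item first. A threshold that coincides with a pair's bound to within floating-point rounding may need a small tolerance in the filter test.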

Index Terms:
Association analysis, data mining, Pearson's correlation coefficient, statistical computing.
Hui Xiong, Shashi Shekhar, Pang-Ning Tan, Vipin Kumar, "TAPER: A Two-Step Approach for All-Strong-Pairs Correlation Query in Large Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 4, pp. 493-508, April 2006, doi:10.1109/TKDE.2006.68