Issue No.06 - June (2012 vol.61)

pp: 817-830

Yu Hua, Huazhong University of Science and Technology, Wuhan

Bin Xiao, The Hong Kong Polytechnic University, Hung Hom

Bharadwaj Veeravalli, The National University of Singapore, Singapore

Dan Feng, Huazhong University of Science and Technology, Wuhan

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2011.108

ABSTRACT

In many network applications, Bloom filters are used to support exact-matching membership queries because they are randomized, space-efficient data structures with a small probability of false answers. In this paper, we extend the standard Bloom filter to a Locality-Sensitive Bloom Filter (LSBF) that provides an Approximate Membership Query (AMQ) service. We achieve this by replacing uniform and independent hash functions with locality-sensitive hash functions. This replacement makes the storage in LSBF locality sensitive. Meanwhile, by employing the Bloom filter design, LSBF remains space efficient and responsive to queries. In the design of the LSBF structure, we propose a bit vector to reduce False Positives (FP). The bit vector can verify multiple attributes belonging to one member. We also use an active overflowed scheme to significantly decrease False Negatives (FN). Rigorous theoretical analysis (e.g., on FP, FN, and space overhead) shows that the design of LSBF is space compact and can provide accurate responses to approximate membership queries. We have implemented LSBF in a real distributed system to perform extensive experiments using real-world traces. Experimental results show that LSBF, compared with a baseline approach and other state-of-the-art work in the literature (SmartStore and LSB-tree), takes less time to respond to AMQ and consumes much less storage space.
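The core idea in the abstract, replacing a Bloom filter's uniform hash functions with locality-sensitive ones, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `LSBFSketch`, its parameters, and the choice of a p-stable (Gaussian projection) hash family are assumptions made for illustration, and the sketch leaves out the verification bit vector and the overflow scheme that the abstract describes.

```python
import math
import random


class LSBFSketch:
    """Toy Bloom filter whose hash functions are locality sensitive.

    Hypothetical illustration only: all names and defaults are assumptions.
    """

    def __init__(self, dim, num_bits=1024, num_hashes=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.num_bits = num_bits
        self.width = width
        self.bits = [False] * num_bits
        # Each hash is h(v) = floor((a . v + b) / width), with a drawn from a
        # Gaussian (2-stable) distribution and b drawn uniformly from [0, width].
        self.funcs = [
            ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
            for _ in range(num_hashes)
        ]

    def _positions(self, vector):
        # Map the vector to one bit position per hash function.
        for a, b in self.funcs:
            projection = sum(ai * vi for ai, vi in zip(a, vector)) + b
            yield math.floor(projection / self.width) % self.num_bits

    def insert(self, vector):
        for pos in self._positions(vector):
            self.bits[pos] = True

    def query(self, vector):
        # True means "possibly close to a stored item"; False means no match
        # was found in this simplified structure.
        return all(self.bits[pos] for pos in self._positions(vector))
```

Because nearby vectors tend to fall into the same projection bucket, a query for a point close to an inserted item is likely, though not guaranteed, to hit the same bits. A near neighbor that falls just across a bucket boundary is missed, which is the kind of false negative the abstract's overflow scheme is meant to reduce.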

INDEX TERMS

Approximate membership query, Bloom filters, locality-sensitive hashing.

CITATION

Yu Hua, Bin Xiao, Bharadwaj Veeravalli, Dan Feng, "Locality-Sensitive Bloom Filter for Approximate Membership Query",

*IEEE Transactions on Computers*, vol. 61, no. 6, pp. 817-830, June 2012, doi:10.1109/TC.2011.108