Managing Statistical Behavior of Large Data Sets in Shared-Nothing Architectures
November 1998 (vol. 9 no. 11)
pp. 1073-1087

Abstract: Increasingly large data sets are being stored in networked architectures. Many of the available data structures are not easily amenable to parallel realizations. Hashing schemes show promise in that respect because the underlying data structure can be decomposed and spread among the set of cooperating nodes with minimal communication and maintenance requirements. In all cases, storage utilization and load balancing are issues that need to be addressed. One can identify two basic approaches to the problem. One is to address it as part of the design of the data structure that is used to store and retrieve the data. The other is to keep the data structure intact and address the problem separately. The method that we present here falls in the latter category and is applicable whenever a hash table is the preferred data structure. Intrinsic to any hash table is a hashing function that partitions a possibly unbounded set of data items into a finite set of groups by assigning each data item to one of the groups. In general, the hashing function cannot guarantee that the various groups will have the same cardinality, on average, for all possible data item distributions. In this paper, we propose a two-stage methodology that uses knowledge of the hashing function to reorganize the group assignments so that the resulting groups have similar expected cardinalities. The method is generally applicable and independent of the hashing function used. We show the power of the methodology using both synthetic and real-world databases. The derived quasi-uniform storage occupancy and associated load-balancing gains are significant.
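
The abstract does not spell out the two-stage methodology, so the following is only a rough sketch of the general idea of reorganizing group assignments, not the authors' method. It assumes per-bucket occupancy can be estimated from a sample, and it swaps in a standard greedy heuristic (heaviest bucket goes to the currently lightest node) to map hash buckets onto nodes. The function names bucket_counts, assign_buckets, and node_loads are hypothetical.

```python
import heapq
from collections import Counter


def bucket_counts(items, hash_fn, num_buckets):
    """Count how many sampled items the hashing function sends to each bucket."""
    counts = Counter(hash_fn(x) % num_buckets for x in items)
    return [counts.get(b, 0) for b in range(num_buckets)]


def assign_buckets(counts, num_nodes):
    """Greedy reassignment (an assumed stand-in, not the paper's algorithm):
    visit buckets from heaviest to lightest and give each one to the node
    whose accumulated load is currently smallest."""
    heap = [(0, node) for node in range(num_nodes)]  # (load, node id)
    heapq.heapify(heap)
    assignment = {}
    for bucket in sorted(range(len(counts)), key=lambda b: -counts[b]):
        load, node = heapq.heappop(heap)
        assignment[bucket] = node
        heapq.heappush(heap, (load + counts[bucket], node))
    return assignment


def node_loads(counts, assignment, num_nodes):
    """Total number of items stored on each node under a bucket-to-node map."""
    loads = [0] * num_nodes
    for bucket, node in assignment.items():
        loads[node] += counts[bucket]
    return loads


if __name__ == "__main__":
    # Hypothetical, deliberately skewed bucket occupancies.
    counts = [10, 1, 1, 1, 1, 6, 2, 2]
    naive = {b: b % 2 for b in range(len(counts))}
    print("naive loads:   ", node_loads(counts, naive, 2))
    print("balanced loads:", node_loads(counts, assign_buckets(counts, 2), 2))
```

With these made-up counts, the modulo placement leaves one node holding 14 items and the other 10, while the greedy reassignment evens them out. The hashing function itself is untouched; only the bucket-to-node map changes, which is the sense in which the data structure stays intact.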

Index Terms:
Hashing, large databases, statistical behavior, uniform occupancy, load balancing, shared-nothing architectures.
Isidore Rigoutsos, Alex Delis, "Managing Statistical Behavior of Large Data Sets in Shared-Nothing Architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 11, pp. 1073-1087, Nov. 1998, doi:10.1109/71.735955