
Isidore Rigoutsos, Alex Delis, "Managing Statistical Behavior of Large Data Sets in Shared-Nothing Architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 11, pp. 1073-1087, November 1998. doi:10.1109/71.735955. Publisher: IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1045-9219.

Keywords: hashing, large databases, statistical behavior, uniform occupancy, load balancing, shared-nothing architectures.
Abstract: Increasingly large data sets are being stored in networked architectures. Many of the available data structures are not easily amenable to parallel realizations. Hashing schemes show promise in that respect, because the underlying data structure can be decomposed and spread among the set of cooperating nodes with minimal communication and maintenance requirements. In all cases, storage utilization and load balancing are issues that need to be addressed. There are two basic approaches to the problem. One is to address it as part of the design of the data structure that is used to store and retrieve the data. The other is to leave the data structure intact and address the problem separately. The method that we present here falls in the latter category and is applicable whenever a hash table is the preferred data structure. Every hash table comes with a hashing function that partitions a possibly unbounded set of data items into a finite set of groups by assigning each data item to one of the groups. In general, the hashing function cannot guarantee that the various groups will have the same cardinality, on average, for all possible data item distributions. In this paper, we propose a two-stage methodology that uses knowledge of the hashing function to reorganize the group assignments so that the resulting groups have similar expected cardinalities. The method is generally applicable and independent of the hashing function used. We show the power of the methodology using both synthetic and real-world databases. The derived quasi-uniform storage occupancy and associated load-balancing gains are significant.
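To make the problem statement concrete, the sketch below shows a fixed hashing function producing groups of unequal cardinality, and a separate remapping step that evens out the per-node load without touching the hashing function. The abstract does not spell out the two stages of the proposed methodology, so this is not the authors' algorithm: it substitutes a standard greedy heuristic (heaviest group to the least loaded node) purely for illustration. The function names, the modulo hashing function, and the sample data are all assumptions.

```python
import heapq
from collections import Counter


def bucket_of(key: int, num_buckets: int) -> int:
    """Hypothetical hashing function (plain modulo), assumed for illustration."""
    return key % num_buckets


def occupancy(keys, num_buckets):
    """Count how many keys the hashing function places in each group."""
    counts = Counter(bucket_of(k, num_buckets) for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]


def contiguous_assignment(num_buckets, num_nodes):
    """Naive placement: split the group index range into equal contiguous blocks."""
    return {b: b * num_nodes // num_buckets for b in range(num_buckets)}


def greedy_assignment(counts, num_nodes):
    """Greedy remapping (not the paper's method): visit groups from heaviest
    to lightest and give each one to the currently least loaded node."""
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    assignment = {}
    for bucket in sorted(range(len(counts)), key=lambda b: -counts[b]):
        load, node = heapq.heappop(heap)
        assignment[bucket] = node
        heapq.heappush(heap, (load + counts[bucket], node))
    return assignment


def node_loads(counts, assignment, num_nodes):
    """Total number of items stored on each node under an assignment."""
    loads = [0] * num_nodes
    for bucket, node in assignment.items():
        loads[node] += counts[bucket]
    return loads


# Synthetic, deliberately skewed data: group b receives skew[b] keys.
NUM_BUCKETS, NUM_NODES = 8, 2
skew = [40, 30, 10, 10, 5, 5, 0, 0]
keys = [b + NUM_BUCKETS * i for b, n in enumerate(skew) for i in range(n)]

counts = occupancy(keys, NUM_BUCKETS)
naive = node_loads(counts, contiguous_assignment(NUM_BUCKETS, NUM_NODES), NUM_NODES)
balanced = node_loads(counts, greedy_assignment(counts, NUM_NODES), NUM_NODES)
print("naive loads:   ", naive)
print("balanced loads:", balanced)
```

The hashing function is left unchanged throughout; only the mapping from groups to nodes is reorganized, which mirrors the abstract's point that the problem can be addressed separately from the data structure itself.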