This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Sort vs. Hash Revisited
December 1994 (vol. 6 no. 6)
pp. 934-944

Efficient algorithms for processing large volumes of data are very important both for relational and new object-oriented database systems. Many query-processing operations can be implemented using sort- or hash-based algorithms, e.g. intersections, joins, and duplicate elimination. In the early relational database systems, only sort-based algorithms were employed. In the last decade, hash-based algorithms have gained acceptance and popularity, and are often considered generally superior to sort-based algorithms such as merge-join. In this article, we compare the concepts behind sort- and hash-based query-processing algorithms and conclude that (1) many dualities exist between the two types of algorithms, (2) their costs differ mostly by percentages rather than by factors, (3) several special cases exist that favor one or the other choice, and (4) there is a strong reason why both hash- and sort-based algorithms should be available in a query-processing system. Our conclusions are supported by experiments performed using the Volcano query execution engine.

[1] A. Aggarval and J. S. Vitter, "The input/output complexity of sorting and related problems,"Commun. ACMvol. 31, p. 1116, Oct. 1988.
[2] M. M. Astrahanet al., "System R: Relational approach to database management,"Trans. Database Syst., vol. 1, no. 1, pp. 97-137, 1976.
[3] M. Beck, D. Bitton, and W. K. Wilkinson, "Sorting large files on a backend multiprocessor,"IEEE Trans. Comput., vol. 37, p. 769, 1988.
[4] D. Bitton and D. J. Dewitt, "Duplicate record elimination in large data files,"ACM Trans. Database Syst., pp. 255-265, June 1983.
[5] D. Bitton Friedland, "Design, analysis, and implementation of parallel external sorting algorithms," Comput. Sci. Tech. Rep. 464, University of Wisconsin--Madison, Jan. 1982.
[6] M. Blasgen and K. Eswaran, "On the evaluation of queries in a relational database system," IBM Res. Rep. RJ-1745, San Jose, CA, USA, Apr. 8, 1976.
[7] M. Blasgen and K. Eswaran, "Storage and access in relational databases,"IBM Syst. J., vol. 16, no. 4, 1977.
[8] K. Bratbergsengen, "Hashing methods and relational algebra operations," inProc. Conf. Very Large Data Bases(Singapore), Aug. 1984, pp. 323-333.
[9] H.-T. Chou, D. J. Dewitt, R. H. Katz, and A. C. Klug, "Design and implementation of the Wisconsin storage system,"Software Practice and Experience, vol. 15, no. 10, pp. 943-962, Oct. 1985.
[10] D. J. DeWittet al., "Implementation techniques for main memory databases," inProc. ACM Sigmod(Boston, MA), June 18-21, 1984, pp. 1-8.
[11] D. J. DeWitt and R. H. Gerber, "Multiprocessor hash-based join algorithms,"Proc. Int. Conf. Very Large Data Bases, 1985, p. 151.
[12] D. Dewitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna, "GAMMA--A high performance dataflow database machine," inProc. 12th Int. Conf. VLDB, Kyoto, Japan, Aug. 1986, pp. 228-237.
[13] D. DeWitt, S. Ghandeharizadeh, and D. Schneider, "A performance analysis of the gamma database machine," inProc. ACM-SIGMOD Int. Conf. Management Data, Chicago, IL, May 1988.
[14] D. J. Dewitt, "The Wisconsin benchmark: Past, present, and future," in J. Gray, Ed.,Database and Transactions Processing Systems Performance Handbook. San Mateo, CA: Morgan Kaufmann, 1991.
[15] D. J. Dewitt, J. Naughton, and D. Schneider, "Parallel sorting on a shared-nothing architecture using probabilistic splitting,"Proc. Int. Conf. Parallel Distrib. Inform. Syst., Miami Beach, FL, USA, Dec. 1991.
[16] R. Epstein, "Techniques for processing of aggregates in relational database systems," UCB/Electron. Res. Lab. Memo. M79/8, Univ. of California, Feb. 1979.
[17] R. Fagan et al., "Extendible Hashing--A Fast-Access Method for Dynamic Files,"ACM Trans. Database Systems, Vol. 4, No. 3, Sept. 1979, pp. 315-344.
[18] S. Fushimi, M. Kitsuregawa, and H. Tanaka, "An overview of the system software of a parallel relational database machine GRACE," inProc. Conf. Very Large Data Bases(Kyoto, Japan), Aug. 1986, pp. 209-219.
[19] G. Graefe, "Encapsulation of parallelism in the Volcano query processing system," inProc. ACM SIGMOD Conf., Atlantic City, NJ, May 1990, p. 102.
[20] G. Graefe and D. L. Davison, "Encapsulation of parallelism architecture-independence in extensible database query processing,"IEEE TRans. Software Eng., vol. 19, no. 8, pp. 747, Aug. 1993.
[21] G. Graefe, "Parallel external sorting in Volcano," Tech. Rep. 459, Univ. of Colorado, Boulder, USA, Dept. Comput. Sci., 1991.
[22] G. Graefe, "Volcano: An extensible and parallel dataflow query processing system,"IEEE Trans. Knowledge. Data Eng., vol. 6, no. 1, pp. 120-135, Feb. 1994.
[23] G. Graefe and S. S. Thakkar, "Tuning a parallel database algorithm on a shared-memory multiprocessor,"Software-Practice and Experience, vol. 22, no. 7, July 1992, p. 485.
[24] L. M. Haas, W. F. Cody, J. C. Freytag, G. Lapis, B. G. Lindsay, G. M. Lohman, K. Ono, and H. Pirahesh, "An extensible processor for an extended relational query language," Comput. Sci. Res. Rep., San Jose, CA, USA, Apr. 1988.
[25] B. R. Iyer and D. M. Dias, "System issues in parallel sorting for database systems,"Proc. IEEE Conf. Data Eng.1990, p. 246.
[26] T. Keller and G. Graefe, "The one-to-one match operator of the Volcano query processing system," Oregon Graduate Center, Comput. Sci. Tech. Rep., Beaverton, OR, USA, June 1989.
[27] A. Kemper and G. Moerkotte, "Access support in object bases," inProc. ACM SIGMOD Conf. Mgmt. of Data, 1990, pp. 364-374.
[28] M. Kitsuregawa, H. Tanaka, and T. Motooka, "Application of hash to data base machine and its architecture,"New Generation Computing, vol. 1, 1983.
[29] M. Kitsuregawa, M. Nakayama, and M. Takagi, "The effect of bucket size tuning in dynamic hybrid GRACE hash join method," inProc. 15th Int. Conf. Very Large Data Bases, 1989, pp. 257-266.
[30] A. Klug, "Access paths in the 'ABE' statistical query facility,"Proc. ACM SIGMOD Conf., 1982, p. 161.
[31] D. E. Knuth,The Art of Computer Programming, Vol. 3, Reading, MA: Addison-Wesley, 1973.
[32] R. Kooi, "The optimization of queries in relational databases," Ph.D. dissert., Case Western Univ., Cleveland, OH, 1980.
[33] R. A. Lorie and H. C. Young, "A low communication sort algorithm for a parallel database machine,"Proc. Int. Conf. Very Large Data Bases, 1989, p. 125.
[34] J. Menon, "A study of sort algorithms for multiprocessor database machines,"Proc. Int. Conf. Very Large Data Bases, 1986, p. 197.
[35] M. Nakayama, M. Kitsuregawa, and M. Takagi, "Hash-partitioned join method using dynamic destaging strategy," inProc. Conf. Very Large Databases(Los Angeles, CA), Aug. 1988, pp. 468-478.
[36] J. Ousterhout, "Why aren't operating systems getting faster as fast as hardware?" WRL Tech. Rep. TN-11, Palo Alto, CA, USA, Oct. 1989.
[37] J. E. Richardson and M. J. Carey, "Programming constructs for database system implementations in EXODUS," in [51], pp. 208-219.
[38] B. Salzberg,File Structures: An Analytic Approach. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[39] B. Salzberg, "Merging sorted runs using large main memory,"Acta Informatica, vol. 27, p. 195, 1990.
[40] B. Salzberg, A. Tsukerman, J. Gray, M. Stewart, S. Uren, and B. Vaughan "FastSort: A distributed single-input single-output external sort,"Proc. ACM SIGMOD Conf., 1990, p. 94.
[41] D. Schneider and D. Dewitt, "A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment," inProc. ACM SIGMOD Conf.(Portland, OR), May-June 1989, p. 110.
[42] P. Selinger,et al., "Access path selection in a relational data base system," inProc. 1979 ACM-SIGMOD Int. Conf. Management of Data, Boston, MA, June 1979.
[43] L. D. Shapiro, "Join processing in database systems with large main memories,"ACM Trans. Database Syst., vol. 11, no. 3, pp. 239-264, Sept. 1986.
[44] M. Stonebraker,et al., "The design and implementation of INGRES,"ACM Trans. Database Syst., vol. 1, no. 3, Sept. 1976.
[45] Teradata Corp., DBC/1012,Data Base Computer, Concepts, and Facilities, Los Angeles, CA, USA, 1983.
[46] S. Todd, "PRTV: An efficient implementation for large relational data bases,"Proc. Int. Conf. Very Large Data Bases, 1975, p. 554.
[47] H. Zeller and J. Gray, "an adaptive hash join algorithm for multiuser environments,"Proc. Int. Conf. Very Large Data Bases, Brisbane, Australia, 1990.
[48] G. K. Zipf,Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Reading, MA: Addison-Wesley, 1949.

Index Terms:
query processing; sorting; file organisation; relational databases; object-oriented databases; database theory; relational database systems; object-oriented database systems; query-processing operations; sort-based algorithms; hash-based algorithms; intersections; joins; duplicate elimination; merge-join algorithm; dualities; costs; Volcano query execution engine; value matching; performance
Citation:
G. Graefe, A. Linville, L.D. Shapiro, "Sort vs. Hash Revisited," IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 6, pp. 934-944, Dec. 1994, doi:10.1109/69.334883
Usage of this product signifies your acceptance of the Terms of Use.