Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors
September 1997 (vol. 8 no. 9)
pp. 943-958

Abstract: For years, the computation rate of processors has been much faster than the access rate of memory banks, and the gap has continued to widen. As a result, several shared-memory multiprocessors have more memory banks than processors. The goal of this paper is to provide a simple model, with only a few parameters, for the design and analysis of irregular parallel algorithms that gives a reasonable characterization of performance on such machines. For this purpose, we extend Valiant's bulk-synchronous parallel (BSP) model with two parameters: a memory bank delay d, the minimum time for servicing requests at a bank, and a memory bank expansion x, the ratio of the number of banks to the number of processors. We call this model the (d, x)-BSP. We show experimentally that the (d, x)-BSP captures the impact of bank contention and delay on the CRAY C90 and J90 for irregular access patterns, without modeling machine-specific details. The model has clarified the performance characteristics of several unstructured algorithms on the CRAY C90 and J90, and has allowed us to explore tradeoffs and optimizations for these algorithms. In addition to modeling individual algorithms directly, we also consider the use of the (d, x)-BSP as a bridging model for emulating a very high-level abstract model, the Parallel Random Access Machine (PRAM). We provide matching upper and lower bounds for emulating the EREW and QRQW PRAMs on the (d, x)-BSP.
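To make the two parameters concrete, the following Python sketch estimates how bank contention could dominate one superstep. It is an illustration only: the abstract does not state a cost formula, so the expression max(g*h, d*r) + L, the modulo mapping of addresses to banks, and the names `superstep_cost`, `g`, and `L` are all assumptions introduced here, not the authors' definitions.

```python
from collections import Counter

def superstep_cost(requests_per_proc, d, x, g=1, L=0):
    """Hypothetical cost of one superstep in a (d, x)-BSP-style model.

    requests_per_proc: one list of memory addresses per processor.
    d: bank delay (minimum time to service one request at a bank).
    x: bank expansion (banks per processor), assumed to be an integer.
    g, L: BSP-style gap and latency/synchronization terms (assumed).
    """
    p = len(requests_per_proc)
    num_banks = x * p
    # h: the largest number of requests issued by any one processor.
    h = max((len(reqs) for reqs in requests_per_proc), default=0)
    # Assumed address-to-bank mapping: simple modulo interleaving.
    loads = Counter()
    for reqs in requests_per_proc:
        loads.update(addr % num_banks for addr in reqs)
    # r: the largest number of requests arriving at any one bank.
    r = max(loads.values(), default=0)
    # Assumed cost: the slower of processor issue time and bank service time.
    return max(g * h, d * r) + L

# Two processors, four banks, bank delay 6.
spread = superstep_cost([[0, 1], [2, 3]], d=6, x=2)    # each bank gets 1 request
hot = superstep_cost([[0, 4], [8, 12]], d=6, x=2)      # all 4 requests hit bank 0
print(spread, hot)
```

Under these assumptions the evenly spread pattern costs 6 time units while the pattern that targets a single bank costs 24, which is the kind of contention effect the model is meant to expose.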

Index Terms:
Memory bank contention, memory delays, parallel machine models, performance analysis, parallel algorithms, shared memory, multiprocessors.
Guy E. Blelloch, Phillip B. Gibbons, Yossi Matias, Marco Zagha, "Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 9, pp. 943-958, Sept. 1997, doi:10.1109/71.615440