Memory Hierarchy Considerations for Cost-Effective Cluster Computing
September 2000 (vol. 49 no. 9)
pp. 915-933

Abstract—Using off-the-shelf commodity workstations and PCs to build a cluster for parallel computing has become common practice. The cost-effectiveness of a cluster computing platform for a given budget and for certain types of applications is mainly determined by its memory hierarchy and the interconnection network configurations of the cluster. Finding a cost-effective solution through exhaustive simulation would be highly time-consuming, and predicting performance from measurements on existing clusters would be impractical. We present an analytical model for evaluating the performance impact of memory hierarchies and networks on cluster computing. By varying architectural parameters, the model covers the memory hierarchy of a single SMP, a cluster of workstations/PCs, or a cluster of SMPs. Network variations covering both bus and switch networks are also included in the analysis. Different types of applications are characterized by parameterized workloads with different computation and communication requirements. The model has been validated by simulations and measurements. The workloads used in our experiments include both scientific applications and commercial workloads. Our study shows that the depth of the memory hierarchy is the most sensitive factor affecting the execution time for many types of workloads. However, the interconnection network of a tightly coupled system with a shallow memory hierarchy, such as an SMP, is significantly more expensive than a normal cluster network connecting independent computer nodes. Thus, the essential issue to be considered is the trade-off between the depth of the memory hierarchy and the system cost. Based on analyses and case studies, we present quantitative recommendations for building cost-effective clusters for different workloads.
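The abstract's central claim, that hierarchy depth dominates execution time, can be illustrated with a standard average-memory-access-time calculation in the style of Hennessy and Patterson [11]. The sketch below is not the paper's analytical model; the hierarchy levels, hit rates, and latencies are hypothetical numbers chosen only to show how adding a level (e.g., a cluster interconnect) shifts the average access cost:

```python
def avg_access_time(levels):
    """Average memory access time for a hierarchy.

    levels: list of (hit_rate, latency_ns) pairs, ordered from the
    fastest level (e.g., L1 cache) to the slowest. The last level is
    expected to have hit_rate 1.0 (every access is eventually served).
    """
    total = 0.0
    reach = 1.0  # probability that an access reaches this level
    for hit_rate, latency in levels:
        total += reach * latency        # all accesses reaching this level pay its latency
        reach *= (1.0 - hit_rate)       # the rest fall through to the next level
    return total

# Hypothetical shallow hierarchy (SMP-like): L1, L2, local memory.
smp = [(0.95, 2.0), (0.90, 10.0), (1.0, 100.0)]

# Same hierarchy with one extra, much slower level (cluster network hop).
cluster = [(0.95, 2.0), (0.90, 10.0), (0.80, 100.0), (1.0, 5000.0)]

print(avg_access_time(smp))      # 2 + 0.05*10 + 0.005*100 = 3.0 ns
print(avg_access_time(cluster))  # 2 + 0.5 + 0.5 + 0.001*5000 = 8.0 ns
```

Even with a 99.9% combined hit rate above it, the added network level dominates the average in this toy example, which is the intuition behind trading hierarchy depth against interconnect cost.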

[1] G.A. Abandah and E.S. Davidson, “Configuration Independent Analysis for Characterizing Shared-Memory Applications,” Proc. 12th Int'l Parallel Processing Symp., pp. 485-491, Apr. 1998.
[2] D. Bailey et al., “The NAS Parallel Benchmarks,” Int'l J. Supercomputing Applications, vol. 5, no. 3, pp. 63-73, Fall 1991.
[3] F. Bergholm, “Edge Focusing,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 9, no. 6, pp. 726-741, June 1987.
[4] D. Bhandarkar and J. Ding, “Performance Characterization of the Pentium Pro Processor,” Proc. Third Int'l Symp. High Performance Computer Architecture (HPCA'97), pp. 288-297, Feb. 1997.
[5] C.K. Chow, “On Optimization of Storage Hierarchies,” IBM J. Research and Development, pp. 194-203, May 1974.
[6] E.G. Coffman and P.J. Denning, Operating Systems Theory. Englewood Cliffs, N.J.: Prentice-Hall, 1973.
[7] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina, “Architectural Requirements of Parallel Scientific Applications with Explicit Communication,” Proc. 20th Ann. Int'l Symp. Computer Architecture, pp. 2-13, May 1993.
[8] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart, LINPACK User's Guide. Philadelphia: SIAM, 1979.
[9] X. Du and X. Zhang, “Performance Models and Simulation,” High Performance Cluster Computing, R. Buyya, ed., vol. 1, chapter 1, pp. 135-153. Upper Saddle River, N.J.: Prentice Hall, 1999.
[10] M. Heinrich et al., “The Performance Impact of Flexibility in the Stanford FLASH Multiprocessor,” Proc. Sixth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 274-284, 1994.
[11] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1995.
[12] K. Keeton et al., “Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads,” Proc. 25th Ann. Int'l Symp. Computer Architecture, pp. 15-26, 1998.
[13] K. Kennedy, C.F. Bender, J.W.D. Connolly, J.L. Hennessy, M.K. Vernon, and L. Smarr, “A National Parallel Computing Environment,” Comm. ACM, vol. 40, no. 11, pp. 63-72, 1997.
[14] High Performance Fortran Forum, High Performance Fortran Language Specification Version 1.0, Draft, Jan. 1993.
[15] B.L. Jacob, P.M. Chen, S.R. Silverman, and T.N. Mudge, “An Analytical Model for Designing Memory Hierarchies,” IEEE Trans. Computers, vol. 45, no. 10, pp. 1180-1194, Oct. 1996.
[16] L. McVoy and C. Staelin, “lmbench: Portable Tools for Performance Analysis,” Proc. 1996 USENIX Technical Conf., pp. 279-295, Jan. 1996.
[17] S.M. Ross, Introduction to Probability Models, sixth ed. San Diego: Academic Press, 1997.
[18] R. Samanta et al., “Home-Based SVM Protocols for SMP Clusters: Design, Simulation, Implementation and Performance,” Proc. Fourth Int'l Symp. High Performance Computer Architecture, Feb. 1998.
[19] J.P. Singh, W.D. Weber, and A. Gupta, “SPLASH: Stanford Parallel Applications for Shared Memory,” Proc. 19th Ann. Int'l Symp. Computer Architecture, pp. 5-14, May 1992.
[20] A.J. Smith, “Cache Memories,” ACM Computing Surveys, vol. 14, no. 3, pp. 473-540, Sept. 1982.
[21] R. Stets et al., “CASHMERE-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network,” Proc. 16th ACM Symp. Operating Systems Principles, Oct. 1997.
[22] H.S. Stone, High Performance Computer Architecture. Addison-Wesley, 1993.
[23] Transaction Processing Performance Council, TPC Benchmark C, TPC Benchmark C Standard Specification, Revision 3.3.3, Apr. 1998.
[24] K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Prentice Hall, 1982.
[25] J.E. Veenstra and R.J. Fowler, “MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors,” Proc. Second Int'l Workshop Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, p. 201, Jan. 1994.
[26] T.A. Welch, “Memory Hierarchy Configuration Analysis,” IEEE Trans. Computers, vol. 27, no. 5, pp. 408-415, May 1978.
[27] S.C. Woo et al., “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. 22nd Ann. Int'l Symp. Computer Architecture, pp. 24-36, June 1995.
[28] X. Zhang, S.G. Dykes, and H. Deng, “Distributed Edge Detection: Issues and Implementations,” IEEE Computational Science & Engineering, pp. 72-82, Spring 1997.
[29] Y. Zhou et al., “Relaxed Consistency and Coherence Granularity in DSM Systems: A Performance Evaluation,” Proc. Sixth ACM Symp. Principles and Practice of Parallel Programming, June 1997.

Index Terms:
Clusters, cost model, memory hierarchy, performance evaluation, SMP, workstations.
Xing Du, Xiaodong Zhang, Zhichun Zhu, "Memory Hierarchy Considerations for Cost-Effective Cluster Computing," IEEE Transactions on Computers, vol. 49, no. 9, pp. 915-933, Sept. 2000, doi:10.1109/12.869323