Issue No.04 - April (2011 vol.60)
pp: 472-483
Thomas J. Ashby, IMEC, Leuven
Pedro Díaz, University of Edinburgh, Edinburgh
Marcelo Cintra, University of Edinburgh, Edinburgh
ABSTRACT
Implementing shared memory consistency models on top of hardware caches gives rise to the well-known cache coherence problem. The standard solution involves implementing coherence protocols in hardware, an approach with some design complexity, hardware costs, and restrictions on interconnect behavior. However, for some memory consistency models, an alternative is to enforce coherence in the software implementation of synchronization primitives, using software controlled invalidations and forced writebacks. This requires minimal hardware support but gives less selective enforcement, which affects performance. This paper proposes a novel hybrid software-hardware coherence mechanism. In this scheme, software is responsible for triggering the coherence actions—self-invalidations and writebacks—at appropriate times while hardware uses Bloom filters to perform more selective self-invalidations. We evaluate the proposed scheme on applications from two different domains: the SPLASH-2 scientific and ALP multimedia benchmarks. Experimental results show that while the software-only coherence scheme shows less performance degradation than expected, it still unacceptably degrades performance for some of the benchmarks. Filtering out unnecessary invalidations improves the worst-case performance by as much as 93 percent, and brings the performance of the hybrid scheme within five percent of full hardware coherence for 10 out of 13 benchmarks, on a 32-core CMP with a shared L2 cache.
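The filtering idea can be illustrated with a minimal software sketch. The abstract does not describe the hardware design, so everything below is an assumption made for illustration: the `BloomFilter` class, its sizes and hash functions, the `self_invalidate` helper, and the notion that a filter summarizes the cache lines written before a synchronization point are all hypothetical and are not taken from the paper.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over cache-line addresses (sizes are illustrative)."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored as a Python int

    def _positions(self, line_addr):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{line_addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, line_addr):
        for pos in self._positions(line_addr):
            self.bits |= 1 << pos

    def may_contain(self, line_addr):
        # True for every added address; may also be True for others
        # (false positives), never False for an added address.
        return all((self.bits >> pos) & 1 for pos in self._positions(line_addr))


def self_invalidate(cached_lines, written_filter=None):
    """Return the set of cached lines that survive a synchronization point.

    Assumption: with no filter (software-only scheme) every cached line is
    invalidated; with a filter, only lines that may have been written
    elsewhere are invalidated.
    """
    if written_filter is None:
        return set()
    return {addr for addr in cached_lines if not written_filter.may_contain(addr)}


# Hypothetical scenario: another core wrote lines 0x100 and 0x140.
written = BloomFilter()
for addr in (0x100, 0x140):
    written.add(addr)

cached = {0x100, 0x200, 0x240}
survivors_software_only = self_invalidate(cached)
survivors_filtered = self_invalidate(cached, written)
```

Because a Bloom filter has no false negatives, a line that was actually written is always invalidated, so filtering never keeps stale data in this sketch. False positives only cost extra invalidations, which is why the filtered scheme can be more selective than invalidating everything without being less safe.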
INDEX TERMS
Multiprocessors, cache coherence.
CITATION
Thomas J. Ashby, Pedro Díaz, Marcelo Cintra, "Software-Based Cache Coherence with Hardware-Assisted Selective Self-Invalidations Using Bloom Filters", IEEE Transactions on Computers, vol.60, no. 4, pp. 472-483, April 2011, doi:10.1109/TC.2010.155