Issue No.12 - December (2010 vol.21)
pp: 1779-1792
Alberto Ros, Universidad de Murcia, Murcia
Manuel E. Acacio, Universidad de Murcia, Murcia
José M. García, Universidad de Murcia, Murcia
ABSTRACT
Future many-core CMP designs that integrate tens of processor cores on-chip will be constrained by area and power. Area constraints make it impractical to use a bus or a crossbar as the on-chip interconnection network, so tiled CMPs organized around a direct interconnection network will probably be the architecture of choice. Power constraints make it impractical to rely on broadcasts (as, for example, Token-CMP does) or any other brute-force method for keeping caches coherent, and directory-based cache coherence protocols are currently being employed instead. Unfortunately, directory protocols introduce indirection to access directory information, which negatively impacts performance. In this work, we present DiCo-CMP, a novel cache coherence protocol especially suited to future many-core tiled CMP architectures. In DiCo-CMP, the task of storing up-to-date sharing information and ensuring ordered accesses for every memory block is assigned to the cache that must provide the block on a miss. DiCo-CMP therefore reduces the miss latency compared to a directory protocol by sending requests directly to the cache that provides the block on a cache miss. These latency reductions result in improvements in execution time of up to 6 percent, on average, over a directory protocol. In comparison with Token-CMP, our protocol sends only one request message for each cache miss and is therefore able to reduce network traffic by 43 percent.
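The indirection cost described in the abstract can be illustrated with a minimal sketch. It is not the DiCo-CMP protocol itself: it only counts critical-path hops on a 2D mesh for a miss that goes through a home (directory) tile versus one sent directly to the cache that provides the block. The mesh layout, the Manhattan-distance hop model, the assumption that the requester already knows the providing cache, and all function names are hypothetical and introduced only for illustration.

```python
# Illustrative sketch only: a hop-count model, not the protocol from the paper.
# Assumptions (not from the source): tiles are numbered row-major on a 2D mesh,
# each message costs the Manhattan distance between tiles, and the requester
# already knows which cache will provide the block.

def mesh_hops(a, b, width):
    """Manhattan distance between tiles a and b on a mesh `width` tiles wide."""
    row_a, col_a = divmod(a, width)
    row_b, col_b = divmod(b, width)
    return abs(row_a - row_b) + abs(col_a - col_b)

def directory_miss_hops(requester, home, owner, width):
    """Requester -> home tile (directory lookup) -> owner -> requester."""
    return (mesh_hops(requester, home, width)
            + mesh_hops(home, owner, width)
            + mesh_hops(owner, requester, width))

def direct_miss_hops(requester, owner, width):
    """Requester -> owner -> requester, with no visit to a home tile."""
    return 2 * mesh_hops(requester, owner, width)

def request_messages(protocol, num_tiles):
    """Request messages per miss; broadcast is simplified to one per other tile."""
    return num_tiles - 1 if protocol == "broadcast" else 1

if __name__ == "__main__":
    width = 4  # hypothetical 4x4 tiled CMP
    print(directory_miss_hops(requester=0, home=15, owner=1, width=width))
    print(direct_miss_hops(requester=0, owner=1, width=width))
    print(request_messages("broadcast", width * width),
          request_messages("direct", width * width))
```

In this toy configuration the block is held by a neighbouring tile while its home tile sits at the opposite corner, so the detour through the home tile dominates the miss path. Real latency and traffic also depend on cache access times, contention, and how the providing cache is located, none of which this sketch models.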
INDEX TERMS
Many-core CMP, cache coherence protocol, direct coherence, indirection problem, on-chip network traffic.
CITATION
Alberto Ros, Manuel E. Acacio, José M. García, "A Direct Coherence Protocol for Many-Core Chip Multiprocessors", IEEE Transactions on Parallel & Distributed Systems, vol. 21, no. 12, pp. 1779-1792, December 2010, doi:10.1109/TPDS.2010.43