Issue No.12 - Dec. (2012 vol.23)
pp: 2266-2279
Oreste Villa , Pacific Northwest National Laboratory, Richland
Antonino Tumeo , Pacific Northwest National Laboratory, Richland
Simone Secchi , Pacific Northwest National Laboratory, Richland
Joseph B. Manzano , Pacific Northwest National Laboratory, Richland
ABSTRACT
Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2, and XMT, appear to address irregular application requirements better than commodity clusters. However, research on massively multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures, due to issues such as machine size, memory footprint, simulation speed, accuracy, and customization. At the same time, Shared Memory MultiProcessors (SMPs) with multicore processors have become an attractive platform to simulate large-scale systems. This paper introduces a cycle-level simulator of the massively multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques implemented to obtain high simulation speed while maintaining high accuracy. Because XMT processors (ThreadStorm, with 128 hardware threads each) are mapped to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at runtime and includes a parametric network and memory model that takes into account contention and hot spotting. On a modern 48-core SMP host, the proposed infrastructure simulates a large set of irregular applications 500 to 2,000 times slower than real time when compared to a 128-processor XMT, with an accuracy error under 10 percent. Emulation is only 25 to 200 times slower than real time.
The paper also presents a case study, where the simulation infrastructure is used to identify bottlenecks in the current XMT architecture and to estimate the performance scaling of a possible multicore design with next generation memory and network interconnect.
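To put the quoted slowdown factors in concrete terms, the short sketch below converts a stretch of real XMT execution time into the host wall-clock time needed to reproduce it. The function and constant names are hypothetical and used for illustration only; the only figures taken from the abstract are the two slowdown ranges (500 to 2,000 for simulation, 25 to 200 for emulation). It assumes the slowdown is a simple multiplicative factor.

```python
# Slowdown ranges quoted in the abstract (128-processor XMT, 48-core SMP host).
# The names below are illustrative and do not come from the paper.
SIMULATION_SLOWDOWN = (500, 2000)
EMULATION_SLOWDOWN = (25, 200)


def host_seconds(xmt_seconds, slowdown):
    """Host wall-clock seconds needed to cover xmt_seconds of real XMT execution,
    assuming the slowdown acts as a plain multiplicative factor."""
    return xmt_seconds * slowdown


def host_time_range(xmt_seconds, slowdown_range):
    """Return the (best case, worst case) host time for a given slowdown range."""
    low, high = slowdown_range
    return host_seconds(xmt_seconds, low), host_seconds(xmt_seconds, high)


if __name__ == "__main__":
    for label, rng in (("simulation", SIMULATION_SLOWDOWN),
                       ("emulation", EMULATION_SLOWDOWN)):
        best, worst = host_time_range(1.0, rng)
        print(f"1 s of XMT time under {label}: {best:.0f} s to {worst:.0f} s on the host")
```

Under these assumptions, one second of execution on the real machine corresponds to roughly 8 to 33 minutes of cycle-level simulation, or 25 to 200 seconds of emulation.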
INDEX TERMS
Instruction sets, Multiprocessing systems, Multithreading, Synchronization, Multicore processing, Computational modeling, simulation of multiple-processor systems, Modeling of computer architecture, multithreaded processors, system architectures, integration and modeling, measurement, evaluation, modeling
CITATION
Oreste Villa, Antonino Tumeo, Simone Secchi, Joseph B. Manzano, "Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 12, pp. 2266-2279, Dec. 2012, doi:10.1109/TPDS.2012.70