Issue No.11 - Nov. (2012 vol.23)
pp: 2058-2066
Yong Li, Comput. Eng. Program, Univ. of Pittsburgh, Pittsburgh, PA, USA
A. Abousamra, Dept. of Comput. Sci., Univ. of Pittsburgh, Pittsburgh, PA, USA
R. Melhem, Dept. of Comput. Sci., Univ. of Pittsburgh, Pittsburgh, PA, USA
A. K. Jones, Electr. & Comput. Eng., Univ. of Pittsburgh, Pittsburgh, PA, USA
Data access latency, a limiting factor in the performance of chip multiprocessors, grows significantly with the number of cores in nonuniform cache architectures with distributed cache banks. To mitigate this effect, we use a compiler-based approach to leverage data access locality, choose an optimized data placement, and efficiently configure the on-chip network. The proposed experimental compiler framework employs novel compilation techniques to discover and represent multithreaded memory access patterns (MMAPs). At runtime, symbolic MMAPs are resolved and used by a partitioning algorithm to partition the allocated memory blocks among the forked threads of the analyzed application. This partition is used to enforce data ownership by associating each block of data with the core that executes the thread owning it. The communication pattern of the application can then be extracted from the partition. We demonstrate how this information can be used in an experimental architecture to accelerate applications. In particular, our compiler-assisted data partitioning approach shows a 20 percent speedup over shared caching and a 5 percent speedup over the closest runtime approximation, first touch. By leveraging the communication pattern, we achieve performance comparable to that of a system that uses a complex centralized network configuration system at runtime. Thus, our final system avoids significant runtime complexity and achieves an additional 5.1 percent speedup through the addition of the reconfigurable network.
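The abstract does not say how the partitioning algorithm works, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the resolved MMAPs can be summarized as per-thread access counts for each memory block, assigns each block to the thread that accesses it most (a simple majority-owner heuristic, not the authors' algorithm), and then derives a thread-to-thread communication pattern from the accesses that fall on blocks owned by another thread. The names `partition_blocks`, `communication_pattern`, and the sample data are hypothetical.

```python
from collections import defaultdict


def partition_blocks(access_counts):
    """Assign each block to one owning thread.

    access_counts: {block_id: {thread_id: access_count}}
    Heuristic (an assumption, not the paper's method): the owner is the
    thread with the most accesses; ties go to the lowest thread id.
    """
    owners = {}
    for block, per_thread in access_counts.items():
        owners[block] = min(per_thread, key=lambda t: (-per_thread[t], t))
    return owners


def communication_pattern(access_counts, owners):
    """Count accesses each thread makes to blocks owned by another thread.

    Returns {(requesting_thread, owning_thread): access_count}. With one
    thread per core, these pairs indicate which core pairs would exchange
    the most traffic on the on-chip network.
    """
    traffic = defaultdict(int)
    for block, per_thread in access_counts.items():
        owner = owners[block]
        for thread, count in per_thread.items():
            if thread != owner:
                traffic[(thread, owner)] += count
    return dict(traffic)


# Hypothetical access summary for three memory blocks and four threads.
access_counts = {
    "A0": {0: 90, 1: 10},
    "A1": {0: 5, 1: 80, 2: 15},
    "B0": {2: 40, 3: 40},
}

owners = partition_blocks(access_counts)
traffic = communication_pattern(access_counts, owners)
```

In this toy input, the heaviest remote pair is thread 3 reading data owned by thread 2, which is the kind of pair a reconfigurable network could be set up in advance to favor.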
Arrays, Instruction sets, Runtime, Benchmark testing, data partition, Circuit switching, network-on-chip, communication, data access pattern
Yong Li, A. Abousamra, R. Melhem, A. K. Jones, "Compiler-Assisted Data Distribution and Network Configuration for Chip Multiprocessors", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 11, pp. 2058-2066, Nov. 2012, doi:10.1109/TPDS.2011.279