Issue No.12 - Dec. (2012 vol.23)
pp: 2205-2218
Luís Fabrício Wanderley Góes, University of Edinburgh, Edinburgh
Nikolas Ioannou, University of Edinburgh, Edinburgh
Polychronis Xekalakis, Intel Barcelona Research Center, Barcelona
Murray Cole, University of Edinburgh, Edinburgh
Marcelo Cintra, University of Edinburgh, Edinburgh
Skeleton or pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. In addition to simplifying the programming task, such well-structured programs are amenable to performance optimizations both during code generation and at runtime. In this paper, we present a new skeleton framework that transparently selects and applies performance optimizations in transactional worklist applications. Using a novel hierarchical autotuning mechanism, it dynamically selects the most suitable set of optimizations for each application and adjusts them accordingly. Our experimental results on the STAMP benchmark suite show that our skeleton autotuning framework can achieve performance improvements of up to 88 percent, with an average of 46 percent, over a baseline version for a 16-core system and up to 115 percent, with an average of 56 percent, for a 32-core system. These performance improvements match or even exceed those obtained by a static exhaustive search of the optimization space.
Skeleton programming, Optimization, Prefetching, Runtime, Parallel programming, Concurrent computing, parallel patterns and application-transparent adaptation, Concurrent programming, transactional memory
Luís Fabrício Wanderley Góes, Nikolas Ioannou, Polychronis Xekalakis, Murray Cole, Marcelo Cintra, "Autotuning Skeleton-Driven Optimizations for Transactional Worklist Applications", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 12, pp. 2205-2218, Dec. 2012, doi:10.1109/TPDS.2012.140