Issue No. 12 - Dec. 2013 (vol. 62), pp. 2516-2530
Minjang Kim , Georgia Institute of Technology, Atlanta
Nagesh B. Lakshminarayana , Georgia Institute of Technology, Atlanta
Hyesoon Kim , Georgia Institute of Technology, Atlanta
Chi-Keung Luk , Intel, Hudson
ABSTRACT
As multicore processors become mainstream, the need for software tools that help parallelize programs is growing dramatically. Data-dependence profiling is an important program analysis technique for exploiting parallelism in serial programs: manual, semiautomatic, or automatic parallelization can use its results to decide where and how to parallelize a program. However, state-of-the-art data-dependence profiling techniques consume excessive resources when profiling large, long-running applications, suffering from two major problems: 1) runtime overhead and 2) memory overhead. Existing data-dependence profilers are either unable to profile large-scale applications within a typical resource budget or report only very limited information. In this paper, we propose an efficient approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework. Our technique, called SD$^3$, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, we compress memory accesses that exhibit stride patterns and compute data dependences directly in this compressed format. We demonstrate that SD$^3$ reduces the runtime overhead of profiling SPEC 2006 by factors of 4.1× and 9.7× on eight cores and 32 cores, respectively. For the memory overhead, we successfully profile 22 SPEC 2006 benchmarks with the reference input, whereas previous approaches fail even with the train input. In some cases, we observe more than a 20× reduction in memory consumption and a 16× speedup in profiling time when 32 cores are used. We also demonstrate the usefulness of SD$^3$ by showing manual parallelization based on data-dependence profiling results.
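The core memory-saving idea described above, compressing strided memory accesses and testing dependences directly on the compressed form, can be illustrated with a small sketch. The names (`Stride`, `compress`, `may_overlap`) and the run-detection heuristic below are illustrative assumptions, not the paper's actual implementation; the overlap check is a conservative GCD-style test over positive strides.

```python
# Illustrative sketch (not SD^3's real code): compress an address trace into
# stride runs and test two runs for a possible common address.
from dataclasses import dataclass
from math import gcd


@dataclass(frozen=True)
class Stride:
    base: int    # first address in the run
    stride: int  # constant distance between consecutive accesses (0 = singleton)
    count: int   # number of accesses in the run

    @property
    def last(self) -> int:
        return self.base + self.stride * (self.count - 1)


def compress(addrs):
    """Compress an address trace into Stride runs (positive strides only)."""
    runs, i = [], 0
    while i < len(addrs):
        j = i + 1
        if j < len(addrs):
            step = addrs[j] - addrs[i]
            if step > 0:
                while j + 1 < len(addrs) and addrs[j + 1] - addrs[j] == step:
                    j += 1
                if j - i >= 2:  # at least 3 points make a run worth compressing
                    runs.append(Stride(addrs[i], step, j - i + 1))
                    i = j + 1
                    continue
        runs.append(Stride(addrs[i], 0, 1))  # fall back to a singleton
        i += 1
    return runs


def may_overlap(a: Stride, b: Stride) -> bool:
    """Conservative test: could two stride runs touch a common address?"""
    if a.last < b.base or b.last < a.base:
        return False  # address ranges are disjoint
    if a.stride == 0 or b.stride == 0:
        # Singleton vs. run: check membership of the single address.
        s, r = (a, b) if a.stride == 0 else (b, a)
        d = s.base - r.base
        return 0 <= d <= r.last - r.base and d % r.stride == 0
    # a.base + i*a.stride == b.base + j*b.stride has an integer solution
    # only if gcd(a.stride, b.stride) divides the base difference.
    return (b.base - a.base) % gcd(a.stride, b.stride) == 0
```

Because the GCD divisibility condition is necessary but not sufficient within bounded ranges, the test may report spurious overlaps but never misses a real one, which is the safe direction for dependence profiling.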
INDEX TERMS
Runtime, Memory management, Heuristic algorithms, Parallel processing, Resource management, Benchmark testing, parallelization, profiling, data dependence, parallel programming, program analysis, compression
CITATION
Minjang Kim, Nagesh B. Lakshminarayana, Hyesoon Kim, Chi-Keung Luk, "SD$^3$: An Efficient Dynamic Data-Dependence Profiling Mechanism", IEEE Transactions on Computers, vol. 62, no. 12, pp. 2516-2530, Dec. 2013, doi:10.1109/TC.2012.182
REFERENCES
[1] OmpSCR: OpenMP Source Code Repository, http://sourceforge.net/projects/ompscr/, 2013.
[2] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou, "Cilk: An Efficient Multithreaded Runtime System," Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPOPP '95), 1995.
[3] T. Chen, J. Lin, X. Dai, W.-C. Hsu, and P.-C. Yew, "Data Dependence Profiling for Speculative Optimizations," Compiler Construction, vol. 2985, pp. 57-72, 2004.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, third ed. The MIT Press, 2009.
[5] CriticalBlue, Prism: An Analysis Exploration and Verification Environment for Software Implementation and Optimization on Multicore Architectures, http://www.criticalblue.com, 2013.
[6] D. Das and P. Wu, "Experiences of Using a Dependence Profiler to Assist Parallelization for Multi-Cores," Proc. IEEE Int'l Symp. Parallel and Distributed Processing (IPDPS) Workshops, 2010.
[7] Z.-H. Du, C.-C. Lim, X.-F. Li, C. Yang, Q. Zhao, and T.-F. Ngai, "A Cost-Driven Compilation Framework for Speculative Parallelization of Sequential Programs," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '04), 2004.
[8] S. Garcia, D. Jeon, C.M. Louie, and M.B. Taylor, "Kremlin: Rethinking and Rebooting Gprof for the Multicore Age," Proc. 32nd ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '11), 2011.
[9] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.
[10] J. Ha, M. Arnold, S.M. Blackburn, and K.S. McKinley, "A Concurrent Dynamic Analysis Framework for Multicore Hardware," Proc. 24th ACM SIGPLAN Conf. Object Oriented Programming Systems Languages and Applications (OOPSLA '09), 2009.
[11] Intel Corporation, Intel Parallel Advisor, http://software.intel.com/en-us/articles/intel-parallel-advisor/, 2013.
[12] Intel Corporation, Intel Threading Building Blocks, http://www.threadingbuildingblocks.org/, 2013.
[13] M. Kim, H. Kim, and C.-K. Luk, "Prospector: Helping Parallel Programming by a Data-Dependence Profile," Proc. Second USENIX Conf. Hot Topics in Parallelism (HotPar '10), 2010.
[14] M. Kim, H. Kim, and C.-K. Luk, "SD$^3$: A Scalable Approach to Dynamic Data-Dependence Profiling," Proc. IEEE/ACM 43rd Ann. Int'l Symp. Microarchitecture (MICRO '43), 2010.
[15] X. Kong, D. Klappholz, and K. Psarris, "The I Test: An Improved Dependence Test for Automatic Parallelization and Vectorization," IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 3, pp. 342-349, July 1991.
[16] J.R. Larus, "Loop-Level Parallelism in Numeric and Symbolic Programs," IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 7, pp. 812-826, July 1993.
[17] J.R. Larus, "Whole Program Paths," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '99), 1999.
[18] C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," Proc. Int'l Symp. Code Generation and Optimization (CGO '04), 2004.
[19] W. Liu, J. Tuck, L. Ceze, W. Ahn, K. Strauss, J. Renau, and J. Torrellas, "POSH: A TLS Compiler That Exploits Program Structure," Proc. 11th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '06), 2006.
[20] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V.J. Reddi, and K. Hazelwood, "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '05), 2005.
[21] J. Marathe, F. Mueller, T. Mohan, S.A. Mckee, B.R. De Supinski, and A. Yoo, "METRIC: Memory Tracing via Dynamic Binary Rewriting to Identify Cache Inefficiencies," ACM Trans. Programming Languages and Systems, vol. 29, Apr. 2007.
[22] T. Moseley, A. Shye, V.J. Reddi, D. Grunwald, and R. Peri, "Shadow Profiling: Hiding Instrumentation Costs with Parallelism," Proc. Int'l Symp. Code Generation and Optimization (CGO '07), 2007.
[23] S.S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., 1997.
[24] G.D. Price, J. Giacomoni, and M. Vachharajani, "Visualizing Potential Parallelism in Sequential Programs," Proc. 17th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '08), 2008.
[25] S. Rul, H. Vandierendonck, and K. De Bosschere, "A Profile-Based Tool for Finding Pipeline Parallelism in Sequential Programs," Parallel Computing, vol. 36, pp. 531-551, Sept. 2010.
[26] J.G. Steffan, C.B. Colohan, A. Zhai, and T.C. Mowry, "A Scalable Approach to Thread-Level Speculation," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA '00), 2000.
[27] W. Thies, V. Chandrasekhar, and S. Amarasinghe, "A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs," Proc. IEEE/ACM 40th Ann. Int'l Symp. Microarchitecture, 2007.
[28] G. Tournavitis and B. Franke, "Semi-Automatic Extraction and Exploitation of Hierarchical Pipeline Parallelism Using Profiling Information," Proc. 19th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '10), 2010.
[29] G. Tournavitis, Z. Wang, B. Franke, and M.F. O'Boyle, "Towards a Holistic Approach to Auto-Parallelization: Integrating Profile-Driven Parallelism Detection and Machine-Learning Based Mapping," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '09), 2009.
[30] Vector Fabrics, Pareon: Optimize Applications for Multicore Phones, Tablets and x86, http://www.vectorfabrics.com/.
[31] C. von Praun, R. Bordawekar, and C. Cascaval, "Modeling Optimistic Concurrency Using Quantitative Dependence Analysis," Proc. 13th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '08), 2008.
[32] S. Wallace and K. Hazelwood, "SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance," Proc. Int'l Symp. Code Generation and Optimization (CGO '07), 2007.
[33] P. Wu, A. Kejariwal, and C. Caşcaval, "Compiler-Driven Dependence Profiling to Guide Program Parallelization," Languages and Compilers for Parallel Computing, 2008.
[34] X. Zhang and R. Gupta, "Whole Execution Traces," Proc. IEEE/ACM 37th Ann. Int'l Symp. Microarchitecture, 2004.
[35] X. Zhang, A. Navabi, and S. Jagannathan, "Alchemist: A Transparent Dependence Distance Profiling Infrastructure," Proc. IEEE/ACM Seventh Ann. Int'l Symp. Code Generation and Optimization (CGO '09), 2009.
[36] Q. Zhao, I. Cutcutache, and W.-F. Wong, "PiPA: Pipelined Profiling and Analysis on Multi-Core Systems," Proc. IEEE/ACM Sixth Ann. Int'l Symp. Code Generation and Optimization (CGO '08), 2008.