Retargeting Sequential Image-Processing Programs for Data Parallel Execution
February 2005 (vol. 31 no. 2)
pp. 116-136
Lewis B. Baumstark, Jr., IEEE Computer Society
Linda M. Wills
New compact, low-power implementation technologies for processors and imaging arrays can enable a new generation of portable video products. However, software compatibility with large bodies of existing applications written in C prevents more efficient, higher performance data parallel architectures from being used in these embedded products. If this software could be automatically retargeted explicitly for data parallel execution, product designers could incorporate these architectures into embedded products. The key challenge is exposing the parallelism that is inherent in these applications but that is obscured by artifacts imposed by sequential programming languages. This paper presents a recognition-based approach for automatically extracting a data parallel program model from sequential image processing code and retargeting it to data parallel execution mechanisms. The explicitly parallel model presented, called multidimensional data flow (MDDF), captures a model of how operations on data regions (e.g., rows, columns, and tiled blocks) are composed and interact. To extract an MDDF model, a partial recognition technique is used that focuses on identifying array access patterns in loops, transforming only those program elements that hinder parallelization, while leaving the core algorithmic computations intact. The paper presents results of retargeting a set of production programs to a representative data parallel processor array to demonstrate the capacity to extract parallelism using this technique. The retargeted applications yield a potential execution throughput limited only by the number of processing elements, exceeding thousands of instructions per cycle in massively parallel implementations.
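To make the "artifacts imposed by sequential programming languages" concrete, here is a minimal hand-written sketch in C. It is not the paper's MDDF representation or the output of its tool. The 3x3 mean filter and the names `box3_sequential`, `box3_at`, and `box3_per_pixel` are assumptions chosen for illustration. The first function walks the image with running pointers, which creates a loop-carried dependence that hides the fact that each output pixel depends only on a small input region. The second expresses the same computation as a per-pixel function of (row, column), leaving the arithmetic itself unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Sequential style: running pointers p and q carry state from one
 * iteration to the next, so the loop looks inherently ordered even
 * though each output pixel only needs its own 3x3 neighborhood. */
void box3_sequential(const uint8_t *in, uint8_t *out, int w, int h)
{
    assert(w >= 3 && h >= 3);
    const uint8_t *p = in + w + 1;   /* first interior pixel */
    uint8_t *q = out + w + 1;
    for (int r = 1; r < h - 1; r++) {
        for (int c = 1; c < w - 1; c++) {
            int sum = p[-w - 1] + p[-w] + p[-w + 1]
                    + p[-1]     + p[0]  + p[1]
                    + p[w - 1]  + p[w]  + p[w + 1];
            *q++ = (uint8_t)(sum / 9);
            p++;
        }
        p += 2;                      /* skip right and left border pixels */
        q += 2;
    }
}

/* Region-based style: the result at (r, c) is a pure function of the
 * input neighborhood around (r, c); no state is shared across pixels. */
static uint8_t box3_at(const uint8_t *in, int w, int r, int c)
{
    int sum = 0;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++)
            sum += in[(r + dr) * w + (c + dc)];
    return (uint8_t)(sum / 9);
}

/* Every (r, c) iteration is independent, so in principle each one could
 * be assigned to a separate processing element of a data parallel array. */
void box3_per_pixel(const uint8_t *in, uint8_t *out, int w, int h)
{
    assert(w >= 3 && h >= 3);
    for (int r = 1; r < h - 1; r++)
        for (int c = 1; c < w - 1; c++)
            out[r * w + c] = box3_at(in, w, r, c);
}
```

Both functions write only interior pixels and produce identical results. The difference is that the second form makes the access pattern explicit: output (r, c) reads rows r-1 to r+1 and columns c-1 to c+1. A recognition-based approach of the kind the abstract describes would need to recover that pattern from code written in the first form.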

Index Terms:
Reengineering, SIMD processors, data-level parallelization, explicitly parallel program representation, program recognition.
Lewis B. Baumstark, Jr., Linda M. Wills, "Retargeting Sequential Image-Processing Programs for Data Parallel Execution," IEEE Transactions on Software Engineering, vol. 31, no. 2, pp. 116-136, Feb. 2005, doi:10.1109/TSE.2005.26