Issue No.05 - May (2012 vol.23)

pp: 944-957

Gwo Giun (Chris) Lee, National Cheng Kung University, Tainan

He-Yuan Lin, National Cheng Kung University, Tainan

Chun-Fu Chen, National Cheng Kung University, Tainan

Tsung-Yuan Huang, National Cheng Kung University, Tainan

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2011.230

ABSTRACT

Degree of parallelism (DoP) is an essential complexity metric that characterizes the number of independent operation sets (IOSs) that can be concurrently executed within an algorithm. This paper presents a generic framework to identify IOSs and to quantify the DoP based on the rank theorem in linear algebra. This framework is applied to extract algorithmic parallelism at various granularities, that is, multigrain parallelism. The parallelism quantified in this way is intrinsic and platform independent, and it provides insight into architectural information, thus facilitating mapping onto generic platforms and early back annotation for modifying algorithms. It plays a significant role in the concurrent optimization of both algorithms and architectures, referred to as Algorithm/Architecture Coexploration (AAC), by trading off between the DoP and the number of operations (NoO). This paper reports three case studies for AAC. The case study on an IDCT shows that our framework accurately quantifies the parallelism for mapping the algorithm onto generic platforms, including FPGA and multicore systems. The IDCT parallelized by our technique surpasses a conventional spectral parallelization. By exploiting fine-grain parallelism, this paper presents a better porting of a discrete wavelet transform (DWT) onto single instruction multiple data (SIMD) machines compared with a commercial compiler. By analyzing the multigrain parallelism, a high-quality deinterlacer is implemented on a low-cost multicore platform for real-time high-definition applications. These case studies demonstrate the effectiveness of our parallelism analysis framework, which is applicable to generic systems. Compared with traditional graph traversal techniques, our linear algebraic approach has low complexity and is practical for complicated algorithms.
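The idea of counting IOSs with the rank theorem can be illustrated with a minimal sketch. It assumes a dataflow graph whose dependencies are encoded as an incidence-style matrix (one row per dependency edge, one column per operation), so that the DoP equals the number of operations minus the rank of that matrix. The matrix construction, the function names, and the example graph are assumptions made for illustration and may differ from the exact formulation used in the paper.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination over exact rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a nonzero pivot in this column, at or below row `rank`.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


def dependency_matrix(n_nodes, edges):
    """One row per edge (src, dst): +1 at src, -1 at dst (assumed encoding)."""
    rows = []
    for src, dst in edges:
        row = [0] * n_nodes
        row[src] += 1
        row[dst] -= 1  # a self-loop yields an all-zero row, adding no rank
        rows.append(row)
    return rows


def degree_of_parallelism(n_nodes, edges):
    """Rank theorem: nullity = columns - rank = number of independent sets."""
    if not edges:
        return n_nodes
    return n_nodes - matrix_rank(dependency_matrix(n_nodes, edges))


# Hypothetical graph: chain 0 -> 1 -> 2, chain 3 -> 4, and isolated node 5.
example_edges = [(0, 1), (1, 2), (3, 4)]
example_dop = degree_of_parallelism(6, example_edges)
```

For this hypothetical graph the three edges are linearly independent, so the rank is 3 and the sketch reports a DoP of 3, matching the three disconnected groups of operations. Exact rational arithmetic is used so that the rank does not depend on a floating-point tolerance.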

INDEX TERMS

Intrinsic parallelism, linear algebra, algorithm/architecture coexploration.

CITATION

Gwo Giun (Chris) Lee, He-Yuan Lin, Chun-Fu Chen, Tsung-Yuan Huang, "Quantifying Intrinsic Parallelism Using Linear Algebra for Algorithm/Architecture Coexploration",

*IEEE Transactions on Parallel & Distributed Systems*, vol. 23, no. 5, pp. 944-957, May 2012, doi:10.1109/TPDS.2011.230