Issue No.10 - October (2008 vol.19)
pp: 1381-1395
Multi-core processors are a paradigm shift in computer architecture that promises a dramatic increase in performance, but they also bring an unprecedented level of complexity to algorithmic design and software development. In this paper, we describe the challenges involved in designing a Breadth-First Search (BFS) algorithm for the Cell/B.E. processor. The proposed methodology combines a high-level algorithmic design, which captures the machine-independent aspects and thereby keeps the algorithm portable and performant on future processors, with an implementation that embeds processor-specific optimizations. Using a fine-grained global coordination strategy derived from the Bulk-Synchronous Parallel (BSP) model, we have determined an accurate performance model that has guided the implementation and optimization of our algorithm. Our experiments on a pre-production Cell/B.E. board running at 3.2 GHz show almost linear speedups when using multiple synergistic processing elements, and an impressive level of performance when compared to other processors. On graphs that offer sufficient parallelism, the Cell/B.E. is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, and than custom-designed architectures, such as the MTA-2 and BlueGene/L.
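To make the BSP-style coordination mentioned in the abstract concrete, the following is a minimal sketch of a level-synchronous BFS organized into supersteps. It is not the authors' Cell/B.E. implementation: the function name `bsp_bfs`, the `num_workers` parameter, and the modulo-based vertex ownership are all assumptions made for illustration, and the "workers" are simulated sequentially rather than mapped to synergistic processing elements.

```python
from collections import defaultdict


def bsp_bfs(adj, root, num_workers=4):
    """Level-synchronous BFS written as BSP supersteps (illustrative only).

    adj: dict mapping an integer vertex to a list of neighbor vertices.
    Returns a dict mapping each reachable vertex to its BFS level.
    """
    # Assumption: vertices are integers, statically partitioned by modulo.
    def owner(v):
        return v % num_workers

    level = {root: 0}
    frontiers = [[] for _ in range(num_workers)]
    frontiers[owner(root)].append(root)
    depth = 0

    while any(frontiers):
        # Superstep phase 1 (local computation): each worker expands its own
        # frontier and buckets the discovered neighbors by destination owner.
        outboxes = [defaultdict(list) for _ in range(num_workers)]
        for w in range(num_workers):
            for v in frontiers[w]:
                for u in adj.get(v, ()):
                    outboxes[w][owner(u)].append(u)

        # Superstep phase 2 (communication, then barrier): deliver each
        # bucket to the worker that owns those vertices.
        inboxes = [[] for _ in range(num_workers)]
        for w in range(num_workers):
            for dest, verts in outboxes[w].items():
                inboxes[dest].extend(verts)

        # Superstep phase 3 (local computation): each worker marks the
        # unvisited vertices it owns; these form the next frontier.
        depth += 1
        frontiers = [[] for _ in range(num_workers)]
        for w in range(num_workers):
            for u in inboxes[w]:
                if u not in level:
                    level[u] = depth
                    frontiers[w].append(u)

    return level


# Small example graph: 0 -> {1, 2}, 1 -> 3, 2 -> 3, 3 -> 4; vertex 5 is isolated.
example = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: [], 5: []}
levels = bsp_bfs(example, root=0)
```

Because every worker only reads its own frontier and only writes vertices it owns, the phases can in principle run in parallel between barriers, which is the property a BSP-style cost model relies on.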
Performance of Systems, Emerging technologies, Communication/Networking and Information Technology
Daniele Paolo Scarpazza, Oreste Villa, Fabrizio Petrini, "Efficient Breadth-First Search on the Cell/BE Processor", IEEE Transactions on Parallel & Distributed Systems, vol.19, no. 10, pp. 1381-1395, October 2008, doi:10.1109/TPDS.2007.70811