Optimizing Graph Algorithms for Improved Cache Performance
September 2004 (vol. 15 no. 9)
pp. 769-782

Abstract: In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. First, we present a cache-oblivious implementation of the Floyd-Warshall algorithm for the all-pairs shortest paths problem, obtained by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of $\Omega(N^3/\sqrt{C})$, where $N$ and $C$ are the problem size and cache size, respectively. Experimental results show that this cache-oblivious implementation achieves more than a sixfold improvement in real execution time over the iterative implementation with the usual row-major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for the minimum spanning tree problem. For these algorithms, we demonstrate up to a twofold improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of two to three times in real execution time by first having the algorithm work on subproblems to generate a suboptimal solution and then solving the whole problem using that suboptimal solution as a starting point. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines.
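
To make the cache-oblivious idea concrete, the following is a minimal Python sketch of a recursive Floyd-Warshall formulation in the spirit described in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function names (`floyd_warshall_recursive`, `_fwr`), the padding of the matrix to a power-of-two size, and the choice to recurse all the way down to single elements are assumptions made here for brevity. A practical version would likely stop at a larger base block and use a cache-friendly data layout.

```python
import math

INF = math.inf


def _fwr(d, ai, aj, bi, bj, ci, cj, n):
    """Relax block A of d through blocks B and C (each n x n, n a power of two).

    (ai, aj), (bi, bj), (ci, cj) are the top-left corners of A, B, C.
    For every element: A[i][j] = min(A[i][j], B[i][k] + C[k][j]).
    """
    if n == 1:
        # Base case (assumed here): a single relaxation step.
        d[ai][aj] = min(d[ai][aj], d[bi][bj] + d[ci][cj])
        return
    h = n // 2
    # First half of the intermediate vertices (B columns / C rows in the first half).
    _fwr(d, ai,     aj,     bi,     bj, ci, cj,     h)  # A11 <- B11, C11
    _fwr(d, ai,     aj + h, bi,     bj, ci, cj + h, h)  # A12 <- B11, C12
    _fwr(d, ai + h, aj,     bi + h, bj, ci, cj,     h)  # A21 <- B21, C11
    _fwr(d, ai + h, aj + h, bi + h, bj, ci, cj + h, h)  # A22 <- B21, C12
    # Second half of the intermediate vertices, visited in reverse block order.
    _fwr(d, ai + h, aj + h, bi + h, bj + h, ci + h, cj + h, h)  # A22 <- B22, C22
    _fwr(d, ai + h, aj,     bi + h, bj + h, ci + h, cj,     h)  # A21 <- B22, C21
    _fwr(d, ai,     aj + h, bi,     bj + h, ci + h, cj + h, h)  # A12 <- B12, C22
    _fwr(d, ai,     aj,     bi,     bj + h, ci + h, cj,     h)  # A11 <- B12, C21


def floyd_warshall_recursive(dist):
    """Return all-pairs shortest path lengths for a square weight matrix.

    dist[i][j] is the edge weight from i to j (INF if absent, 0 on the diagonal).
    The input is copied into a matrix padded to a power-of-two size; the padded
    vertices are isolated, so they do not change any real distance.
    """
    n = len(dist)
    size = 1
    while size < n:
        size *= 2
    d = [[INF] * size for _ in range(size)]
    for i in range(size):
        d[i][i] = 0
    for i in range(n):
        for j in range(n):
            d[i][j] = dist[i][j]
    _fwr(d, 0, 0, 0, 0, 0, 0, size)
    return [row[:n] for row in d[:n]]
```

Calling `floyd_warshall_recursive(w)` on a weight matrix `w` returns a new matrix of shortest path lengths and leaves `w` unchanged. The recursion never refers to the cache size, which is what makes the formulation cache-oblivious: each level of the recursion works on blocks half the size of the previous one, so at some depth the blocks fit in cache regardless of its capacity.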

Index Terms:
Cache-friendly algorithms, cache-oblivious algorithms, graph algorithms, shortest path, minimum spanning trees, graph matching, data layout optimizations, algorithm performance.
Joon-Sang Park, Michael Penner, Viktor K. Prasanna, "Optimizing Graph Algorithms for Improved Cache Performance," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 9, pp. 769-782, Sept. 2004, doi:10.1109/TPDS.2004.44