A Software Approach to Avoiding Spatial Cache Collisions in Parallel Processor Systems
June 1998 (vol. 9 no. 6)
pp. 601-608

Abstract—In parallel processor systems, the performance of individual processors is a key factor in overall performance. Processor performance is strongly affected by the behavior of cache memory in that high hit rates are essential for high performance. Hit rates are lowered when collisions on placing lines in the cache force a cache line to be replaced before it has been used to best effect. Spatial cache collisions occur if data structures and data access patterns are misaligned. We describe a mathematical scheme to improve alignment and enhance performance in applications which have moderate-to-large numbers of arrays, where various dimensionalities are involved in localized computation and array access patterns are sequential. These properties are common in many computational modeling applications. Furthermore, the scheme provides a single solution when an application is targeted to run on various numbers of processors in power-of-two sizes. The applicability of the proposed scheme is demonstrated on testbed code for an air quality modeling problem.
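
To make the idea of a spatial cache collision concrete, the following is a minimal sketch and not the scheme described in the paper. It assumes a direct-mapped cache of 8 KB with 32-byte lines (illustrative parameters) and arrays laid out back to back in memory. The function names and the fixed per-array padding are hypothetical, chosen only to show how array base addresses can map to the same cache line and how an offset can separate them.

```python
def cache_line_index(addr, cache_bytes=8192, line_bytes=32):
    """Slot that a byte address maps to in a direct-mapped cache.

    cache_bytes and line_bytes are illustrative assumptions.
    """
    return (addr // line_bytes) % (cache_bytes // line_bytes)


def base_addresses(n_arrays, array_bytes, pad_bytes=0, start=0):
    """Base addresses of arrays stored contiguously, each followed by pad_bytes."""
    return [start + i * (array_bytes + pad_bytes) for i in range(n_arrays)]


def colliding_pairs(bases):
    """Count pairs of arrays whose first elements share a cache slot."""
    slots = [cache_line_index(b) for b in bases]
    return sum(
        1
        for i in range(len(slots))
        for j in range(i + 1, len(slots))
        if slots[i] == slots[j]
    )


# Four arrays whose size equals the cache size: every base maps to slot 0,
# so a loop touching A(i), B(i), C(i), D(i) evicts a line on every access.
unpadded = base_addresses(4, 8192)
# Hypothetical fix: shift each successive array by one cache line.
padded = base_addresses(4, 8192, pad_bytes=32)

print(colliding_pairs(unpadded), colliding_pairs(padded))
```

With sequential access, arrays whose bases share a slot keep colliding as the loop index advances, because all of them move through the cache in step. Shifting the bases apart by at least one line removes that lockstep conflict in this toy layout; choosing offsets that work for many arrays and several processor counts is the harder problem the abstract refers to.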

Index Terms:
Cache collision, cache offset, direct-mapped cache, highly parallel systems, sequential DO-loops.
David C. Wong, Edward W. Davis, Jeffrey O. Young, "A Software Approach to Avoiding Spatial Cache Collisions in Parallel Processor Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 6, pp. 601-608, June 1998, doi:10.1109/71.689447