Optimizing Overall Loop Schedules Using Prefetching and Partitioning
June 2000 (vol. 11 no. 6)
pp. 604-614

Abstract: This paper proposes Partition Scheduling with Prefetching (PSP), a method that combines loop pipelining with data prefetching. In PSP, the iteration space is first divided into regular partitions. A two-part schedule, consisting of an ALU part and a memory part, is then produced and balanced to achieve high throughput. The two parts execute simultaneously, so remote memory latencies are overlapped with computation. We study the optimal partition shape and size so that a well-balanced overall schedule can be obtained. Experiments on DSP benchmarks show that the proposed methodology consistently produces optimal or near-optimal solutions.
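
The overlap idea in the abstract can be illustrated with a toy timing model. The sketch below is not the PSP algorithm from the paper; it assumes rectangular partitions of a two-dimensional iteration space and a fixed ALU time and memory (prefetch) time per partition. The function names and the cost model are hypothetical and chosen only for illustration.

```python
from math import ceil


def partition_counts(rows, cols, tile_rows, tile_cols):
    """Number of rectangular partitions along each dimension of a
    rows x cols iteration space (assumed shape, for illustration only)."""
    return ceil(rows / tile_rows), ceil(cols / tile_cols)


def serial_length(num_partitions, alu_time, mem_time):
    """Schedule length if each partition is fetched and then computed,
    with no overlap between memory and ALU work."""
    return num_partitions * (alu_time + mem_time)


def overlapped_length(num_partitions, alu_time, mem_time):
    """Schedule length if the data for partition k+1 is prefetched while
    the ALU works on partition k (toy model, not the paper's schedule)."""
    if num_partitions == 0:
        return 0
    # Fetch the first partition, overlap the middle steps, finish the last ALU part.
    steady_state = (num_partitions - 1) * max(alu_time, mem_time)
    return mem_time + steady_state + alu_time


if __name__ == "__main__":
    n_rows, n_cols = partition_counts(8, 8, 4, 4)
    n = n_rows * n_cols
    print("partitions:", n)
    print("serial:", serial_length(n, alu_time=10, mem_time=10))
    print("overlapped:", overlapped_length(n, alu_time=10, mem_time=10))
```

In this model each overlapped step costs the larger of the two times, so the overlap hides the most latency when the ALU and memory parts of a partition take about the same time. This is a simplified reading of why the abstract stresses a balanced schedule and the choice of partition shape and size.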

Index Terms:
Prefetching, retiming, scheduling, partitioning, latency-hiding.
Fei Chen, Timothy W. O'Neil, Edwin H.-M. Sha, "Optimizing Overall Loop Schedules Using Prefetching and Partitioning," IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 6, pp. 604-614, June 2000, doi:10.1109/71.862210