AGAMOS: A Graph-Based Approach to Modulo Scheduling for Clustered Microarchitectures
June 2009 (vol. 58 no. 6)
pp. 770-783
Alex Aletà, UPC, Barcelona
Josep M. Codina, Intel Labs Barcelona, Barcelona
Jesús Sánchez, Intel Labs, Barcelona
Antonio González, Intel Labs Barcelona, Barcelona
David Kaeli, Northeastern University, Boston
This paper presents AGAMOS, a technique for modulo scheduling loops on clustered microarchitectures. The proposed scheme uses a multilevel graph partitioning strategy to distribute the workload among clusters while reducing the number of intercluster communications. Partitioning is guided by approximate schedules (i.e., pseudoschedules), which take into account all of the constraints that influence the final schedule. To further reduce the number of intercluster communications, heuristics for instruction replication are included. The scheme is evaluated on the SPECfp95 programs, where it outperforms a state-of-the-art scheduler for all programs and for different cluster configurations. For some configurations, the speedup obtained with the new scheme is greater than 40 percent, and for selected programs, performance can be more than doubled.
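
The trade-off described in the abstract, balancing workload across clusters while keeping intercluster communications low, can be illustrated with a small sketch. This is not the AGAMOS algorithm: it only scores a given cluster assignment of a data dependence graph. The function name `evaluate_partition`, the instruction names, and the two example assignments are hypothetical, and counting one communication per cut dependence edge is a simplifying assumption (a real scheduler could send a value to a cluster once and reuse it there).

```python
from collections import Counter


def evaluate_partition(edges, cluster_of, num_clusters):
    """Score a cluster assignment of a data dependence graph.

    edges       -- list of (producer, consumer) dependence pairs
    cluster_of  -- dict mapping each instruction to a cluster index
    Returns (cut, load): the number of dependence edges that cross
    clusters, and the number of instructions placed in each cluster.
    """
    # Assumption: every edge whose endpoints sit in different clusters
    # costs one intercluster communication.
    cut = sum(1 for src, dst in edges if cluster_of[src] != cluster_of[dst])
    counts = Counter(cluster_of.values())
    load = [counts.get(c, 0) for c in range(num_clusters)]
    return cut, load


# Hypothetical loop body: st = (ld_a * ld_b) + ld_c
edges = [
    ("ld_a", "mul"), ("ld_b", "mul"),
    ("mul", "add"), ("ld_c", "add"),
    ("add", "st"),
]

# Two assignments with identical workload balance but different cut sizes.
grouped = {"ld_a": 0, "ld_b": 0, "mul": 0, "ld_c": 1, "add": 1, "st": 1}
scattered = {"ld_a": 0, "ld_b": 1, "mul": 0, "ld_c": 1, "add": 0, "st": 1}

grouped_score = evaluate_partition(edges, grouped, 2)
scattered_score = evaluate_partition(edges, scattered, 2)
print(grouped_score)    # (1, [3, 3])
print(scattered_score)  # (3, [3, 3])
```

Both assignments place three instructions in each cluster, yet the first needs one intercluster communication and the second needs three, which is the kind of difference a partitioner that accounts for communication is meant to exploit.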

Index Terms:
Clustered microarchitectures, ILP, instruction replication, modulo scheduling, statically scheduled processors.
Alex Aletà, Josep M. Codina, Jesús Sánchez, Antonio González, David Kaeli, "AGAMOS: A Graph-Based Approach to Modulo Scheduling for Clustered Microarchitectures," IEEE Transactions on Computers, vol. 58, no. 6, pp. 770-783, June 2009, doi:10.1109/TC.2009.32