Issue No.04 - April (2011 vol.60)
pp: 526-537
Lei Jin, University of Pittsburgh, Pittsburgh, PA
Sangyeun Cho, University of Pittsburgh, Pittsburgh, PA
This paper presents a study on macro data load, a novel mechanism to increase the amount of loaded data reuse within a processor. A macro data load brings into the processor the widest data item that the cache port allows. In a 64-bit processor, for example, a byte load brings a full 64-bit data item from the cache and saves it in an internal hardware structure, while using for itself only the specified byte. Later loads can opportunistically reuse the saved data inside the processor, reducing the number of relatively expensive cache accesses. We present a comprehensive availability study using a generalized memory data reuse table (MDRT) to quantify the memory data reuse opportunities available in a set of benchmark programs drawn from the SPEC2k and MiBench suites, and to demonstrate the efficacy of the proposed scheme. The macro data load mechanism is shown to open up significantly more loaded data reuse opportunities than previous schemes, which do not exploit spatial locality. We observe 15.1 percent (SPEC2k integer), 20.9 percent (SPEC2k floating-point), and 45.8 percent (MiBench) more load-to-load forwarding instances when a 256-entry MDRT is used. We also describe a modified load store queue design as a possible implementation of the proposed concept. Our quantitative study using a realistic processor model shows that 21.3 percent, 14.8 percent, and 23.6 percent of L1 cache accesses in the SPEC2k integer, SPEC2k floating-point, and MiBench programs, respectively, can be eliminated, resulting in average cache-related energy reductions of 11.4 percent, 9.0 percent, and 14.3 percent.
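The byte-load example in the abstract can be illustrated with a small software model. The sketch below is not the paper's MDRT or its load store queue design. It is a toy table under stated assumptions: the class name `MacroLoadTable`, the LRU replacement policy, and the rule that a store simply drops the saved copy are all hypothetical choices made for illustration. Only the 64-bit port width and the 256-entry size come from the abstract.

```python
from collections import OrderedDict

WORD_BYTES = 8  # 64-bit cache port, as in the abstract's example


class MacroLoadTable:
    """Toy model of macro data load (hypothetical; not the paper's design)."""

    def __init__(self, memory: bytes, entries: int = 256):
        self.memory = bytearray(memory)   # stands in for the L1 data cache
        self.entries = entries            # table capacity
        self.table = OrderedDict()        # aligned address -> saved 8 bytes
        self.cache_accesses = 0           # loads that had to read the cache
        self.forwarded = 0                # loads served from the table

    def load(self, addr: int, size: int) -> bytes:
        base = addr - addr % WORD_BYTES
        offset = addr - base
        assert offset + size <= WORD_BYTES, "load crosses a 64-bit boundary"
        if base in self.table:
            # Reuse: an earlier load already brought in the full word.
            self.table.move_to_end(base)
            self.forwarded += 1
        else:
            # Macro load: fetch the full word even for a narrow request.
            self.cache_accesses += 1
            self.table[base] = bytes(self.memory[base:base + WORD_BYTES])
            if len(self.table) > self.entries:
                self.table.popitem(last=False)  # assumed LRU eviction
        return self.table[base][offset:offset + size]

    def store(self, addr: int, data: bytes) -> None:
        base = addr - addr % WORD_BYTES
        assert addr - base + len(data) <= WORD_BYTES, "store crosses a boundary"
        self.memory[addr:addr + len(data)] = data
        self.table.pop(base, None)  # assumption: a store drops the saved copy


if __name__ == "__main__":
    model = MacroLoadTable(bytes(range(64)))
    for address in range(16):          # sixteen consecutive byte loads
        model.load(address, 1)
    print(model.cache_accesses, model.forwarded)  # prints "2 14"
```

In this model, sixteen consecutive byte loads touch only two aligned 64-bit words, so two loads read the cache and the other fourteen are served from the table. A table keyed on the exact address and size of each load would forward none of them, which is the spatial-locality gap the abstract describes.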
Superscalar processor architecture, cache memory, load store queue, low-power processor design.
Lei Jin, Sangyeun Cho, "Macro Data Load: An Efficient Mechanism for Enhancing Loaded Data Reuse," IEEE Transactions on Computers, vol. 60, no. 4, pp. 526-537, April 2011, doi:10.1109/TC.2010.131