Workload-Cognizant Concurrent Error Detection in the Scheduler of a Modern Microprocessor
September 2011 (vol. 60 no. 9)
pp. 1274-1287
Naghmeh Karimi, University of Tehran, Tehran
Michail Maniatakos, Yale University, New Haven
Abhijit Jas, Intel Corporation, Austin
Chandrasekharan (Chandra) Tirumurti, Intel Corporation, Santa Clara
Yiorgos Makris, Yale University, New Haven
We present a Concurrent Error Detection (CED) scheme for the Scheduler of a modern microprocessor. The proposed CED scheme is based on monitoring a set of invariances imposed through added hardware, violation of which signifies the occurrence of an error. The novelty of our solution stems from the workload-cognizant way in which these invariances are selected so that they leverage the application-level error masking inherent in program execution. Specifically, in order to ensure cost-effectiveness of the hardware employed to construct these invariances, we make use of information regarding the type and frequency of errors affecting the typical workload of the microprocessor. Thereby, we identify the most susceptible aspects of instruction execution and we accordingly distribute CED resources to protect them. Our approach is demonstrated on the Scheduler of an Alpha-like superscalar microprocessor with dynamic scheduling, hybrid branch prediction and out-of-order execution capabilities. Using an extensive fault-simulation infrastructure that we developed around this microprocessor, we profile the impact of Scheduler faults across a variety of SPEC2000 benchmarks. Based on the results, we construct a CED scheme that monitors the time and location of instruction execution, the executed operation, the utilized resources, as well as the executed and retired sequence of instructions. At a hardware cost of only 32 percent of the Scheduler, the corresponding CED scheme detects over 85 percent of its faults that affect the architectural state of the microprocessor. Furthermore, over 99.5 percent of these faults are detected before they corrupt the architectural state, while the average detection latency for the remaining faults is on the order of a few clock cycles, implying that efficient recovery methods can be developed.
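
The idea of detecting errors by monitoring invariances can be illustrated with a small software sketch. This is not the hardware scheme of the article: the `Event` record, its field names, and the two invariants checked here (an instruction does not issue before its operands are ready, and instructions retire in program order) are assumptions chosen only to show what "violation of an invariance signifies an error" can look like on a trace of scheduler activity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """One instruction's trip through a toy scheduler (hypothetical fields)."""
    seq: int           # program-order sequence number
    ready_cycle: int   # cycle at which all source operands are available
    issue_cycle: int   # cycle at which the instruction was issued
    retire_cycle: int  # cycle at which the instruction retired


def check_invariants(events: List[Event]) -> List[Tuple[int, str]]:
    """Return (seq, message) pairs for every invariant violated in the trace."""
    violations = []

    # Timing invariants: issue must not precede operand readiness,
    # and retirement must come strictly after issue.
    for e in events:
        if e.issue_cycle < e.ready_cycle:
            violations.append((e.seq, "issued before operands ready"))
        if e.retire_cycle <= e.issue_cycle:
            violations.append((e.seq, "retired before or at issue"))

    # Ordering invariant: when the trace is ordered by retirement time,
    # sequence numbers must never decrease. Ties in retire_cycle are broken
    # by seq, so several instructions may retire in the same cycle.
    by_retire = sorted(events, key=lambda e: (e.retire_cycle, e.seq))
    for prev, cur in zip(by_retire, by_retire[1:]):
        if prev.seq > cur.seq:
            violations.append((cur.seq, "retired out of program order"))

    return violations


if __name__ == "__main__":
    trace = [
        Event(seq=0, ready_cycle=1, issue_cycle=2, retire_cycle=5),
        Event(seq=1, ready_cycle=3, issue_cycle=2, retire_cycle=6),  # too early
    ]
    for seq, message in check_invariants(trace):
        print(f"instruction {seq}: {message}")
```

A hardware checker would evaluate such conditions concurrently with execution rather than on a recorded trace. The sketch only shows the logical form of the checks, not their cost or coverage.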

Index Terms:
Concurrent error detection, microprocessor, scheduler, invariance.
Naghmeh Karimi, Michail Maniatakos, Abhijit Jas, Chandrasekharan (Chandra) Tirumurti, Yiorgos Makris, "Workload-Cognizant Concurrent Error Detection in the Scheduler of a Modern Microprocessor," IEEE Transactions on Computers, vol. 60, no. 9, pp. 1274-1287, Sept. 2011, doi:10.1109/TC.2010.265