Issue No. 09 - September (2011 vol. 60)
Naghmeh Karimi, University of Tehran, Tehran
Michail Maniatakos, Yale University, New Haven
Abhijit Jas, Intel Corporation, Austin
Chandrasekharan (Chandra) Tirumurti, Intel Corporation, Santa Clara
Yiorgos Makris, Yale University, New Haven
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2010.265
We present a Concurrent Error Detection (CED) scheme for the Scheduler of a modern microprocessor. The proposed CED scheme is based on monitoring a set of invariances imposed through added hardware, violation of which signifies the occurrence of an error. The novelty of our solution stems from the workload-cognizant way in which these invariances are selected so that they leverage the application-level error masking inherent in program execution. Specifically, to ensure cost-effectiveness of the hardware employed to construct these invariances, we make use of information regarding the type and frequency of errors affecting the typical workload of the microprocessor. We thereby identify the most susceptible aspects of instruction execution and distribute CED resources accordingly to protect them. Our approach is demonstrated on the Scheduler of an Alpha-like superscalar microprocessor with dynamic scheduling, hybrid branch prediction, and out-of-order execution capabilities. Using an extensive fault-simulation infrastructure that we developed around this microprocessor, we profile the impact of Scheduler faults across a variety of SPEC2000 benchmarks. Based on the results, we construct a CED scheme which monitors the time and location of instruction execution, the executed operation, the utilized resources, as well as the executed and retired sequence of instructions. At a hardware cost of only 32 percent of the Scheduler, the corresponding CED scheme detects over 85 percent of the Scheduler faults that affect the architectural state of the microprocessor. Furthermore, over 99.5 percent of these faults are detected before they corrupt the architectural state, while the average detection latency for the remaining faults is on the order of a few clock cycles, implying that efficient recovery methods can be developed.
Concurrent error detection, microprocessor, scheduler, invariance.
Naghmeh Karimi, Michail Maniatakos, Abhijit Jas, Chandrasekharan (Chandra) Tirumurti, Yiorgos Makris, "Workload-Cognizant Concurrent Error Detection in the Scheduler of a Modern Microprocessor", IEEE Transactions on Computers, vol. 60, no. 9, pp. 1274-1287, September 2011, doi:10.1109/TC.2010.265