Issue No.06 - June (2007 vol.18)
Dan Tsafrir, IEEE
Yoav Etsion, IEEE
Dror G. Feitelson, IEEE
<p><b>Abstract</b>—The most commonly used scheduling algorithm for parallel supercomputers is FCFS with backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). To make such determinations possible, users are required to provide estimates of how long jobs will run, and jobs that violate these estimates are killed. Empirical studies have repeatedly shown that user estimates are inaccurate, and that system-generated predictions based on history may be significantly better. However, predictions have not been incorporated into production schedulers, partially due to a misconception (that we resolve) claiming inaccuracy actually improves performance, but mainly because underprediction is technically unacceptable: Users will not tolerate jobs being killed just because system predictions were too short. We solve this problem by divorcing kill-time from the runtime prediction and correcting predictions adaptively as needed if they are proved wrong. The end result is a surprisingly simple scheduler, which requires minimal deviations from current practices (e.g., using FCFS as the basis) and behaves exactly like EASY as far as users are concerned; nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than mining the history for similar jobs. All the techniques suggested in this paper can be used to enhance any backfilling algorithm and are not limited to EASY.</p>
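The predictor and the correction step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `RecentAveragePredictor`, `correct_prediction`, and the `growth` factor are hypothetical. The sketch also assumes that a user with no history falls back to the user's own estimate, that a single past runtime is used as-is, and that a prediction proved too short is raised to a multiple of the elapsed time. The abstract states only that predictions average the last two runtimes of the same user, that they are corrected adaptively, and that kill-time stays tied to the user estimate.

```python
from collections import defaultdict, deque


class RecentAveragePredictor:
    """Predict a runtime as the mean of the same user's most recent runtimes."""

    def __init__(self, window=2):
        # Per-user history that keeps only the `window` most recent runtimes.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def predict(self, user, user_estimate):
        runs = self.history[user]
        if not runs:
            # Assumption: with no history, fall back to the user's estimate.
            return user_estimate
        # The user estimate remains the kill-time, so never predict beyond it.
        return min(sum(runs) / len(runs), user_estimate)

    def record(self, user, runtime):
        # Call when a job terminates; the oldest entry drops out automatically.
        self.history[user].append(runtime)


def correct_prediction(prediction, elapsed, user_estimate, growth=2.0):
    """Raise a prediction the job has outlived, without exceeding the kill-time.

    Assumption: the corrected value is `growth` times the elapsed runtime.
    """
    if elapsed < prediction:
        return prediction  # prediction has not been proved wrong yet
    return min(elapsed * growth, user_estimate)
```

In this sketch the scheduler would call `predict` at submission, use the result in place of the user estimate for backfilling decisions, call `correct_prediction` whenever a running job reaches its predicted end, and still kill the job only when it reaches `user_estimate`.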
Parallel job scheduling, backfilling, runtime estimates, system-generated predictions, history-based predictions, dynamic prediction correction, performance metrics, EASY, EASY++, SJBF.
Dan Tsafrir, Yoav Etsion, Dror G. Feitelson, "Backfilling Using System-Generated Predictions Rather than User Runtime Estimates", IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 6, pp. 789-803, June 2007, doi:10.1109/TPDS.2007.70606