Concurrency Extraction Via Hardware Methods Executing the Static Instruction Stream
July 1992 (vol. 41 no. 7)
pp. 826-841

Hardware solutions to low-level (semantic) concurrency extraction are presented, focusing on reducing both control-flow and dataflow inhibitors of concurrency in general-purpose and scientific instruction streams. The first model, CONDEL-1, uses a control-flow model of the input code, based on the code's branch domains, to detect the reduced procedural dependencies in that code. This model allows branches to execute concurrently. The cost and delay of the model's concurrency hardware are demonstrated to be relatively low, especially for the detection of concurrency beyond branches. The reduced procedural dependency techniques of CONDEL-1 are combined with high-speed reduced data dependency techniques to yield a second machine model, CONDEL-2, which executes standard sequential code in a manner beyond dataflow. Simulation results are presented and analyzed, showing the model's functionality and performance improvement. The beneficial effects of limited software optimizations are also reviewed.

Index Terms:
concurrency extraction; general-purpose streams; semantic extraction; hardware methods; static instruction stream; control-flow; dataflow inhibitors; scientific instruction streams; CONDEL-1; input code control flow model; procedural dependencies; delay; data dependency; machine model; standard sequential code; performance improvement; software optimizations; computer architecture; concurrency control; performance evaluation.
A.K. Uht, "Concurrency Extraction Via Hardware Methods Executing the Static Instruction Stream," IEEE Transactions on Computers, vol. 41, no. 7, pp. 826-841, July 1992, doi:10.1109/12.256451