Design and Evaluation of Hardware Strategies for Reconfiguring Hypercubes and Meshes Under Faults
July 1994 (vol. 43 no. 7)
pp. 841-848

This paper discusses the design of two reconfiguration strategies for distributed memory multicomputer architectures under failures. The specific architectures to which we apply the techniques are hypercubes and meshes. The first scheme uses spare processors attached to certain processors in the hypercube or mesh using a novel embedding technique. The second approach places spare processors along specific links in the hypercube or mesh. Both schemes involve the mapping of logical links of a virtual machine onto a set of physical links in the final reconfigured machine and hence suffer some performance degradation. We characterize the performance degradation through trace-driven simulation of real applications running on the faulty and reconfigured system. We find that the schemes have high reliability, suffer little degradation in performance, and are very low in cost.
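
The abstract's remark that logical links are mapped onto sets of physical links, at some cost in performance, can be illustrated with a toy model. The sketch below is not the embedding proposed in the paper: it assumes a single spare wired to one host node of a small hypercube, and the names `SPARE`, `build_links`, `hops`, and `logical_link_lengths` are hypothetical. It only shows why a logical link of the virtual machine may span several physical links once a failed node's role moves to a spare.

```python
from collections import deque

SPARE = "spare"  # hypothetical label for the single spare processor


def build_links(dim, host):
    """Adjacency map of a dim-cube plus one spare wired to `host` (assumed layout)."""
    adj = {u: {u ^ (1 << b) for b in range(dim)} for u in range(1 << dim)}
    adj[SPARE] = {host}
    adj[host].add(SPARE)
    return adj


def hops(adj, src, dst, failed):
    """Length of the shortest physical path from src to dst that avoids `failed`."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt != failed and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # dst cannot be reached without the failed node


def logical_link_lengths(dim, host, failed):
    """Physical hops behind each logical link of `failed` once the spare replaces it."""
    adj = build_links(dim, host)
    neighbours = sorted(failed ^ (1 << b) for b in range(dim))
    return {n: hops(adj, SPARE, n, failed) for n in neighbours}


# 3-cube, spare attached to node 0, node 7 fails.
print(logical_link_lengths(3, host=0, failed=7))  # {3: 3, 5: 3, 6: 3}
```

In this toy layout every logical link of the failed node stretches from one physical hop to three, which is the kind of degradation the trace-driven simulation is meant to quantify. A failure of the host node itself leaves the spare unreachable here, so a practical scheme would need a more careful placement of spares.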


Index Terms:
hypercube networks; reconfigurable architectures; discrete event simulation; performance evaluation; distributed memory systems; hardware strategies; reconfiguring hypercubes; reconfiguring meshes; distributed memory multicomputer architectures; embedding technique; logical links mapping; virtual machine; performance degradation; trace-driven simulation.
P. Banerjee, M. Peercy, "Design and Evaluation of Hardware Strategies for Reconfiguring Hypercubes and Meshes Under Faults," IEEE Transactions on Computers, vol. 43, no. 7, pp. 841-848, July 1994, doi:10.1109/12.293264