The Hector Distributed Run-Time Environment
November 1998 (vol. 9 no. 11)
pp. 1102-1114

Abstract—Harnessing the computational capabilities of a network of workstations promises to off-load work from overloaded supercomputers onto largely idle resources overnight. Several capabilities are needed to do this, including support for an architecture-independent parallel programming environment, task migration, automatic resource allocation, and fault tolerance. The Hector distributed run-time environment is designed to present these capabilities transparently to programmers. MPI programs can be run under this environment on homogeneous clusters with no modifications to their source code needed. The design of Hector, its internal structure, and several benchmarks and tests are presented.

Index Terms:
Parallel computing, load balancing, fault tolerance, resource allocation, task migration.
Samuel H. Russ, Jonathan Robinson, Brian K. Flachs, Bjørn Heckel, "The Hector Distributed Run-Time Environment," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 11, pp. 1102-1114, Nov. 1998, doi:10.1109/71.735957