Resource Allocation for Primary-Site Fault-Tolerant Systems
February 1993 (vol. 19 no. 2)
pp. 108-119

Resource allocation for a distributed system employing the primary-site approach to fault tolerance is discussed. Two kinds of systems are considered. The first consists of fault-tolerant nodes, each with several duplicated servers: one server is the primary, which serves user requests, and the rest are backups. The second has no fault-tolerant nodes; instead, each node uses other nodes as backups to tolerate node failures. When a node fails, all requests initially allocated to it are served by one of its backups. To study resource allocation in such systems, an approximate model is developed for each. Using these models, efficient allocation algorithms are presented that take into account the system's failure and repair rates and the fault-tolerance overheads. Experimental results show that the algorithms give optimal or near-optimal allocations. The algorithms incur little overhead and can improve system performance significantly over an intuitive allocation algorithm.
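
As a rough illustration of why failure and repair rates matter for allocation in the second kind of system, the sketch below estimates the long-run load each node carries once requests from failed nodes are redirected to a backup. It is not the paper's model or algorithm: the `Node` record, the single-backup assumption, the two-state up/down availability formula, and the neglect of simultaneous failures are all simplifying assumptions made here for illustration.

```python
from typing import Dict, NamedTuple


class Node(NamedTuple):
    """Hypothetical node description (names and fields are assumptions)."""
    load: float          # request rate initially allocated to this node
    failure_rate: float  # failures per unit time
    repair_rate: float   # repairs per unit time
    backup: str          # name of the single node that takes over on failure


def availability(node: Node) -> float:
    """Steady-state probability that a node is up in a two-state up/down model."""
    return node.repair_rate / (node.failure_rate + node.repair_rate)


def expected_loads(nodes: Dict[str, Node]) -> Dict[str, float]:
    """Own load plus the load inherited, on average, from nodes backed up.

    Assumes independent failures and ignores the case where a node and its
    backup are down at the same time.
    """
    loads = {name: node.load for name, node in nodes.items()}
    for node in nodes.values():
        # While this node is down, its backup serves all of its requests.
        loads[node.backup] += node.load * (1.0 - availability(node))
    return loads


example_nodes = {
    "a": Node(load=10.0, failure_rate=1.0, repair_rate=9.0, backup="b"),
    "b": Node(load=20.0, failure_rate=1.0, repair_rate=3.0, backup="a"),
}
example_loads = expected_loads(example_nodes)
```

Under these assumptions, a node that backs up an unreliable, heavily loaded neighbor carries noticeably more than its nominal share, which is the kind of effect an allocation that ignores failures would miss.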

Index Terms:
resource allocation; primary-site fault-tolerant systems; distributed system; server; node failures; approximate model; system performance; distributed processing; fault tolerant computing; file servers; performance evaluation
Y. Huang, S.K. Tripathi, "Resource Allocation for Primary-Site Fault-Tolerant Systems," IEEE Transactions on Software Engineering, vol. 19, no. 2, pp. 108-119, Feb. 1993, doi:10.1109/32.214829