18th IEEE Symposium on Reliable Distributed Systems
Failure Data Analysis of a LAN of Windows NT Based Computers
Lausanne, Switzerland
October 18-21, 1999
ISBN: 0-7695-0290-3
This paper presents results of a failure data analysis of a LAN of Windows NT machines. Data for the study was obtained from event logs collected over a six-month period from the mail routing network of a commercial organization. The study focuses on characterizing the causes of machine reboots. The key observations from this study are: (1) most of the problems that lead to reboots are software-related, (2) rebooting the machine does not always solve the problem (in about 60% of the reboots, the rebooted machine reported problems within an hour or two of the reboot), (3) there are indications of propagated or correlated failures, and (4) although the average availability evaluates to over 99%, machine downtime lasts two hours on average. Since the machines are dedicated mail servers, bringing down one or more of them can potentially disrupt storage, forwarding, reception, and delivery of mail. This suggests that average availability is not a good measure for characterizing this type of network service.
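To illustrate observation (4), the short sketch below uses the standard steady-state definition of availability, uptime / (uptime + downtime), to show how a two-hour mean downtime can coexist with 99% availability. The function names and the exact 99% input are assumptions chosen for illustration; the abstract only states "over 99%", and this calculation is not taken from the paper.

```python
def availability(mean_uptime_hours: float, mean_downtime_hours: float) -> float:
    """Steady-state availability: fraction of time the machine is up."""
    return mean_uptime_hours / (mean_uptime_hours + mean_downtime_hours)


def uptime_for_availability(target: float, mean_downtime_hours: float) -> float:
    """Mean uptime between outages implied by a target availability."""
    return mean_downtime_hours * target / (1.0 - target)


# Illustrative inputs (assumed): exactly 99% availability, two-hour mean downtime.
mean_downtime = 2.0
uptime = uptime_for_availability(0.99, mean_downtime)

print(f"mean uptime between outages: {uptime:.0f} h")
print(f"availability check: {availability(uptime, mean_downtime):.4f}")
```

Under these assumed inputs, an outage of about two hours roughly every eight days still yields 99% availability, yet each outage can interrupt mail service for its full duration. That is the sense in which an average availability figure can hide the impact on users.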
Citation:
M. Kalyanakrishnam, Z. Kalbarczyk, R. Iyer, "Failure Data Analysis of a LAN of Windows NT Based Computers," 18th IEEE Symposium on Reliable Distributed Systems (SRDS), p. 178, 1999