Kishor Trivedi

Duke University
Box 90291, ECE Dept., 206 Hudson Hall
Durham, NC 27708-0291
USA
Phone: (919) 660-5269
Fax: (919) 660-5293
Email: kst@ee.duke.edu


DVP term expires December 2013

Kishor Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. He also holds a joint appointment in the Department of Computer Science at Duke. He was the Duke-Site Director of an NSF Industry-University Cooperative Research Center between NC State University and Duke University for carrying out applied research in computing and communications. He has been on the Duke faculty since 1975. He has served as a Principal Investigator on various AFOSR, ARO, Burroughs, DARPA, Draper Lab, IBM, DEC, Alcatel, Telcordia, Motorola, NASA, NIH, ONR, NSWC, Boeing, Union Switch and Signals, NSF, and SPC funded projects and as a consultant to industry and research laboratories. He was an Editor of the IEEE Transactions on Computers from 1983 to 1987. He is on the editorial board of the IEEE Transactions on Dependable and Secure Computing. He is a co-designer of the HARP, SAVE, SHARPE, SPNP, and SREPT modeling packages, which have been widely circulated. He is the author of the well-known text Probability and Statistics with Reliability, Queuing and Computer Science Applications, originally published by Prentice-Hall; a thoroughly revised second edition has been published by Wiley, and a comprehensive solution manual for the second edition, containing more than 300 problem solutions, is available from the publisher (John Wiley). He has also published two other books, Performance and Reliability Analysis of Computer Systems (Kluwer Academic Publishers) and Queueing Networks and Markov Chains (John Wiley); the second edition of the latter was published in 2006. He has edited two books, Advanced Computer System Design (Gordon and Breach Science Publishers) and Performability Modeling Tools and Techniques (John Wiley & Sons). His research interests are in reliability and performance assessment of computer and communication systems. He has published over 450 articles and lectured extensively on these topics. He has supervised 42 Ph.D. dissertations. He is a Fellow of the Institute of Electrical and Electronics Engineers and a Golden Core Member of the IEEE Computer Society.

Recent research accomplishments include three areas of activity: advances in modeling techniques; performance, reliability, availability and survivability modeling of applications; and development and dissemination of modeling tools. Kishor and his colleagues have developed polynomial-time algorithms for performability analysis, numerical solution techniques for completion time problems, algorithms for the numerical solution of the response time distribution in a closed queueing network, techniques to solve large and stiff Markov chains, and algorithms for the automated generation and solution of stochastic reward nets, including sensitivity and transient analysis. His group has also developed fast algorithms for the solution of large fault trees and reliability graphs, including multistate components and phased-mission systems analysis, and has introduced the new formalisms of fluid stochastic Petri nets and Markov regenerative stochastic Petri nets. His group has developed several tools - SHARPE, SPNP and SREPT - which have been used at over 500 academic and industrial laboratories, and graphical user interfaces for these tools have recently been developed. These tools also form the core of Boeing's integrated reliability analysis package. Kishor's group has been at the forefront of developing the fundamentals of software aging and rejuvenation, and his methods of software rejuvenation have been implemented in the IBM xSeries servers.
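By way of illustration, the following minimal sketch shows the kind of computation these tools automate: transient analysis of a small continuous-time Markov chain. The three-state structure, the rates and the time points are assumptions chosen purely for illustration; SHARPE and SPNP accept far larger models described in their own input languages.

```python
import numpy as np
from scipy.linalg import expm

lam1, lam2, mu = 1e-3, 5e-3, 0.5          # assumed failure/repair rates (per hour)

# Generator matrix Q for states 0 = up, 1 = degraded, 2 = down
Q = np.array([[-lam1,  lam1,   0.0],
              [  0.0, -lam2,  lam2],
              [   mu,   0.0,   -mu]])

p0 = np.array([1.0, 0.0, 0.0])            # system starts in the 'up' state

for t in (10.0, 100.0, 1000.0):           # hours
    pt = p0 @ expm(Q * t)                 # pi(t) = pi(0) * exp(Q t)
    print(f"t={t:7.1f} h  P(up)={pt[0]:.4f}  P(degraded)={pt[1]:.4f}  P(down)={pt[2]:.4f}")
```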


Software Aging and Rejuvenation: Modeling and Analysis
Recently, the phenomenon of "software aging", in which the state of a software system degrades with time, has been reported. The primary causes of this degradation are the exhaustion of operating system resources, data corruption and numerical error accumulation. This may eventually lead to performance degradation of the software, crash/hang failure, or both. Software aging has been reported in widely used software and also in high-availability and safety-critical systems. To counteract this phenomenon, a proactive approach to fault management, called "software rejuvenation", has been proposed. This essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. This process removes the accumulated errors and frees up operating system resources. The preventive action can be performed at optimal times (for example, when the load on the system is low) so that the overhead due to planned system downtime is minimal. This method therefore avoids unplanned and potentially expensive system outages due to software aging.

In this talk, we will discuss methods of evaluating the effectiveness of proactive fault management in operational software systems and of determining optimal times to perform rejuvenation. This is done by developing stochastic models which trade off the cost of unexpected failures due to software aging against the overhead of proactive fault management. The second half of the talk will deal with measurement-based models, which are constructed using workload and resource usage data collected from operating systems over a period of time. The measurement-based models are a first step towards predicting aging-related failures and are intended to help develop strategies for software rejuvenation triggered by actual measurements. Finally, we discuss the implementation of a software rejuvenation agent in a major commercial server.
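As a hedged illustration of this tradeoff (not the exact models from the talk), the sketch below uses a simple renewal-reward formulation: aging failures follow an assumed Weibull distribution with increasing failure rate, an unplanned failure costs more downtime than a planned rejuvenation, and the rejuvenation interval that minimizes the long-run downtime fraction is found numerically. All parameter values are made up for illustration.

```python
import numpy as np
from scipy.stats import weibull_min
from scipy.integrate import quad

shape, scale = 2.0, 1000.0      # assumed aging (increasing-failure-rate) time-to-failure, hours
c_fail, c_rejuv = 4.0, 0.25     # assumed downtime per unplanned failure vs. planned rejuvenation, hours

F = lambda t: weibull_min.cdf(t, shape, scale=scale)

def downtime_fraction(delta):
    """Long-run fraction of time spent down when rejuvenating every `delta` hours."""
    up_time, _ = quad(lambda t: 1.0 - F(t), 0.0, delta)          # expected uptime per cycle
    down_time = c_fail * F(delta) + c_rejuv * (1.0 - F(delta))   # expected downtime per cycle
    return down_time / (up_time + down_time)

deltas = np.linspace(50, 2000, 200)
best = min(deltas, key=downtime_fraction)
print(f"optimal rejuvenation interval ~ {best:.0f} h, "
      f"downtime fraction {downtime_fraction(best):.5f}")
```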

End-to-End Performability Analysis for Infrastructure-as-a-Service Cloud
Handling diverse client demands and managing unexpected failures without degrading performance are two key promises of a cloud-delivered service. However, evaluating cloud service quality becomes difficult as the scale and complexity of cloud systems increase. In a cloud environment, a service request from a user goes through a variety of provider-specific processing steps from the instant it is submitted until the service is fully delivered. Measurement-based evaluation is expensive, especially if many configurations, workload scenarios, and management methods are to be analyzed. To overcome these difficulties, in this talk we propose a general analytic-model-based approach for end-to-end performability analysis of a cloud service. We illustrate our approach using an Infrastructure-as-a-Service (IaaS) cloud, where service availability and provisioning delays are two key QoS metrics. A novelty of our approach is in reducing the complexity of analysis by dividing the overall model into multiple interacting stochastic process models and then obtaining the overall solution by iteration over the individual sub-model solutions. In contrast to a single one-level monolithic model, our approach yields a high-fidelity model that is tractable and scalable. Our approach and underlying models can be readily extended to other types of cloud services and are applicable to public, private and hybrid clouds.
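To make the sub-model interaction idea concrete, here is a small hedged sketch under assumed structure and numbers (it is not the talk's actual IaaS model): an availability sub-model supplies the expected number of operational machines to an M/M/c provisioning sub-model, whose utilization in turn feeds back into a load-dependent failure rate, and the two are solved by fixed-point iteration.

```python
import math

lam_req, mu_prov = 40.0, 1.0        # assumed request arrival and per-machine provisioning rates (1/h)
mttf0, mttr = 500.0, 4.0            # assumed baseline machine MTTF and MTTR (h)
n_machines = 60

def erlang_c(c, a):
    """Erlang-C probability of queueing for c servers and offered load a (Erlangs)."""
    s = sum(a**k / math.factorial(k) for k in range(c))
    last = a**c / (math.factorial(c) * (1 - a / c))
    return last / (s + last)

rho = 0.5                                        # initial guess of per-machine utilization
for _ in range(100):
    mttf = mttf0 * (1.0 - 0.5 * rho)             # assumed load-dependent failure behavior
    avail = mttf / (mttf + mttr)                 # per-machine steady-state availability
    c_up = max(1, int(n_machines * avail))       # expected operational machines
    a = lam_req / mu_prov                        # offered provisioning load (Erlangs)
    rho_new = min(a / c_up, 0.99)
    if abs(rho_new - rho) < 1e-9:                # fixed point reached
        break
    rho = rho_new

wq = erlang_c(c_up, a) / (c_up * mu_prov - lam_req) if c_up * mu_prov > lam_req else float("inf")
print(f"operational machines ~ {c_up}, mean provisioning queueing delay ~ {wq:.3f} h")
```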

Availability Modeling in Practice

The successful development and marketing of commercial high-availability systems requires the ability to evaluate system availability. Specifically, one should be able to demonstrate that projected customer requirements are met, to identify availability bottlenecks, to evaluate and compare different configurations, and to evaluate and compare different designs. For evaluation approaches based on analytic modeling, these systems are often sufficiently complex that state-space methods are not effective due to the large number of states, whereas combinatorial methods are inadequate for capturing all significant dependencies. The two-level (or multi-level) hierarchical composition proposed here has been found suitable for the availability modeling of many commercial systems at Cisco, EMC, IBM, Motorola, NEC, Sun Microsystems, and others. As an example, we present the availability model of a high-availability SIP Application Server configuration on WebSphere. Hardware, operating system and application server failures are considered, along with different types of fault detectors, detection delays, failover delays, restarts, reboots and repairs. Imperfect coverage for detection, failover and recovery is incorporated. The parameter values used in the calculations are based on several sources, including field data, high-availability testing, and agreed-upon assumptions. In cases where a parameter value is uncertain, due to assumptions or limited test data, a sensitivity analysis of that parameter is carried out. Relaxation of some of the assumptions will be discussed, as well as the difficulties encountered while carrying out such projects.
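The following sketch illustrates the flavor of such a two-level composition; the structure, the coverage factor and all numbers are assumptions for illustration and are not the actual WebSphere/SIP model. Lower-level Markov models produce per-subsystem steady-state availabilities, which an upper-level series reliability block diagram then combines.

```python
import numpy as np

def ctmc_availability(Q, up_states):
    """Steady-state probability of the 'up' states of a small CTMC with generator Q."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])          # solve pi Q = 0 together with sum(pi) = 1
    b = np.zeros(n + 1); b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return pi[list(up_states)].sum()

# Lower level: a 2-unit application-server failover pair with imperfect
# detection/failover coverage c.  States: 0 = both up, 1 = one up after a
# successful failover, 2 = down (failover failed or both units failed).
lam, mu, c = 1 / 2000.0, 1 / 2.0, 0.95        # assumed failure rate, repair rate, coverage
Q_as = np.array([[-2 * lam, 2 * c * lam, 2 * (1 - c) * lam],
                 [mu, -(mu + lam), lam],
                 [0.0, mu, -mu]])
A_as = ctmc_availability(Q_as, up_states=[0, 1])

# Lower level: hardware and OS as simple two-state up/down models, MTTF/(MTTF+MTTR).
A_hw = 10000.0 / (10000.0 + 8.0)              # assumed hours
A_os = 4000.0 / (4000.0 + 1.0)

# Upper level: series composition of the three subsystems.
A_system = A_hw * A_os * A_as
print(f"subsystem availabilities: hw={A_hw:.6f} os={A_os:.6f} as={A_as:.6f}")
print(f"system availability       = {A_system:.6f}")
```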

A Non-Obtrusive Method for Uncertainty Propagation in Analytic Dependability Models
Abstract: In this talk, a method for propagating the epistemic uncertainty in model parameters through the system dependability model is discussed. This method acts as a wrapper around already existing stochastic models and does not need to manipulate the basic model, giving it a wide range of applicability and ease of use. It is also independent of the solution method of the underlying model; pre-existing model solution methods or tools are relied upon. The applicability of this method is illustrated with some real examples. While our examples discuss confidence intervals for system availability, service reliability and performance, the method can be directly applied to compute the uncertainty in the output metrics of other stochastic analytic models of dependability, performance and performability. It is important to note that although this is a sampling-based method, no simulation is carried out; rather, the underlying analytic model itself is executed. An adequate number of samples over the parameter space is chosen and the analytic model is solved at each set of sampled parameter values. Statistical analysis of the output vector then yields the distribution and confidence intervals of the model output. Latin Hypercube Sampling (LHS) and random sampling are applied and their robustness is compared.
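A minimal sketch of the sampling loop just described is shown below; the parameter ranges and the closed-form availability "model" are placeholders for illustration, where in practice the call at each sample would invoke a pre-existing solver such as SHARPE or SPNP.

```python
import numpy as np
from scipy.stats import qmc

n = 1000
sampler = qmc.LatinHypercube(d=2, seed=42)
u = sampler.random(n)                               # LHS points in the unit square

# Assumed epistemic ranges for the two model parameters (illustration only):
# MTTF in [800, 1200] h, MTTR in [1, 4] h.
params = qmc.scale(u, l_bounds=[800.0, 1.0], u_bounds=[1200.0, 4.0])
mttf, mttr = params[:, 0], params[:, 1]

# "Underlying analytic model": here a closed-form steady-state availability;
# a real study would solve the full dependability model at each sample.
avail = mttf / (mttf + mttr)

lo, med, hi = np.percentile(avail, [2.5, 50.0, 97.5])
print(f"availability: median {med:.5f}, 95% interval [{lo:.5f}, {hi:.5f}]")
```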

Low Power Test
Abstract: Test power is a well-known issue for design-for-test (DFT) engineers. Many techniques have already been developed to address this issue during both shift and at-speed test. Most of these techniques focus on the assignment of don't-care bits in test patterns to reduce switching activity in the scan chain and the combinational circuit. However, the main issue, namely the need for accurate power models and analysis to fully understand the behavior of the circuit and the power distribution network in the presence of switching activity, has yet to be fully addressed. In addition, no analysis has been done to understand the localized effects of switching in the circuit. Also, due to increased test power, large localized current spikes in the chip can negatively impact the wafer test process by burning test probes. Identification of the patterns that cause such peak currents and the design of power distribution networks that are tolerant to such current spikes are part of this research. This talk will briefly describe some of our findings about the above issues.
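As a rough illustration of how peak-power patterns might be screened (this is a generic weighted-transition-metric proxy, not the power model developed in this research), the sketch below ranks scan patterns by their weighted transition count and flags those above an assumed threshold as candidates for causing current spikes.

```python
def weighted_transitions(pattern: str) -> int:
    """Weighted transition metric for one scan-in pattern: a transition at bit i
    is shifted through the remaining scan cells, so it is weighted by that depth."""
    n = len(pattern)
    return sum((n - i - 1) for i in range(n - 1) if pattern[i] != pattern[i + 1])

# Made-up example patterns and screening threshold, for illustration only.
patterns = ["0101100110", "0000011111", "1111111111", "1010101010"]
scores = {p: weighted_transitions(p) for p in patterns}

threshold = 0.6 * max(scores.values())
peaky = [p for p, s in scores.items() if s >= threshold]
print("weighted transitions per pattern:", scores)
print("peak-power candidates:", peaky)
```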