Abstract

A high-performance computing (HPC) system, which is composed of a large number of components, is prone to failure. To maximize HPC system utilization, one should understand the failure behavior and the reliability of the system. Studies in the literature show that the time to failure of a node is best described by a Weibull distribution. In this study, we consider, without loss of generality, the Weibull as the distribution of time to failure and develop a reliability model for a system of k nodes where nodes can fail simultaneously. From this model, we develop expressions for the probability of failure of the system at any time t, for the failure rate, and for the mean time to failure. Also, we validate the model by using failure data from the Blue Gene/L logs obtained from the Lawrence Livermore National Laboratory. Results show that if failures of the components (nodes) in the system possess a degree of dependency, the system becomes less reliable, which means that the failure rate increases and the mean time to failure decreases. Also, an increase in the number of nodes decreases the reliability of the system.
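As a rough illustration of the quantities named above (system reliability at time t, failure rate, and mean time to failure), the following is a minimal sketch of the simplest baseline only: k identical nodes that fail independently, with the system counted as failed at the first node failure. It does not reproduce the dependent-failure model developed in the article. The function names and the parameter values are hypothetical, chosen for illustration and not fitted to the Blue Gene/L logs.

```python
import math

# Baseline assumption (not the article's dependent model): k identical,
# independent nodes, each with Weibull(shape, scale) time to failure;
# the system fails as soon as any one node fails.

def system_reliability(t, k, shape, scale):
    """Probability that none of the k nodes has failed by time t."""
    return math.exp(-k * (t / scale) ** shape)

def system_failure_rate(t, k, shape, scale):
    """Hazard rate of the first failure: k times the single-node hazard (t > 0)."""
    return k * (shape / scale) * (t / scale) ** (shape - 1)

def system_mttf(k, shape, scale):
    """Mean time to first failure.

    The minimum of k such Weibull variables is again Weibull with the same
    shape and with scale reduced to scale * k**(-1/shape).
    """
    return scale * k ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)

if __name__ == "__main__":
    shape, scale = 0.8, 1000.0  # hypothetical values, in arbitrary time units
    for k in (1, 16, 256):
        print(k, system_reliability(10.0, k, shape, scale), system_mttf(k, shape, scale))
```

Even in this independent baseline, increasing k lowers the reliability at any fixed t and shortens the mean time to failure, which is consistent with the last result stated in the abstract.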

Keywords

system reliability, system time to failure, Weibull distribution
