Article

Hard Disk Failure Prediction Based on Blending Ensemble Learning

College of Information Science and Technology, Ocean University of China, Qingdao 266100, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(5), 3288; https://doi.org/10.3390/app13053288
Submission received: 30 January 2023 / Revised: 20 February 2023 / Accepted: 27 February 2023 / Published: 4 March 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

As the most widely used storage devices today, hard disks are efficient and convenient, but the damage incurred in the event of a failure can be very significant. Therefore, early warnings before hard disk failure, allowing the stored content to be backed up and transferred in advance, can reduce many losses. In recent years, a steady stream of research on hard disk failure prediction has emerged. The detection accuracy of various methods, from basic machine learning models, such as decision trees and random forests, to deep learning methods, such as BP neural networks and recurrent neural networks, has also been improving. In this paper, based on the idea of blending ensemble learning, a novel failure prediction method combining machine learning algorithms and neural networks is proposed and evaluated on the publicly available BackBlaze hard disk datasets. The failure prediction experiments use only S.M.A.R.T. attributes, that is, the features collected by self-monitoring analysis and reporting technology, which are recorded internally during the operation of the hard disk. The experimental results show that this ensemble learning model outperforms other independent models in terms of the Matthews correlation coefficient. Additionally, through experiments on multiple models of hard disks, an ensemble learning model with high performance on most hard disk models is found, which addresses the low robustness and generalization of traditional machine learning methods and demonstrates the effectiveness and broad applicability of this method.

1. Introduction

1.1. Background

Hard disks are the primary and most important storage devices used in computers today, and many data centers rely on large numbers of hard disks to store vital information. With the rapid development of the Internet and cloud platforms, the storage and processing of massive data have brought severe challenges to the relevant personnel and storage systems [1]. Even a small hard disk failure can cause a significant amount of data loss and thus economic losses. To mitigate this risk, S.M.A.R.T. (self-monitoring analysis and reporting technology) was developed in the 1990s. Led by Compaq and developed in collaboration with several hard disk manufacturers, this technology monitors various kinds of working information inside the hard disk, such as the number of reads and writes, the number of head loads/unloads, the seek error rate, and the current ambient temperature of the hard disk, and writes them to designated registers in binary form. These attribute values are usually updated once a day. Users can compare these attribute values with the thresholds specified by the manufacturers and take measures to deal with impending failures in advance. Some hard disks can even raise an alarm to remind users to back up data and reduce losses. Ideally, as long as the attribute values of a hard disk do not exceed their thresholds and remain within a reasonable range, the hard disk will not have problems [2].
There are many reasons why a hard disk fails. Failure can be caused by internal factors, such as parts damaged by excessively long service time or sector failures caused by accidents, while external factors, such as vibration, dust, static electricity, magnetic fields, and unstable voltage, affect hard disks directly or indirectly. Many factors can therefore change the S.M.A.R.T. attribute values. However, simply comparing the changing S.M.A.R.T. values against the manufacturer-set thresholds, that is, checking whether specific attribute values stay within their specified ranges and limits, yields a very low detection rate of only about 10%, even though it can avoid certain losses.
Therefore, scholars have made use of the technology in the computer field that has gradually matured to predict hard disk failures and achieved good results.

1.2. Related Work

Hughes et al. [3] improved the distribution-free statistical hypothesis testing algorithm that replaced the maximum error threshold warning algorithm and proposed two new methods based on this mathematical statistical approach, namely the ORing single-attribute rank-sum test and the multivariate rank-sum test. These authors conducted experiments based on S.M.A.R.T. characteristics and obtained failure warning accuracy rates of about 40–60% with a false alarm rate of around 0.2–0.5%. Wang Yu et al. [4] proposed a dynamic tracking method for hard disk failure prediction based on a switchable-state stochastic process model; a Rao–Blackwellized particle filter was used to update the model estimates and parameters, and a dynamic failure threshold was designed. However, the overall framework was complex, and its universality was not strong. Compared to mathematical statistical methods, the machine learning technology that has risen and matured in recent years is well suited to this task, and various methods can achieve good results. Yang Qibo et al. [5] surveyed four feature selection methods and eight machine learning methods for anomaly detection. Through permutations and combinations, they conducted experiments on two datasets and identified the combination of feature selection and anomaly detection methods with the best effect at that time, providing a good direction for subsequent researchers. Vikas Tomer et al. experimented with the naive Bayes, decision tree, and random forest algorithms, using accuracy, precision, AUC, and running time as criteria; their results indicated that the random forest algorithm performed best.
Furthermore, with the proposal and development of deep learning, neural networks have attracted increasing attention. Various neural networks that can perform complex operations and process time series data can predict hard disk failures more accurately and efficiently. Hu Lihan et al. [1] used an LSTM network to process time series data and obtained better results than machine learning methods; however, the model may age with long-term use, and its structure needs further improvement. Building on Hu Lihan's work, Cahyadi et al. [6] improved the previous network structure and focused on the problem of data imbalance. They used a partial undersampling method and obtained better results, but the model was insufficiently optimized, and better sampling methods were not considered. After bidirectional LSTM networks became widespread, Austin Coursey et al. [2] proposed a new method of data standardization and used the new model in experiments; unlike previous methods, which performed classification tasks, it was used to predict the remaining life of hard disks, and the curve fitting was very good. Alessio Burrello et al. [7] introduced a new convolutional network known as the TCN (temporal convolutional network) for time series analysis and used SMOTE (synthetic minority over-sampling technique) to deal with data imbalance. Luo Chuan et al. [8] not only used the characteristics of the hard disk itself but also considered the influence of the hard disk's neighborhood and proposed a new sampling method, TPS (temporal progressive sampling), to mitigate the effect of data imbalance. Lu Sidi [9] took S.M.A.R.T. attributes as learning features and added the performance data of hard disks and servers as well as the physical spatial location of the hard disk. These authors used a CNN-LSTM network for their experiments, and the final results showed that considering all three kinds of indicators gave the best performance.
Table 1 shows the methods used by previous researchers and the results obtained. From the models and methods used by these predecessors, we note that all the above methods for hard disk failure prediction are standalone machine learning or deep learning methods, which generally suffer from long running times and require large amounts of training data. Recent articles have focused only on applying new techniques, and the robustness and universality of any single model have certain limitations [10,11]. Therefore, we believe that the idea of ensemble learning can be used to combine machine learning methods with deep neural networks to obtain better results with shorter training times and to create models with better universality.
Based on the blending ensemble learning method, this paper proposes a hard disk failure prediction method that combines machine learning algorithms and deep neural networks and conducts experiments on the public datasets collected by BackBlaze. The Matthews correlation coefficient is used as the evaluation criterion; the value obtained by our method is higher than that of any separate model, and the training time of the model is markedly reduced, proving the effectiveness of the method. In addition, we fixed the dataset of one hard disk model as the training set and tested six different hard disk models from the same and different manufacturers. From the experimental results, an ensemble learning model that achieves good results on all datasets was constructed, which has high universality and robustness.
The rest of this paper is organized as follows: Section 2 briefly summarizes the background of ensemble learning and the history of its development. The method framework and sub-methods used in this paper are described and introduced in Section 3. Section 4 introduces the dataset used in this paper and the method of preprocessing and compares the results of the experiments. The last section is the conclusion, which summarizes the full text, points out the shortcomings, and provides an outlook for future work.

2. Ensemble Learning

Ensemble learning, an important branch of machine learning, emerged as early as 1988 [12]. Its basic idea is to use N weak learners as base learners, integrate their results according to different rules, and treat the whole as a strong learner so that the final results are more accurate.
Bagging and boosting are commonly used ensemble learning strategies. The former was put forward by Leo Breiman of the University of California, Berkeley, as early as 1996 [13]. This training-sample-perturbation strategy trains multiple different base learners in parallel on repeated random samples drawn with replacement and then takes the mode or average of all results, depending on the task, to obtain the final result. Later, in 1998, Tin Kam Ho [14] proposed the random subspace method (RSM), which uses the strategy of input attribute perturbation: each base classifier is trained on a randomly selected subset of features rather than all features, and the final result is obtained by taking the average or the mode of the base classifiers' results. Combining these two random perturbation approaches, Leo Breiman [15] proposed the random forest algorithm in 2001, which remains the representative algorithm of the bagging strategy to this day. Correspondingly, the boosting method proposed by Robert E. Schapire in 1990 [16] also follows the strategy of training sample perturbation. Its basic idea is to increase the weight of misclassified samples after each training round of the base learner before training the next base learner, until a strong learner with good results is obtained. Because the base learners must be trained serially, there is a strong dependency between them. Taking the representative algorithm AdaBoost as an example, its final result can be a weighted vote of the base learners. Other such algorithms are GBDT and its improved versions, such as XGBoost and LightGBM. In addition, there are strategies such as algorithm parameter perturbation, output label perturbation, and mixtures of different perturbation methods to train different base classifiers. Several of these representative algorithms are presented in Section 3.
In addition to the two common ensemble learning strategies described above, there are two extended strategies: stacking [17] and blending, the technique used in this paper. The principles of the two strategies are basically the same. Stacking has two layers. The first layer applies k-fold cross-validation to the training sets using M base learners to obtain k results on the validation sets and k results on the test sets; these results are then grouped and spliced to form new training and test sets [18]. The new training and test sets discard the original features of the data and replace them with M new feature columns, that is, the prediction results of the M base learners. Finally, in the second layer, another learner is trained and tested on the new dataset to obtain the final results. Blending omits the k-fold cross-validation used in stacking, which simplifies the algorithm and greatly reduces the running time, while the training effect is not significantly different. Therefore, this paper chose the blending ensemble learning method as the basis for the experimental framework.
Algorithm 1 presents the pseudo code of the construction process of the blending ensemble learning model.
Algorithm 1 The construction process of the blending ensemble learning model.
Input: training set $D = \{(x_i, y_i)\}_{i=1}^{m}$
Output: ensemble classifier model $H$
  1: Step 1: learn the base classifiers
  2: for t = 1 to T do
  3:       learn $h_t$ based on $D$
  4: end for
  5: Step 2: construct the new training set
  6: for i = 1 to m do
  7:       $D_h = \{(x_i', y_i)\}$, where $x_i' = (h_1(x_i), \ldots, h_T(x_i))$
  8: end for
  9: Step 3: learn an ensemble classifier
  10: learn $H$ based on $D_h$
  11: return $H$
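To make the procedure concrete, the following minimal Python sketch implements Algorithm 1 with scikit-learn-style estimators. The hold-out split ratio, the use of predicted probabilities as the new features, and the helper name blend_fit_predict are our assumptions for illustration, not the exact configuration used in the experiments.

import numpy as np
from sklearn.model_selection import train_test_split

def blend_fit_predict(X_train, y_train, X_test, base_learners, meta_learner):
    # Blending: hold out part of the training set; the base learners h_1..h_T
    # are fit on the remainder, and their predictions on the hold-out set
    # become the meta-learner's training features (no k-fold CV, unlike stacking).
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X_train, y_train, test_size=0.3, stratify=y_train, random_state=0)
    hold_feats, test_feats = [], []
    for h in base_learners:
        h.fit(X_fit, y_fit)
        hold_feats.append(h.predict_proba(X_hold)[:, 1])
        test_feats.append(h.predict_proba(X_test)[:, 1])
    D_h = np.column_stack(hold_feats)        # new training set D_h with T columns
    D_test = np.column_stack(test_feats)     # new test set with T columns
    meta_learner.fit(D_h, y_hold)            # second-layer learner H
    return meta_learner.predict(D_test)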

3. Materials and Methods

The algorithms used for each component of the ensemble learning model in this experiment are described in this section. Because this paper ultimately addresses a binary classification problem, we focus on the principles and characteristics of each algorithm in classification tasks.

3.1. Weak Learners

3.1.1. Logistic Regression

Logistic regression is a machine learning algorithm often used for classification. It is similar to linear regression; the difference lies in the function used to fit the data. Linear regression computes a linear function that fits the learned data in order to make predictions at new points. Logistic regression, by contrast, fits the data with a sigmoid function, which is better suited to classification than a linear function. Since the value range of the sigmoid function is (0, 1), in a binary classification task the value of the sigmoid function at a point can be regarded as the probability of the positive class, and the prediction at that point can then be determined by comparison with a set threshold.
The fitting function of multiple linear regression has the form $y = \omega_0 x_0 + \omega_1 x_1 + \cdots + \omega_n x_n$, i.e., $y = W^T X$, and the fitting function of logistic regression, $g(z) = \frac{1}{1 + e^{-W^T X}}$, is obtained by substituting it into the sigmoid function. This yields the probability that a point is predicted to be a positive sample, from which the probability of a successful prediction at that point follows:
$P(\mathrm{true}) = (g(\omega, x_i))^{y_i} \, (1 - g(\omega, x_i))^{1 - y_i}$  (1)
where $y_i$ refers to the label of the sample, which is either 0 or 1 in the binary classification task.
After obtaining the probability of a successful prediction on one sample, we wish to maximize the prediction success rate over all samples, that is, to maximize the product of the success probabilities at all points. To solve this with maximum likelihood estimation, let $h_\theta(x) = \frac{1}{1 + e^{-x}}$. The maximum likelihood function is then expressed as follows:
$L(\theta) = \prod_{i=1}^{m} \left( h_\theta(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1 - y^{(i)}}$  (2)
Taking the logarithm and negating both sides yields:
$J_{\log}(\omega) = \sum_{i=1}^{m} \left[ -y_i \log p(x_i; \omega) - (1 - y_i) \log \left( 1 - p(x_i; \omega) \right) \right]$  (3)
Therefore, our goal is to minimize Equation (3), which is the loss function of logistic regression, also known as the cross-entropy loss function. The gradient descent method is then used to find the minimum and the corresponding $\omega$.
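As an illustration of this derivation, the following NumPy sketch minimizes the cross-entropy loss of Equation (3) by batch gradient descent; the learning rate and iteration count are illustrative values, not the settings used in the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    # Batch gradient descent on J_log(w); the gradient of the
    # cross-entropy loss with respect to w is X^T (p - y) / m.
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                 # predicted P(y = 1 | x)
        w -= lr * X.T @ (p - y) / len(y)   # gradient step
    return w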

3.1.2. K-Nearest Neighbor

Compared with logistic regression, the k-nearest neighbors (KNN) algorithm is much simpler. Its main idea is to calculate the distances from the selected point to the nearby points, select the k points with the shortest distances, count the classes of these k points, and assign the selected point to the most frequent class. The distances used are mainly the Euclidean distance and the Manhattan distance, of which the Euclidean distance is more widely used.
Euclidean distance: $d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$  (4)
Manhattan distance: $d(x, y) = \sum_{k=1}^{n} |x_k - y_k|$  (5)
These two equations take the n-dimensional space as an example, where $x_k$ and $y_k$ are the k-th dimensional values of the two points, respectively.
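For reference, the two distance metrics of Equations (4) and (5) and the voting step can be written in a few lines of NumPy; this is an illustrative sketch, with k and the integer label encoding assumed.

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))   # Equation (4)

def manhattan(x, y):
    return np.sum(np.abs(x - y))           # Equation (5)

def knn_predict(X_train, y_train, x, k=5):
    # Vote among the k nearest training points (Euclidean distance);
    # y_train is assumed to hold non-negative integer class labels.
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return np.bincount(y_train[nearest]).argmax()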

3.1.3. Support-Vector Machine

The focus of an SVM classifier is to find a hyperplane that separates the two classes of samples. For samples with two-dimensional feature vectors, the hyperplane is a line; for sample sets with three-dimensional feature vectors, the hyperplane is a plane. The dimension of the hyperplane is therefore always one less than the dimension of the feature vector.
As shown in Figure 1, the margin refers to the sum of the perpendicular distances to the hyperplane from the sample points of the two classes lying on either side of it. The process of finding the desired hyperplane is the process of finding the maximum margin. Maximizing the margin enhances robustness and minimizes the classification error rate. The sample points that determine the maximum margin are called support vectors, which is the origin of the name support-vector machine.

3.1.4. Naive Bayes

The naive Bayes classifier uses Bayes' theorem to compute class probabilities and thereby perform classification. “Naive” means that all features are assumed to be independent of each other.
According to Bayes' rule, the probability of a class given the features can be written as:
$p(\text{class} \mid \text{features}) = \dfrac{p(\text{features} \mid \text{class}) \, p(\text{class})}{p(\text{features})}$  (6)
Furthermore, according to the total probability theorem, the denominator on the right side can be calculated by the following equation:
$P(B) = \sum_{i=1}^{n} P(A_i) \, P(B \mid A_i)$  (7)
Finally, the probability belonging to a certain category is calculated, and the maximum value is taken as the classification result.
There are three forms of the naive Bayes classifier, distinguished by whether the feature vector is continuous or discrete. Bernoulli naive Bayes applies when the feature values follow a Bernoulli, that is, a two-valued, distribution. Multinomial naive Bayes is suitable for discrete feature vectors following multinomial distributions. The last form, Gaussian naive Bayes, is used when the feature values are continuous variables that follow, or approximately follow, a normal distribution. The S.M.A.R.T. features used in this paper are continuous variables, so a Gaussian naive Bayes classifier is used in the experiments.
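A minimal scikit-learn sketch of the Gaussian variant follows; the synthetic data stand in for the five standardized S.M.A.R.T. features and are purely illustrative.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # stand-ins for 5 S.M.A.R.T. features
y = (X[:, 0] > 0.5).astype(int)      # illustrative binary labels

gnb = GaussianNB().fit(X, y)         # fits a per-class Gaussian to each feature
print(gnb.predict_proba(X[:3]))      # posterior probabilities via Bayes' rule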

3.2. Strong Learners

3.2.1. Random Forest

Random forest is a representative ensemble learning algorithm of the bagging family. As its name implies, it uses decision trees as base learners and constructs several (usually hundreds of) decision trees using sampling with replacement. When classifying, a test sample obtains a result on every constructed decision tree, and its category is determined by a “voting” strategy, that is, by taking the mode. Each run of the random forest algorithm randomly selects different samples to build different decision trees, and during tree construction, the features considered and the split nodes are also chosen randomly. Such high randomness helps avoid overfitting and improves the performance of the algorithm.

3.2.2. GBDT

GBDT (gradient-boosting decision tree) is a representative ensemble learning algorithm of the boosting family [19]. It also uses the decision tree as its base learner but, unlike the bagging strategy (whose base learners run in parallel), it operates serially, improving and updating the model over successive iterations. GBDT focuses on the residual produced during training, that is, the gap between the current model's predictions and the targets. Over the serial sequence of decision trees, an error function constructed from the gradient is gradually optimized so that the overall model reaches a higher level of prediction.

3.2.3. XGBoost

XGBoost (extreme gradient boosting) is an improved version of GBDT with the same basic idea. Unlike GBDT, which uses a first-order Taylor expansion of the loss function, XGBoost uses a second-order Taylor expansion, which improves the accuracy of the loss approximation and makes it easier to define custom loss functions. In addition, XGBoost introduces a regularization term in the objective function to avoid overfitting. Whereas GBDT can only use CART (classification and regression trees) as its base learner, XGBoost can also use a linear classifier, which makes it more flexible.
XGBoost also pre-sorts the features and stores them in a dedicated cache, so that the required features can be extracted directly from the cache when the tree structure is built. This allows the computation to run in parallel, at the cost of increased space complexity. In general, XGBoost is an engineering implementation of GBDT, a relatively complete method with many optimizations over the original algorithm.
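For reference, the following sketch instantiates the xgboost library's classifier with the settings listed in Table A1 (n_estimators = 1100, learning_rate = 0.01, max_depth = 5); other hyperparameters, such as the L2 regularization weight, are left at their library defaults here, and the commented fit/predict calls assume standard feature and label arrays.

from xgboost import XGBClassifier

# Table A1 settings; reg_lambda (the regularization term mentioned
# above) keeps its default value in this sketch.
xgb = XGBClassifier(n_estimators=1100, learning_rate=0.01, max_depth=5)
# xgb.fit(X_train, y_train); xgb.predict(X_test)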

3.2.4. AdaBoost

AdaBoost (adaptive boosting), proposed in 1997, was the first practical ensemble learning algorithm based on the boosting strategy [20]. Its basic idea is to increase the weight of misclassified samples and correspondingly decrease the weight of correctly classified samples in each serial iteration. This iterative scheme makes the model pay more attention to the samples that are misclassified during training, which reduces the error and improves the performance of the model.
Generally speaking, AdaBoost also uses the decision tree model as the base learner for training. This training strategy can significantly improve the training level and efficiency of the base learner without causing overfitting. However, the disadvantage of this method is that it will over-amplify the importance of some anomalous points, or noise, so it is greatly affected by noise.

3.3. BP Neural Network

The BP (back-propagation) neural network is an optimized version of the MLP (multilayer perceptron), created by adding a backward propagation step that updates the parameters of the MLP.
As shown in Figure 2, a BP neural network has three kinds of layers: an input layer, hidden layers, and an output layer, where multiple hidden layers can be placed in series. The initial weights and biases can be generated from a random seed. The input features are first propagated forward through the linear transformations of the hidden layers and activated by tanh and sigmoid activation functions. The error between the output and the true value (e.g., the MSE) is then calculated, and gradient descent is used to update the weights and biases, completing one backward propagation. After n iterations of this forward and backward propagation process, the error can be reduced to a very small value, and the final parameters predict the results well.
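The following PyTorch sketch mirrors this description: a 5-6-1 network (the sizes in Table A2) with a tanh hidden layer, a sigmoid output, MSE loss, and gradient descent updates. The optimizer choice and the synthetic batch are our assumptions.

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(5, 6), nn.Tanh(),     # hidden layer, forward propagation
    nn.Linear(6, 1), nn.Sigmoid(),  # output mapped into (0, 1)
)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x = torch.randn(32, 5)                     # stand-in batch of 5 features
y = torch.randint(0, 2, (32, 1)).float()   # stand-in labels
for _ in range(100):                       # the paper trains ~10,000 epochs
    opt.zero_grad()
    loss = loss_fn(net(x), y)              # forward pass and error
    loss.backward()                        # backward propagation
    opt.step()                             # weight and bias update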

3.4. LSTM

LSTM (long short-term memory) is a variant of the RNN (recurrent neural network) that addresses the problems of gradient explosion and gradient vanishing. LSTM was first proposed in 1997 [21] but was not widely used and further improved until the 21st century, when machine learning developed rapidly.
Although many advanced neural networks have been proposed in the last decade, such as the Transformer and its many variants, LSTM is still favored because of its excellent learning ability and relatively simple structure. To this day, LSTM remains a primary choice for processing time series data. Compared with a plain RNN, LSTM introduces several gated units, such as the forget gate, input gate, and output gate, to process time series data [22,23].
The structure of an LSTM unit is shown in Figure 3. Different from RNN, which only has one transfer state per cycle, LSTM adds the concept of cell state, which can store useful information of previous training data for a long time, while the hidden state pays more attention to recent information. Therefore, each LSTM cell has three inputs, including data at the current time point, cell status, and hidden status at the previous time point, and two outputs, namely cell status and hidden status at the current time point. It is precisely because of these characteristic gated units that LSTM is able to process data with a long time span, screen out useful information, and discard useless information so as to occupy a small amount of resources and process a large amount of data.
Each gate in Figure 3 receives the same inputs: the combination of the hidden state from the previous time point and the input value at the current time point (spliced together if the data are in matrix form), along with the cell state from the previous time point. The leftmost gate, called the forget gate, selects the information to be forgotten, that is, the information that is of little use for training. Its calculation equation is as follows:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$  (8)
where $\sigma$ is the sigmoid activation function, which maps values to the range (0, 1), and $W_f$ and $b_f$ are the weight and bias of the forget gate, respectively.
To the right of the forget gate is the input gate. In contrast to the forget gate, the role of this gated unit is to compute the information to be retained. The calculation equations are as follows:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$  (9)
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$  (10)
where $i_t$ controls which information is retained, while $\tilde{C}_t$ uses a tanh activation function to compute the candidate input information at this time point.
After the calculation of the above three equations, the cell state can be updated. The calculation equation is as follows:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$  (11)
This equation shows that the cell state update at each time point has two steps. First, the output of the forget gate is multiplied pointwise with the cell state of the previous time point to filter out useless past information. Second, the two outputs of the input gate are multiplied to obtain the information to be emphasized at this time point. The results of these two steps are then added, thereby forgetting old, useless information and adding new information.
The updated cell state is the first output of the LSTM, and the second output is the hidden state $h_t$ of this time point. Its calculation also has two parts:
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$  (12)
$h_t = o_t * \tanh(C_t)$  (13)
Here, $h_t$ is the second output of the cell, containing the filtered useful information from previous time points together with the information of the current point.
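To connect Equations (8)-(13) to code, the following PyTorch sketch computes one LSTM step directly from the gate equations. It is a didactic single-step implementation under assumed tensor shapes, not the training code used in the experiments.

import torch

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps the concatenation [h_{t-1}, x_t] to the four gate
    # pre-activations; b holds the corresponding biases.
    z = torch.cat([h_prev, x_t], dim=-1) @ W + b
    f, i, g, o = z.chunk(4, dim=-1)
    f = torch.sigmoid(f)              # forget gate, Equation (8)
    i = torch.sigmoid(i)              # input gate, Equation (9)
    g = torch.tanh(g)                 # candidate state, Equation (10)
    o = torch.sigmoid(o)              # output gate, Equation (12)
    c_t = f * c_prev + i * g          # cell state update, Equation (11)
    h_t = o * torch.tanh(c_t)         # hidden state, Equation (13)
    return h_t, c_t                   # the two outputs of the cell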

3.5. Method Framework

The experimental structure designed based on the above classifiers and blending ensemble learning method is shown in Figure 4.
In this paper, two classes of blending ensemble learning frameworks are constructed using the structure shown in Figure 4 and the classification algorithms described above. The first uses random forest, GBDT, XGBoost, and AdaBoost as the first layer of base learners, and the second uses logistic regression, k-nearest neighbors, a support-vector machine, and a Gaussian naive Bayes classifier as the first layer of base learners. Their prediction results are then used as new features to build new training and test sets. A BP neural network is used as the second layer to train and test the new datasets, and the final results are compared with those of the traditional methods. The base learners are divided into two categories because the first type is highly efficient and achieves a high level of classification performance, while the second type consists of relatively weak classifiers. Using two groups of base learners with a large performance gap better highlights the effectiveness of the blending ensemble learning method.
In addition, based on the intermediate results of the experiments, we also replace the base learner with the least impact on the result of the ensemble model in each group with an LSTM network, both to further verify the effectiveness of the blending method and to seek better model performance. The performance of XGBoost is highly similar to that of GBDT, and the difference between the two algorithms in this experiment is small, so XGBoost was replaced. In the other group, the results of the Gaussian naive Bayes classifier were so unstable that we decided to replace it and compare the experimental results. A sketch of the two groups of base learners follows.
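As a sketch, the two first-layer groups can be expressed as scikit-learn estimator lists and passed to the blend_fit_predict sketch given after Algorithm 1; hyperparameters are omitted here, and the LSTM substitution described above is not shown.

from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

group_a = [RandomForestClassifier(), GradientBoostingClassifier(),
           XGBClassifier(), AdaBoostClassifier()]   # strong learners
group_b = [LogisticRegression(), KNeighborsClassifier(),
           SVC(probability=True),                   # enables predict_proba
           GaussianNB()]                            # weak learners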

4. Experiment

4.1. Dataset

The dataset used in the experiments comes from the public hard disk dataset on the BackBlaze official website. BackBlaze has been compiling and reviewing hard disk data since 2013, posting quarterly and year-end statistics on hard disk usage from various manufacturers and comparing them with previous years to draw conclusions. They also make the original hard disk datasets available for viewing and research.
The data items included in the BackBlaze hard disk dataset are shown in Table 2. The date, serial number, model, capacity, and failure status of a hard disk are always present and have the same meaning for all hard disks, while the subsequent S.M.A.R.T. feature sets vary by hard disk model. Because hard disk manufacturers define and emphasize their own S.M.A.R.T. features differently, it is normal for the same S.M.A.R.T. feature to have different meanings for hard disks of different manufacturers. Therefore, on the premise that there should not be too few failed samples and that the disk failure rate should be within the normal range, this experiment selects more than six years of data, from 2016 to 2022, for the Seagate ST4000DM000 hard disk. Statistics for some of these years are shown in Table 3.
Since the year 2022 had not yet ended at the time of the experiments, only the data of the first two quarters, from January to June 2022, were collected, comprising 268 failed samples; the annual failure rate for that year is therefore not yet known.
As can be seen from the statistics, the annual failure rate of the ST4000DM000 disks in these six years is basically around 2%, whereas the annual failure rate of hard disks under normal circumstances is between 0.5% and 2%. This model was therefore chosen because its failure rate is consistently high, the failed samples are abundant, and the amount of data meets our requirements.

4.2. Data Preprocessing

Because an excess of healthy samples does not play a significant role in failure prediction, the experimental data consist of all failed samples together with the samples from the 14 days before each failure, so the ratio of healthy samples to failed samples is 14:1. A sketch of this selection rule follows.
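The following pandas sketch illustrates the sampling rule: for each failed drive, the failure-day record and the preceding 14 days are kept. The file name and the exact filtering code are our assumptions; only the column names follow the BackBlaze schema.

import pandas as pd

df = pd.read_csv("2021-01-01.csv", parse_dates=["date"])   # illustrative file
failed = df.loc[df["failure"] == 1, ["serial_number", "date"]]
kept = []
for sn, fail_date in failed.itertuples(index=False):
    window = (df["serial_number"] == sn) & \
             (df["date"] >= fail_date - pd.Timedelta(days=14)) & \
             (df["date"] <= fail_date)
    kept.append(df[window])
samples = pd.concat(kept, ignore_index=True)   # 14 healthy rows per failed row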

4.2.1. Feature Selection

Each S.M.A.R.T. feature in the dataset is provided as both a raw value and a normalized value. Since we do not know the standard by which each hard disk manufacturer normalizes its features, we discard the normalized values, retain the raw values, and normalize them according to our own standard.
After the data for disk model ST4000DM000 were screened out, all normalized values and all completely empty feature columns were first removed (because the statistics of all disk models are collected together, features present for one model but absent for another are entirely empty for the latter). Next, we standardized the data in blocks (detailed in Section 4.2.2) and deleted the all-zero columns after standardization. Because the data are collected quarterly, the intersection of the features of the four quarters was taken after processing the data separately, leaving 15 features. Violin plots and box plots of these 15 features were drawn according to failure status (the two kinds of plots convey the same information), and the 10 most distinguishable features were selected. Finally, to prevent feature redundancy, a heat map of these 10 features was drawn; one or more features were selected from each group of strongly correlated features, and five features were finally used in the experiments.
In essence, the violin plot and box plot distinguish each feature numerically across the data categories and mark boundary statistics, such as the median and the two quartiles, of each feature for each category. By visualizing these data, the numerical differences between samples of different categories can easily be detected. The difference between the two charts is that the violin plot reflects the distribution of values, while the box plot better distinguishes the quantiles. As shown in Figure 5, all features are clustered between 0 and 1 because they were normalized prior to plotting. Based on these two plots, we can further select, from these 15 features, the 10 features that best separate healthy and failed samples numerically.
Figure 6 shows the heat map drawn with the 15 preliminarily screened features; the color of each cell represents the degree of correlation between the corresponding horizontal and vertical features. The darker the cell, the stronger the correlation between the two features, and vice versa. Taking features No. 240, 241, and 242 as an example, they are highly correlated with each other, indicating that they carry essentially the same information. The purpose of the heat map is to select only one or two features from each highly correlated group for the final selection, so as to avoid wasting resources during training due to feature redundancy.
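The redundancy check behind the heat map can be sketched as follows: compute the absolute feature correlation matrix and flag pairs above a threshold so that only one feature per correlated group is kept. The threshold and the synthetic columns are illustrative.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
features = pd.DataFrame({
    "smart_240_raw": base,                                # mimic the correlated trio
    "smart_241_raw": base + 0.01 * rng.normal(size=200),
    "smart_1_raw": rng.normal(size=200),
})
corr = features.corr().abs()                              # basis of the heat map
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.9]
print(redundant)   # e.g. [('smart_240_raw', 'smart_241_raw')] -> keep only one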
The specific meanings of the five S.M.A.R.T. features selected after the above processing are as follows:
  • S.M.A.R.T. 1: Raw read error rate. This attribute refers to the number of errors that occur when the head reads data on the disk surface. It is generally 0. This attribute may be a large value for Seagate hard disks, but it is normal and is fine as long as it does not increase any more.
  • S.M.A.R.T. 7: Seek error rate. This attribute represents the number of errors generated by the head in the seek process. Generally, it is 0. There are many factors that can cause this value to rise, and they are often hardware-level reasons, including disk surface media problems and temperature anomalies. Similar to the No. 1 feature, for Seagate, this feature of the new hard disk may also be a large value, which will normally decrease in the future.
  • S.M.A.R.T. 187: Reported uncorrectable errors. A parameter exclusive to Seagate hard disks; as the name suggests, it represents the number of errors reported to the operating system that cannot be corrected by the hardware. If this value is non-zero, the user should back up the hard disk data.
  • S.M.A.R.T. 198: Offline uncorrectable sector count. This feature is the cumulative count of uncorrectable errors in the read/write sectors. The value should be 0. If this parameter is non-zero and rising, a mechanical problem has occurred and a sector has been damaged. If a file occupies the damaged sector, the operating system will return a disk read error and remap the sector the next time it is written to.
  • S.M.A.R.T. 240: Head flying hours. This parameter increases with time, like the power-on time and the total number of reads and writes; it can be treated as an approximation of the hard disk's age and has no other special significance.

4.2.2. Standardization

Although the S.M.A.R.T. features of hard disks of the same model are identical, the value ranges of same-model hard disks with different serial numbers vary greatly. If the data are normalized in the conventional way, they lose their original characteristics. In our earlier experiments, the training effect was very poor when the features of all the data were standardized together.
In view of the above problems, the data in this experiment were standardized in blocks, that is, the time series data of each hard disk were processed separately, and finally all the standardized data were spliced together. This standardization method has been proven effective in previous work [2].
As shown in Figure 7, assume that a feature column contains data for several hard disks, each covering a time window of four records. Taking the data of two hard disks as an example, the left panel shows conventional standardization, in which all data are standardized together, leaving a significant gap between high and low values (the gap is even larger in real data). The right panel shows block standardization: the data of each hard disk are standardized separately, so the results fall within a stable range and reflect the trend of a single hard disk's features over its time window. This is more conducive to data processing, and the model also learns better.
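Block standardization can be written as a per-drive z-score with a pandas group-by, as in the sketch below; the column names are assumed from the BackBlaze schema. Features that are constant for a given drive yield NaN here and must be handled separately, consistent with the column removal described in Section 4.2.1.

import pandas as pd

def block_standardize(df, feature_cols):
    # z-score each drive's time series separately (grouped by serial
    # number) instead of standardizing whole columns at once.
    g = df.groupby("serial_number")[feature_cols]
    return (df[feature_cols] - g.transform("mean")) / g.transform("std")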

4.3. Evaluation Criteria

For the imbalanced data of normal and abnormal samples, the Matthews correlation coefficient is a more appropriate choice than other evaluation criteria, including accuracy, recall, and F1 score [6].
The Matthews correlation coefficient takes full account of true positives, false positives, true negatives, and false negatives to obtain a well-balanced index. Its value ranges from −1 to 1: a value of −1 means the predictions are the exact opposite of the actual values, a value of 0 means the predictions are no better than random guessing, and the closer the value is to 1, the better the prediction. A value of 1 denotes a perfect classifier, although a perfect classifier does not exist in practice. The Matthews correlation coefficient is calculated as follows:
$MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$  (14)
where TP, FP, FN, and TN, respectively, represent true-positive, false-positive, false-negative, and true-negative values.
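Equation (14) is straightforward to compute from a confusion matrix, as in the sketch below; scikit-learn's matthews_corrcoef gives the same value from label arrays and can serve as a cross-check. The counts in the usage line are illustrative.

import numpy as np

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient, Equation (14); returns 0 when
    # the denominator vanishes, a common convention.
    num = tp * tn - fp * fn
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return num / den if den else 0.0

print(mcc(tp=50, tn=700, fp=5, fn=10))   # illustrative counts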

4.4. Experiments and Results

The experiments were run on Windows 10 with an Intel Core i5-11260H CPU @ 2.60 GHz, 16 GB of RAM, and an NVIDIA GeForce RTX 3050 laptop GPU. The deep learning framework used is PyTorch 1.10.2.
We used the failed samples of the Seagate ST4000DM000 hard disks and the samples from the 14 days before each failure, with a time span from 2016 to 2022. A total of 12 sets of experiments with non-overlapping data were constructed to explore the results under different ratios of training set to test set. After the same standardization of the data, random forest, GBDT, XGBoost, and AdaBoost classifiers were used as the first layer and a BP neural network as the second layer; this setup is recorded as blending group A. Logistic regression, k-nearest neighbors, a support-vector machine, and a Gaussian naive Bayes classifier were used as the first layer and a BP neural network as the second layer; this setup is recorded as blending group B. The results of these two groups of experiments were compared with those obtained using traditional methods.
Beforehand, in order to further verify the validity of our experimental results, we tried to reproduce a three-layer stacked LSTM model [6] for comparison, but as shown in Figure 8, its loss function was extremely unstable, and the oscillation was alleviated only when the network was reduced to two layers.
Therefore, we modified it. With only one LSTM layer and one dropout layer, the loss curve behaved much more normally (as shown in Figure 9), and the test results were better than those of the complex model. We speculate that the form of the data after standardization was too simple, causing the complex model to exhibit severe oscillation peaks and reduced performance; the simpler model was therefore deemed more suitable for this kind of data.
In addition, to further verify the rationality of the blending ensemble learning method, we added LSTM to the base learners to seek better results. We set up two additional experimental groups by removing, from the base learners of the two original blending groups, the classifier with the least stable classification effect or the least impact on the final result and replacing it with LSTM, thus obtaining blending group C and blending group D, respectively. Based on the observation and analysis of the intermediate results, we replaced XGBoost in the base learners of blending group A with LSTM and the Gaussian naive Bayes classifier in the base learners of blending group B with LSTM, thus obtaining two new ensemble learning models. Table A1 and Table A2 in the Appendix give the parameter values of some algorithms; the algorithms not mentioned use default parameters.
Once everything was ready, the experiments were carried out. The processed datasets of each year were divided into 12 groups, and experiments were repeated and their results averaged to make them more convincing. The experimental results are shown in Table 4 and Figure 10.
From the experimental results, we can see that the blending ensemble learning method significantly improves the evaluation criterion. In most experimental groups, the LSTM model outperforms the BP neural network used alone but not the blending ensemble learning methods. In addition, the results in Table 4 show that, in most cases, adding LSTM to the base learners significantly improves performance. The line chart in Figure 11 shows this improvement more clearly, proving the effectiveness of the method. However, it should be noted that in a few individual experimental groups the LSTM model alone has the best effect. We conjecture that, with this dataset processing, a training-to-test ratio that is too large or too small degrades the performance of the LSTM model or makes the results unreliable. Because the total amount of data is fixed, when the proportion of the training set is too large, the test set is too small to make the test results persuasive; conversely, when the proportion of the test set is large, too little training data leads to insufficient model training. When the ratio is between 1:1 and 1:0.5, the performance of the LSTM network is maximized, and around this ratio the model behaves more in line with expectations. However, this conclusion is only a conjecture and needs to be verified in subsequent experiments.
In addition, compared with using the BP neural network directly, the blending ensemble learning method also greatly reduces the running time and improves the efficiency of training and testing. Table 5 shows the training time of each model in some experimental groups. The BP neural networks used are all trained for 10,000 epochs.
The reason for this gap in training time is that the new training set used in the blending ensemble learning method is smaller. Because it is derived from a part of the original training set, its size is actually the size of the split validation set, and the complexity of the new training and test sets is also reduced.
Finally, to test the universality of our method, we counted data for six different models of hard disks. In addition to using Seagate hard disks from the same manufacturer as the training data, we counted hard disks from two different hard disk manufacturers, Hitachi and Toshiba. The statistics for the six groups of hard disks used for testing are shown in Table 6. These six different models of hard disks cover three manufacturers and have capacities of 4TB, 8TB, 12TB and 16TB, and the number of failed hard disks counted ranges from 79 to nearly 1000.
The datasets of the six new models counted were feature-selected and preprocessed in the same way, and all the hard disk data of model ST4000DM000 used for the experiments above were used as the training set to test the six new datasets in Table 6. The results obtained are shown in Table 7.
As can be seen from Table 7, the ensemble model generalizes well, while traditional machine learning and deep learning models, such as random forest, the BP neural network, and LSTM, show poor robustness by comparison. The most obvious improvement from adding LSTM to the blending method appears in the experiment on the Hitachi hard disk. It is worth mentioning that the S.M.A.R.T. feature set selected for the Hitachi hard disks differs considerably from that of Seagate, so the initial performance is poor. However, since the vast majority of S.M.A.R.T. features are the same across hard disk models, aligning the features using the previously obtained heat maps (as shown in Figure 6) can greatly improve the test results. Therefore, when testing other models of hard disks, the most important step is to align the highly correlated features among the selected features, after which good results can be obtained.
Taking all the above experimental results into account, the blending ensemble learning method uses a smaller training set to achieve better results, and the running time is greatly reduced. In addition, experiments on other hard disk models, from both the same and different manufacturers, showed that the blending ensemble learning method does not overfit and has good universality. In summary, these experimental results prove the rationality and effectiveness of the method.

4.5. Looking for the Best Match

In the above, we have proved that the blending ensemble learning method can improve the performance of the model. Next, we aim to find a combination of base learners that can achieve good performance on many hard disk models.
Since each base learner has different detection strengths and accuracies for positive and negative samples, the ensemble learning method combines the performance of the base learners, which complement one another to yield a more stable and efficient model. Taking the hard disk model ST8000DM002 as an example, the numbers of false positives and false negatives in the results of each base learner are shown in Table 8.
From Table 8, we can see that the random forest method has the best and a very high anomaly detection accuracy, but its disadvantage is that the number of false negatives is too high. Therefore, we take random forest as one of the base learners and then select the three classification models with the lowest false negative rates, namely LSTM, GBDT, and SVM, as the other three base learners. Using these four models to form a new ensemble learning model on the test set, the numbers of false positives and false negatives are 9 and 23, respectively, and the Matthews correlation coefficient is 0.9723992. Compared with the random forest model alone, although the number of false positives increases slightly, the number of false negatives is greatly reduced, and the overall performance exceeds that of each individual model. The experimental results confirmed our idea, and the following experiments continued along these lines.
According to the above verification idea, experiments were carried out on the other five models of hard disks, and Table 9 was obtained.
By observing the confusion matrices obtained by each method on each hard disk model, we found that their performance biases are basically stable; logistic regression and KNN almost always produce a large number of false positives and false negatives, so we excluded these two classifiers from the experiments. We tested multiple ensemble combinations based on these statistics, and the results are shown in Table 9. The combination of base learners in the last column achieved good results on all six hard disk datasets. It is worth noting that AdaBoost achieved high performance on the Hitachi hard disk HMS5C4040BLE640 that other methods could not match, so the combinations without AdaBoost performed poorly on this hard disk model. In addition, for the experiments on the Toshiba hard disk MG07ACA14TA, the results of the combinations differed by only one or two samples. Therefore, on the whole, the combination of random forest, SVM, AdaBoost, and LSTM gave the best results on most models, and the second-best results were similar.

5. Conclusions and Prospect

In this paper, we proposed a new failure prediction method based on blending ensemble learning, combining machine learning and deep learning networks to solve the problem of hard disk failure prediction, and proved its rationality and effectiveness through several groups of experiments. In the preliminary experiments, we constructed several blending ensemble learning models according to the performance of the base learners and carried out more than ten groups of experiments on the public dataset collected by the BackBlaze Company. The experimental results were compared with those of traditional methods used alone, including a BP neural network, random forest, and LSTM; with smaller training sets and less training time, better results were obtained. Next, we conducted experiments on generalization.
  • We used violin plots and heat maps to screen features and aligned features according to their correlations in the heat map, which greatly improved the generalization capability of the model.
  • In order to find a model that achieves good results on all models of hard disk datasets, we tried various combinations of base learners after studying the confusion matrices obtained by each base learner on each dataset and finally found the ensemble model with the best generalization capability.
The next step of the current plan is to try to predict the remaining life of a hard disk, rather than simply classify the hard disk data as healthy or failed as a binary classification problem. In addition to this, trying to experiment with more advanced models is also the top priority of future work.

Author Contributions

Conceptualization, M.Z. and P.L.; methodology, M.Z.; software, W.G.; resources, M.Z.; data curation, M.Z.; writing—original draft preparation, M.Z.; writing—review and editing, M.Z. and P.L.; supervision, P.L. and R.T.; project administration, M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://www.backblaze.com/.

Acknowledgments

We are grateful to the anonymous reviewers for comments on the original manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Main parameter values used in some machine learning methods.

              n_Estimators   Max_Leaf_Nodes   Min_Samples_Split   Learning_Rate   Min_Samples_Leaf   Max_Depth
RF            10             6                12                  -               -                  -
GBDT          400            -                2                   0.01            4                  3
AdaBoost      170            -                -                   0.2             -                  2
XGBoost       1100           -                -                   0.01            -                  5

Table A2. Main parameter values used in deep learning methods.

        n_Input   n_Hidden   n_Output   Epoch    Learning_Rate
BPNN    5         6          1          10,001   0.1
LSTM    5         12         1          20,001   0.005

Those marked with '-' in the above two tables indicate that the algorithm does not have this parameter or uses the default value.

References

  1. Hu, L.; Han, L.; Xu, Z.; Jiang, T.; Qi, H. A disk failure prediction method based on LSTM network due to its individual specificity. Procedia Comput. Sci. 2020, 176, 791–799.
  2. Coursey, A.; Nath, G.; Prabhu, S.; Sengupta, S. Remaining Useful Life Estimation of Hard Disk Drives using Bidirectional LSTM Networks. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 4832–4841.
  3. Hughes, G.F.; Murray, J.F.; Kreutz-Delgado, K.; Elkan, C. Improved disk-drive failure warnings. IEEE Trans. Reliab. 2002, 51, 350–357.
  4. Wang, Y.; Jiang, S.; He, L.; Peng, Y.; Chow, T.W. Hard Disk Drives Failure Detection Using a Dynamic Tracking Method. In Proceedings of the 2019 IEEE 17th International Conference on Industrial Informatics (INDIN), Helsinki, Finland, 22–25 July 2019.
  5. Yang, Q.; Jia, X.; Li, X.; Feng, J.; Lee, J. Evaluating Feature Selection and Anomaly Detection Methods of Hard Drive Failure Prediction. IEEE Trans. Reliab. 2020, 70, 749–760.
  6. Cahyadi; Forshaw, M. Hard Disk Failure Prediction on Highly Imbalanced Data using LSTM Network. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 3985–3991.
  7. Burrello, A.; Pagliari, D.J.; Bartolini, A.; Benini, L.; Macii, E.; Poncino, M. Predicting Hard Disk Failures in Data Centers Using Temporal Convolutional Neural Networks. In Proceedings of the Euro-Par 2020: Parallel Processing Workshops, Warsaw, Poland, 24–25 August 2020; Springer International Publishing: Cham, Switzerland, 2021; pp. 277–289.
  8. Luo, C.; Zhao, P.; Qiao, B.; Wu, Y.; Zhang, H.; Wu, W.; Lu, W.; Dang, Y.; Rajmohan, S.; Lin, Q.; et al. NTAM: Neighborhood-Temporal Attention Model for Disk Failure Prediction in Cloud Platforms. In Proceedings of the WWW ’21: The Web Conference 2021, Virtual Event/Ljubljana, Slovenia, 19–23 April 2021; pp. 1181–1191.
  9. Lu, S.; Luo, B.; Patel, T.; Yao, Y.; Tiwari, D.; Shi, W. Making Disk Failure Predictions SMARTer. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST ’20), Santa Clara, CA, USA, 25–27 February 2020.
  10. Hai, Q.; Zhang, S.; Liu, C.; Han, G. Hard Disk Drive Failure Prediction Based on GRU Neural Network. In Proceedings of the 2022 IEEE/CIC International Conference on Communications in China (ICCC), Foshan, China, 11–13 August 2022; pp. 696–701.
  11. Ahmad, W.; Khan, S.A.; Kim, C.H.; Kim, J.M. Feature Selection for Improving Failure Detection in Hard Disk Drives Using a Genetic Algorithm and Significance Scores. Appl. Sci. 2020, 10, 3200.
  12. Kearns, M.; Valiant, L.G. Learning Boolean Formulae or Finite Automata is as Hard as Factoring; Technical Report TR-14-88; Harvard University Aiken Computation Laboratory: Cambridge, MA, USA, 1988.
  13. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
  14. Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844.
  15. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  16. Schapire, R.E. The strength of weak learnability. Mach. Learn. 1990, 5, 197–227.
  17. Wolpert, D.H. Stacked Generalization. Neural Netw. 1992, 5, 241–259.
  18. Tan, Y.; Chen, H.; Zhang, J.; Tang, R.; Liu, P. Early Risk Prediction of Diabetes Based on GA-Stacking. Appl. Sci. 2022, 12, 632.
  19. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232.
  20. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
  21. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
  22. Bai, A.; Chen, M.; Peng, S.; Han, G.; Yang, Z. Attention-Based Bidirectional LSTM With Differential Features for Disk RUL Prediction. In Proceedings of the 2022 IEEE 5th International Conference on Electronic Information and Communication Technology (ICEICT), Hefei, China, 21–23 August 2022; pp. 684–689.
  23. Wang, G.; Wang, Y.; Sun, X. Multi-Instance Deep Learning Based on Attention Mechanism for Failure Prediction of Unlabeled Hard Disk Drives. IEEE Trans. Instrum. Meas. 2021, 70, 3513509.
Figure 1. “Margin” in support-vector machines.
Figure 2. A Simple BP Neural Network Framework.
Figure 3. The structure of an LSTM cell.
Figure 4. Blending ensemble learning method framework.
Figure 5. (a) Violin plot based on 15 features preliminarily selected. (b) Box plot based on 15 features preliminarily selected.
Figure 6. Heat map based on 15 features preliminarily selected.
Figure 7. (a) Previous common standardization method. (b) Block standardization method.
Figure 8. (a) Loss function change curve of three-layer stacked LSTM model. (b) Loss function change curve of two-layer stacked LSTM model.
Figure 9. Loss function change curve of improved LSTM model.
Figure 10. Comparison histogram of experimental results of models under different ratios of training set to test set.
Figure 11. (a) Comparison of experimental results between blending group A and blending group C. (b) Comparison of experimental results between blending group B and blending group D.
Table 1. Performance of related work.

| Method | Precision | Recall | FAR | FDR | MCC | MAE |
|---|---|---|---|---|---|---|
| Mathematical statistical [3] | 40–60% | - | 0.2–0.5% | - | - | - |
| Switchable state-space degradation model [4] | - | - | 0.83% | 97.44% | - | - |
| LSTM [1] | - | - | 1.3% | 85% | - | - |
| LSTM [6] | - | - | - | - | 0.71 | - |
| Bi-LSTM [2] | - | - | - | - | - | 0.12 |
| TCN [7] | 91% | - | 0.05% | 89.1% | - | - |
| NTAM [8] | 84.01% | 76.43% | - | - | - | - |
| LSTM [9] | 65% | 89% | - | - | 0.74 | - |
| CNN-LSTM [9] | 95% | 95% | - | - | 0.95 | - |
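All of the criteria in Table 1 derive from the binary confusion matrix. For reference, a short sketch of the usual definitions; note that FDR here denotes the failure-detection rate (recall on failed disks), not the false-discovery rate, and the exact definitions may vary slightly between the cited papers:

```python
import math

def confusion_metrics(tp: int, fp: int, tn: int, fn: int):
    """Reference definitions for the Table 1 metrics (may vary by paper)."""
    precision = tp / (tp + fp)
    fdr = tp / (tp + fn)   # failure-detection rate: recall on failed disks
    far = fp / (fp + tn)   # false-alarm rate on healthy disks
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, fdr, far, mcc
```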
Table 2. Selected portion of the BackBlaze hard disk dataset.

| Date | Serial_Number | Model | Capacity_Bytes | Failure | Smart_1_Normalized | Smart_1_Raw |
|---|---|---|---|---|---|---|
| 2022/6/21 | ZLW18P9K | ST14000NM001G | 1.4 × 10^13 | 0 | 81 | 1.31 × 10^8 |
| 2022/6/21 | 71R0A0QBFV8G | TOSHIBA MG08ACA16TEY | 1.6 × 10^13 | 0 | 100 | 0 |
| 2022/6/21 | ZA1FLE1P | ST8000NM0055 | 8 × 10^12 | 0 | 71 | 12,339,896 |
| 2022/6/21 | ZA16NQJR | ST8000NM0055 | 8 × 10^12 | 0 | 79 | 74,303,920 |
| 2022/6/21 | 1050A084F97G | TOSHIBA MG07ACA14TA | 1.4 × 10^13 | 0 | 100 | 0 |
| 2022/6/21 | PL1331LAHEYUGH | HGST HMS5C4040BLE640 | 4 × 10^12 | 0 | 100 | 0 |
| 2022/6/21 | ZA130TTW | ST8000DM002 | 8 × 10^12 | 0 | 79 | 88,707,864 |
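BackBlaze publishes one such CSV per day, with one row per operating drive. A minimal loading sketch, assuming pandas; the file and directory names follow BackBlaze's date-based convention and should be adjusted to the local copy of the data:

```python
import pandas as pd

# One daily snapshot; path/name are illustrative of BackBlaze's convention.
df = pd.read_csv("data_Q2_2022/2022-06-21.csv")

id_cols = ["date", "serial_number", "model", "capacity_bytes", "failure"]
smart_cols = [c for c in df.columns if c.startswith("smart_")]
df = df[id_cols + smart_cols]

failed_today = df[df["failure"] == 1]   # drives recorded as failing this day
```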
Table 3. Statistics of ST4000DM000 hard disk.

| Year | Drive Count | Drive Failures | Annualized Failure Rate |
|---|---|---|---|
| 2016 | 34,738 | 938 | 2.7% |
| 2017 | 32,070 | 1017 | 3.17% |
| 2018 | 23,236 | 581 | 2.5% |
| 2019 | 19,211 | 402 | 2.09% |
| 2020 | 18,939 | 269 | 1.42% |
| 2021 | 18,611 | 339 | 1.82% |
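As a quick consistency check, the rates in Table 3 match the simple ratio of drive failures to drive count for each year (BackBlaze's official AFR is normalized by drive-days, but on these yearly aggregates the plain ratio reproduces the table):

```python
# Sketch: Table 3 rates recovered as failures / drive count.
afr_2016 = 938 / 34_738      # ~0.0270 -> 2.7%
afr_2017 = 1_017 / 32_070    # ~0.0317 -> 3.17%
```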
Table 4. Comparison of experimental results of models under different ratios of training set to test set (the values in bold are the best results under the same conditions).

| Ratio | Blending Group A | Blending Group C | Blending Group B | Blending Group D | BPNN | LSTM |
|---|---|---|---|---|---|---|
| 1:13.25 | 0.9709028 | 0.9706212 | **0.9736099** | 0.9498740 | 0.9713755 | 0.9484344 |
| 1:5.3 | 0.9696904 | 0.9702627 | **0.9751970** | 0.9649997 | 0.9734079 | 0.9616863 |
| 1:3.3 | 0.9736247 | **0.9739757** | 0.9456301 | 0.9586547 | 0.9705973 | 0.9717738 |
| 1:3.18 | **0.9586928** | 0.9586737 | 0.9294569 | 0.9445184 | 0.9078211 | 0.9357074 |
| 1:2 | 0.9629371 | 0.9701268 | 0.9718837 | **0.9768089** | 0.9662840 | 0.9688110 |
| 1:1.06 | 0.9656653 | 0.9656653 | 0.9697747 | **0.9716215** | 0.9599846 | 0.9641146 |
| 1:0.945 | 0.9697842 | 0.9747204 | 0.9660113 | 0.9765305 | 0.9549722 | **0.9790189** |
| 1:0.5 | 0.9792861 | **0.9817062** | 0.9559330 | 0.9751013 | 0.9549136 | 0.9814030 |
| 1:0.314 | 0.9432324 | 0.9434551 | **0.9667437** | **0.9667437** | 0.9404691 | 0.9453640 |
| 1:0.3 | **0.9805881** | 0.9776981 | 0.9605739 | 0.9772345 | 0.9503967 | 0.9791860 |
| 1:0.189 | 0.9713094 | **0.9731593** | 0.9518142 | 0.9611000 | 0.9380961 | 0.9702831 |
| 1:0.075 | **0.9670789** | 0.9617624 | 0.9434270 | 0.9534262 | 0.9357423 | 0.9633937 |
Table 5. Partial comparison of model training time (the values in bold are the shortest running times under the same conditions).

| Ratio | Blending Group A | Blending Group C | Blending Group B | Blending Group D | BPNN |
|---|---|---|---|---|---|
| 1:13.25 | 21 s 753 ms | 23 s 248 ms | **20 s 600 ms** | 21 s 857 ms | 135 s 844 ms |
| 1:1.06 | 182 s 267 ms | **172 s 410 ms** | 188 s 113 ms | 178 s 105 ms | 942 s 349 ms |
| 1:0.075 | 350 s 698 ms | 366 s 672 ms | 355 s 793 ms | **317 s 722 ms** | 1664 s 662 ms |
Table 6. Statistics of other models of hard disks used for testing.

| Model | Manufacturer | Capacity | Number of Failed Hard Disks |
|---|---|---|---|
| ST8000DM002 | Seagate | 8 TB | 613 |
| ST8000NM0055 | Seagate | 8 TB | 988 |
| ST12000NM0008 | Seagate | 12 TB | 661 |
| HMS5C4040BLE640 | Hitachi | 4 TB | 305 |
| MG07ACA14TA | Toshiba | 14 TB | 625 |
| MG08ACA16TE | Toshiba | 16 TB | 79 |
Table 7. Test results on different types of hard disk (the values in bold are the best results under the same conditions).

| Model | Blending Group A | Blending Group C | Blending Group B | Blending Group D | BPNN | LSTM | RF |
|---|---|---|---|---|---|---|---|
| ST8000DM002 | 0.9644412 | 0.9679243 | 0.9627051 | **0.9709060** | 0.9167915 | 0.9698113 | 0.9400783 |
| ST8000NM0055 | 0.9661928 | **0.9713085** | 0.9551134 | 0.9657260 | 0.9419506 | 0.9665092 | 0.9521076 |
| ST12000NM0008 | 0.9398109 | **0.9440028** | 0.9161423 | 0.9256142 | 0.8900800 | 0.9330701 | 0.9191321 |
| HMS5C4040BLE640 | 0.8307019 | **0.9761437** | 0.8081199 | 0.8075408 | 0.6542020 | 0.8029737 | 0.6737136 |
| MG07ACA14TA | **0.9838513** | 0.9829368 | 0.9763159 | 0.9792056 | 0.9733721 | 0.9828550 | 0.9805340 |
| MG08ACA16TE | **0.9211962** | 0.9185203 | **0.9211962** | 0.9190588 | 0.8851853 | 0.9159097 | 0.9105094 |
Table 8. Individual test results for each model on the hard disk of model ST8000DM002.

| | RF | AdaBoost | GBDT | LSTM | LR | SVM | KNN |
|---|---|---|---|---|---|---|---|
| False Positive | 7 | 22 | 16 | 12 | 46 | 10 | 31 |
| False Negative | 66 | 25 | 23 | 25 | 29 | 23 | 38 |
Table 9. Comparison of experimental results of several basic learner combinations on various hard disk models (the values in bold are the best results under the same conditions).

| Model | Best Result Before | RF + SVM + GBDT + LSTM | RF + SVM + AdaBoost | RF + SVM + AdaBoost + LSTM |
|---|---|---|---|---|
| ST8000DM002 | 0.9709060 | **0.9723992** | 0.9715661 | **0.9723992** |
| ST8000NM0055 | 0.9713085 | **0.9760211** | 0.9742720 | **0.9760211** |
| ST12000NM0008 | 0.9440028 | **0.9448954** | 0.9434789 | **0.9448954** |
| HMS5C4040BLE640 | **0.9761437** | 0.8327314 | 0.9576024 | 0.9735438 |
| MG07ACA14TA | **0.9838513** | 0.9826042 | 0.9821888 | 0.9826045 |
| MG08ACA16TE | **0.9211962** | **0.9211962** | 0.9158109 | **0.9211962** |
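To make the Table 9 combinations concrete, the following is a minimal sketch of the blending scheme of Figure 4 for the RF + SVM + AdaBoost variant, assuming scikit-learn with a logistic-regression meta-learner chosen purely for illustration; load_smart_features is a hypothetical placeholder for the preprocessing pipeline, and the LSTM base learner is omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical placeholder for the preprocessed S.M.A.R.T. features/labels.
X, y = load_smart_features()

# Blending: base learners are fit on one split; their held-out
# predictions become the training features of a meta-learner.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

bases = [RandomForestClassifier(), SVC(probability=True), AdaBoostClassifier()]
for base in bases:
    base.fit(X_fit, y_fit)

hold_preds = np.column_stack([b.predict_proba(X_hold)[:, 1] for b in bases])
meta = LogisticRegression().fit(hold_preds, y_hold)

def blend_predict(X_new):
    feats = np.column_stack([b.predict_proba(X_new)[:, 1] for b in bases])
    return meta.predict(feats)
```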
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
