Article

An Ensemble Framework to Improve the Accuracy of Prediction Using Clustered Random-Forest and Shrinkage Methods

by Zari Farhadi 1, Hossein Bevrani 1, Mohammad-Reza Feizi-Derakhshi 2,*, Wonjoon Kim 3,* and Muhammad Fazal Ijaz 4

1 Department of Statistics, Faculty of Mathematics, Statistics and Computer Sciences, University of Tabriz, Tabriz 51666, Iran
2 Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz 51666, Iran
3 Division of Future Convergence (HCI Science Major), Dongduk Women’s University, Seoul 02748, Korea
4 Department of Intelligent Mechatronics Engineering, Sejong University, Seoul 05006, Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(20), 10608; https://doi.org/10.3390/app122010608
Submission received: 10 September 2022 / Revised: 9 October 2022 / Accepted: 13 October 2022 / Published: 20 October 2022
(This article belongs to the Special Issue Decision Support Systems for Disease Detection and Diagnosis)

Abstract
Nowadays, in prediction problems, reducing computational time in addition to increasing the accuracy of existing algorithms is a challenging issue that has attracted much attention. Since existing methods may not be sufficiently efficient or accurate, we use a combination of machine-learning algorithms and statistical methods to address this problem. Furthermore, we reduce the computational time of the testing stage by automatically reducing the number of trees using penalized methods and ensembling the remaining trees. We call this efficient combinatorial method the “ensemble of clustered and penalized random forest (ECAPRAF)”. The method consists of four fundamental parts. In the first part, k-means clustering is used to identify homogeneous subsets of data and assign them to similar groups. In the second part, a tree-based algorithm is used within each cluster as a predictor model; in this work, random forest is selected. In the third part, penalized methods are used to reduce the number of random-forest trees and remove high-variance trees from the proposed model; this increases model accuracy and decreases the computational time in the test phase. In the last part, the remaining trees within each cluster are combined. The results on simulated data and two real datasets show that the proposed method outperforms the traditional random forest: for 300, 500, 800, and 1000 trees, the ECAPRAF–EN algorithm reduces the weighted MSE by approximately 12.75%, 11.82%, 12.93%, and 11.68% while selecting 99, 106, 113, and 118 trees, respectively.

1. Introduction

Ensemble learning is a powerful tool for classification and prediction that has been studied extensively in statistics and machine learning. These methods combine several algorithms to enhance model performance and increase prediction accuracy: several weak learners are combined according to a specific pattern to create a strong learner that performs better than any single learner. The most well-known ensemble-learning algorithms are bagging, boosting, and random forest (RF). Bagging [1] produces regression trees through random selection with replacement. The generated trees do not depend on previously grown trees; each one is independent of its peers. These deep trees grow without pruning, so each tree has high variance and low bias, and averaging them yields low variance. Random forest is a development of bagging that is used for both regression and classification. Some methods similar to RF discussed in the literature do not work as well as the RF proposed by Breiman [2]. Dietterich [3] proposed a method in which each node is randomly selected from the k best splits. Amit et al. [4] suggested the first random-selection algorithm based on the best separators, in which k trees are generated from a random vector. In the RF, a subset of predictors is selected as candidate separators at each node, and each node is grown and split by randomly selecting attributes.
The number of trees in the RF has always been a concern for researchers. A large number of trees does not always imply good prediction accuracy; in some cases, it increases the error, causes overfitting, and decreases accuracy. Khan et al. [5] proposed an RF-based method in which the trees with the lowest out-of-bag (OOB) error determine the optimal set of RF trees. To overcome the problem of a large number of trees, we propose a method in which shrinkage methods, which perform variable selection, are used to reduce the number of RF trees. This approach not only increases accuracy but also prevents overfitting: from all the RF trees contributing to the prediction, a subset of trees is selected using shrinkage methods. In [6], the post-selection boosting random-forest algorithm was suggested, which utilizes lasso regression to reduce the number of RF trees and improve the performance of the RF algorithm. On the other hand, clustering correlated data, identifying homogeneous subsets of predictors, and assigning them to similar clusters can be effective in increasing prediction accuracy; for this reason, k-means clustering is used in this paper.
Major Contribution: In the present paper, we propose a hybrid approach called the ensemble of clustered and penalized random forest (ECAPRAF), which is shown in Figure 1. This method has four fundamental parts: clustering, predicting, shrinking, and ensembling. The purpose of the clustering part is to identify homogeneous subsets of data and assign them to similar groups. Placing similar data in the same cluster reduces the variance and increases the correlation within clusters, and thereby increases the prediction accuracy of the resulting model; for this purpose, k-means clustering is used. The second part, the prediction algorithm, is responsible for producing the initial trees with low error. The main idea is to use a set of weak predictors whose initial decisions can be aggregated to obtain the final prediction; RF is used for this purpose. The reason for choosing RF as the prediction algorithm is that it produces low-variance trees that serve as the input of the shrinkage methods. The third part uses the shrinkage methods to reduce the number of RF trees within each cluster, so that the final prediction in each cluster is performed without a prior manual selection of trees; this increases the accuracy of the model compared with the traditional RF. In the last part, the remaining trees are ensembled to improve the power of the learners. This is performed using the weighted mean of the trees in the clusters. Averaging reduces the variance of each tree, and if one learner is weak, the other learners correct it or reduce the error. In summary, this method provides an efficient algorithm that increases prediction accuracy, improves the traditional RF algorithm, and automatically reduces the computational time and the number of RF trees within clusters.
Motivation: The purpose of this study is to combine statistical methods and machine-learning algorithms in order to increase accuracy and decrease computational time. This is achieved by applying shrinkage methods such as lasso, elastic net, and group lasso to random-forest trees and ensembling the remaining trees. In addition, the data are clustered and homogenized to reduce the variance and increase the correlation within the clusters, which helps create an optimal model and increase the prediction accuracy.
Paper Organization: The rest of the paper is organized as follows: Section 2 reviews previous work on ensemble learning, shrinkage methods, machine-learning algorithms, and their combinations. Section 3 briefly introduces the theory of random forest, the k-means algorithm, and the shrinkage methods. Section 4 describes the main framework and a summary of the proposed method. Section 5 presents a simulation study to evaluate the performance of the proposed model. In Section 6, the experimental results are analyzed and discussed based on two real datasets. Finally, Section 7 presents the discussion, and Section 8 presents the conclusions and future work.

2. Literature Review

In many articles, ensemble-learning-based methods are used for prediction and defect diagnosis. For example, in [7] and [8] they were used to find bridge defects and software defects, respectively. In [7], ensemble methods were used to predict bridge defect conditions to help bridge managers make more rational and informed steel-bridge-maintenance decisions. For this purpose, six ensemble-learning models, namely, random forest, ExtraTree, AdaBoost, GBDT, XGBoost, and LightGBM were used. In [8], random forest, ExtraTree, AdaBoost, gradient boosting, histogram-based gradient boosting, XGBoost and CatBoost methods were used for the prediction of software defects and the automatic identification of defective parts of the software.
Ensemble-learning algorithms are used not only in classical machine learning but also widely in deep learning. For instance, in [9], a combined model of CNN and SVM was presented, in which the SVM served as an ensemble method to aggregate the CNNs. In [10], six algorithms, namely, ANNS, ANNN, ridge regression, lasso regression, MLR, and elastic-net regression, were used to build a model for predicting rock tensile strength. In [11], the least absolute shrinkage and selection operator (lasso) and ridge regression, in conjunction with the logistic-regression (LR) method, were employed for feature selection. Then, classification algorithms such as KNN, random forest (RF), and logistic regression were used for predicting the results. This study presented a novel hybrid model for the diagnosis and prediction of liver cancer.
Researchers in various fields have recently focused on combining shrinkage methods and machine learning to improve the performance of traditional algorithms. In [12], clustering and coefficient estimation were performed simultaneously using the cluster correlation-network support vector machine, in which clusters were penalized by lasso, SCAD, and MCP. Previously, the combinations of elastic net and SVM [13], elastic net and RF [14], and SVM with lasso, ridge, and SCAD [15,16] were used. In [17,18], a combination of variable clustering and feature selection was also used to reduce the dimension; in addition, hierarchical clustering and selection of the main variables were used to evaluate the performance of RF. Tutz and Koch [19] enhanced nearest-neighbor classifiers by using selection methods such as lasso or boosting, in which the relevant nearest neighbors were selected automatically. Bouveyron and Brunet [20] proposed a method that adapts the traditional mixture model for modeling and classifying data in a latent discriminative subspace; to generate the proposed discriminative latent mixture (DLM) model, the model-based clustering goals and the discriminative criterion introduced by Fisher were combined. Farhadi et al. [21], by applying shrinkage methods such as lasso, ridge, and elastic net to simple linear regression, showed that elastic net performs better even when it is not combined with machine-learning algorithms. In the present paper, we show that elastic net combined with machine-learning algorithms also performs better than other shrinkage methods.

3. Materials and Methods

The proposed methodology consists of four main stages in total:
  • Clustering a dataset by employing the k-means clustering algorithm to identify homogeneous subsets of data and assign them to similar groups;
  • Using the RF algorithm within each cluster as a predictor;
  • Reducing the number of trees to increase model accuracy and decrease computational time in the test phase;
  • Ensembling the remaining trees within each cluster.
Figure 1 shows a flowchart of ECAPRAF. More details of the proposed model are explained in the next sections.

3.1. Prediction

3.1.1. K-Means Clustering

Cluster analysis has long been a central topic of data-mining studies and a useful tool in data science. It is a well-known approach for identifying homogeneous subsets of observations in a dataset. K-medoids, k-means, and other clustering algorithms are widely used in statistical analyses. K-means clustering is one of the most common and simplest unsupervised learning methods in machine learning and is used to solve various problems in statistics, computer science, genetics, and engineering. The algorithm partitions the dataset into K distinct, non-overlapping clusters, so that the data within a cluster have similar characteristics while data in different clusters have different characteristics. Observations are grouped based on similarity measured by the squared Euclidean distance. The input parameter of the k-means algorithm is the number of clusters, and its optimal value k is obtained here using the gap statistic [22] from the NbClust package [23]. After the number of clusters is specified, the k-means algorithm assigns each observation to exactly one of the k clusters. The centers of the clusters are determined by minimizing the residual sum of squares (RSS); in principle, k-means clustering is an optimization problem that minimizes the within-group objective function. Although both clustering and classification divide the dataset into classes, clustering, unlike classification, does not predict a response variable; it only divides the dataset into homogeneous groups.
Suppose the dataset $D = \{x_1, x_2, \dots, x_N\}$ contains $N$ points, each described by $p$ variables, and suppose the clusters obtained after applying k-means are $C = \{C_1, C_2, \dots, C_K\}$. The RSS objective function used to determine the clusters is defined as follows:
$$\mathrm{RSS} = \sum_{k=1}^{K} \sum_{i, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2 \qquad (1)$$
where $|C_k|$, $C_k$, $x_{ij}$, $i \in C_k$, and $K$ denote the number of observations in the k-th cluster, the k-th cluster, the j-th variable of the i-th observation, the membership of the i-th observation in the k-th cluster, and the number of clusters, respectively.
By minimizing Equation (1), the minimization problem of the k-means clustering is solved as follows:
$$\underset{C_1, C_2, \dots, C_K}{\mathrm{minimize}} \; \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2 \qquad (2)$$
The k-means clustering steps are as follows:
  • The selection of the number of clusters;
  • The initialization of the cluster centers;
  • Each observation is assigned to the closest cluster, where proximity is measured by the squared Euclidean distance from the cluster centroids;
  • The cluster centers are recomputed as the averages of the clusters obtained in the previous step (other methods, such as k-medoids and k-median, determine the cluster centers differently);
  • The two previous steps continue until the centers of the clusters do not change and the criterion of convergence is satisfied [24].
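As an illustration of this step, the R sketch below selects the number of clusters with the gap statistic from the NbClust package [23] and then runs k-means; the data matrix and all object names are placeholders rather than the paper's actual code.

```r
# Illustrative sketch: choose k with the gap statistic (NbClust) and run k-means.
library(NbClust)

set.seed(1)
X <- matrix(rnorm(500 * 4), ncol = 4)            # placeholder predictor matrix

# Gap statistic over candidate numbers of clusters
nb <- NbClust(X, distance = "euclidean", min.nc = 2, max.nc = 6,
              method = "kmeans", index = "gap")
k  <- as.integer(nb$Best.nc["Number_clusters"])  # suggested number of clusters

# Assign each observation to exactly one of the k clusters
km       <- kmeans(X, centers = k, nstart = 25)
clusters <- km$cluster
```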

3.1.2. Random-Forest Algorithm

Although the decision-tree approach suggested by Quinlan [25] is useful, it builds a very deep and complex tree that suffers from high variance and overfitting; therefore, it requires pruning techniques. Breiman [2] introduced the random-forest algorithm to reduce the overfitting of the decision tree. This reduction is achieved by building an ensemble of $M$ trees [26]. To obtain the $M$ decision trees in the RF algorithm, suppose $X_1, X_2, \dots, X_p$ are the explanatory variables and $y_1, y_2, \dots, y_N$ are the response values. At each split, a subset of features of size $m_{try} = p/3$ is selected at random. The predictor space is divided into $J$ distinct, non-overlapping regions $R_1, R_2, \dots, R_J$, and for every observation that falls into region $R_j$, the same prediction is made, namely the mean of the response values of the training observations in $R_j$. The splitting predictor and cut point are then selected, and a tree with minimum RSS is obtained. The aim of building the decision trees in the RF is to minimize
$$\mathrm{RSS}(j, s) = \sum_{i:\, x_i \in R_1} \left( y_i - \hat{y}_{R_1} \right)^2 + \sum_{i:\, x_i \in R_2} \left( y_i - \hat{y}_{R_2} \right)^2 \qquad (3)$$
where $R_1 = \{x_i \mid x_i \le s\}$ and $R_2 = \{x_i \mid x_i > s\}$; $\hat{y}_{R_1} = \frac{1}{N_1} \sum_{i:\, x_i \le s} y_i$ and $\hat{y}_{R_2} = \frac{1}{N_2} \sum_{i:\, x_i > s} y_i$ are the mean responses of the training observations in $R_1$ and $R_2$, respectively; $N_1$ and $N_2$ are the numbers of samples in $R_1$ and $R_2$, and $s$ is the cut point.
Figure 2 shows the random-forest algorithm, the steps of which can be expressed as follows:
  • Generate bootstrap datasets $D_1, D_2, \dots, D_M$ from the original dataset $D$;
  • Construct a tree from each bootstrap dataset (the trees grow without pruning; at each node, $m_{try}$ predictors are randomly selected as candidate separators);
  • Generate the $M$ trees $T_1, T_2, \dots, T_M$;
  • Extract the $M$ tree predictions $T_1(z), \dots, T_M(z)$;
  • Compute the final prediction over the whole $M$ regression trees as
$$\bar{y} = \frac{1}{M} \sum_{i=1}^{M} T_i(z) \qquad (4)$$
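The following R sketch reproduces these steps with the randomForest package used later in the paper; the data split and object names are illustrative, and extracting the per-tree predictions is the part that later serves as input for the shrinkage stage.

```r
library(randomForest)

set.seed(1)
train <- data.frame(matrix(rnorm(400 * 4), ncol = 4)); train$y <- rnorm(400)
test  <- data.frame(matrix(rnorm(100 * 4), ncol = 4))

# Grow M = 500 unpruned trees on bootstrap samples; for regression the default
# mtry is p/3, matching the text above
rf <- randomForest(y ~ ., data = train, ntree = 500)

# Per-tree predictions T_1(z), ..., T_M(z) on the test points
pred  <- predict(rf, newdata = test, predict.all = TRUE)
T_z   <- pred$individual      # n_test x M matrix of individual tree predictions
y_bar <- rowMeans(T_z)        # final RF prediction: the average over the M trees
```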

3.2. Shrinkage

3.2.1. Lasso Regression

Variable selection is a cornerstone of statistical learning and can be performed using shrinkage methods. The aim of these methods, such as lasso, elastic net, and group lasso, is to select a subset of the variables to build a new model. In this paper, these methods are used to reduce the number of RF trees in order to enhance the RF performance; in particular, when the number of RF trees exceeds the number of observations ($N_{tree} > N$), shrinkage methods can be used to reduce the number of trees.
Lasso is one of the most common variable-selection methods and was introduced by Tibshirani [27]. Lasso regression reduces the residual sum of squares and minimizes the prediction error subject to the constraint that the sum of the absolute values of the regression coefficients is less than a constant value $t$. The method is used for high-dimensional data as well as under multi-collinearity [28]. In the lasso variable-selection process, some coefficients may be estimated as exactly zero, and only the variables with non-zero coefficients stay in the model.
Lasso seeks the regression model with the minimum residual sum of squares, with the coefficients limited by a constraint on the regression model. This penalty depends on the tuning parameter $\lambda$. Suppose $(X, Y)$ is a dataset such that $X = (x_1, \dots, x_p)$ contains the predictor variables and $Y$ is the response variable. The objective function is as follows:
$$\hat{\beta}^{\,lasso} = \underset{\beta}{\arg\min} \left\{ \left\| Y - X\beta \right\|^2 + \lambda \left\| \beta \right\|_1 \right\} \qquad (5)$$
where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the $\ell_1$-norm of $\beta$, $\beta$ is the vector of linear-regression coefficients, and $\lambda \ge 0$ is the tuning parameter that controls the amount of shrinkage; its value is selected by cross-validation. If $\lambda$ is set to zero, the lasso estimator coincides with OLS; in general, increasing $\lambda$ causes more coefficients to become zero [29].
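A minimal R sketch of a lasso fit with cross-validated $\lambda$, using the glmnet package mentioned in Section 6; the simulated data and object names are illustrative placeholders. Setting alpha strictly between 0 and 1 in the same call gives the elastic-net penalty of the next subsection.

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(200 * 50), ncol = 50)
y <- as.numeric(X[, 1:5] %*% rep(1, 5) + rnorm(200))

cv_lasso <- cv.glmnet(X, y, alpha = 1)              # alpha = 1 gives the lasso penalty
beta_hat <- coef(cv_lasso, s = "lambda.min")        # coefficients at the lambda with lowest CV error
selected <- which(as.numeric(beta_hat)[-1] != 0)    # indices of the retained (non-zero) variables
```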

3.2.2. Elastic Net Regression

Although lasso regression carries out variable selection, it has some defects. The first concerns the number of features that lasso can select: in high-dimensional data, lasso selects at most N features, because the selection is limited by the sample size. The second concerns correlated features: lasso selects only some of a group of correlated features, whereas it would be preferable to select or remove them all together. The elastic-net regression was introduced in [30,31] to overcome these defects. It combines two methods: lasso regression ($\ell_1$-norm) [27] and ridge regression ($\ell_2$-norm) [32]. The $\ell_1$-norm part of the penalty makes the model sparse by shrinking some regression coefficients to exactly zero, while the $\ell_2$-norm part shrinks the remaining coefficients and encourages correlated variables to be kept or removed together. The weight between the two penalties is determined by $\alpha$: the elastic net reduces to lasso regression at $\alpha = 1$ and to ridge regression at $\alpha = 0$, so the $\ell_2$ and $\ell_1$ penalties are mixed over the range $0 \le \alpha \le 1$.
Suppose $(X, Y)$ is a dataset such that $X = \left( x_1, \dots, x_p \right)$ contains the predictor variables and $Y$ is the response variable. The elastic net uses a combination of the $\ell_2$ and $\ell_1$ penalties, defined as follows:
$$\hat{\beta}^{\,elastic} = \underset{\beta}{\arg\min} \left\{ \frac{1}{2} \left\| Y - X\beta \right\|^2 + \lambda \left[ \frac{1}{2} (1 - \alpha) \left\| \beta \right\|_2^2 + \alpha \left\| \beta \right\|_1 \right] \right\} \qquad (6)$$
where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ and $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$ are the $\ell_1$-norm and the squared $\ell_2$-norm of $\beta$, respectively, and $\lambda \ge 0$ is a tuning parameter selected by cross-validation.

3.2.3. Group-Lasso Regression

In machine-learning problems, methods such as group lasso can be used to find groups of variables. Group lasso [33,34] is another popular variable-selection method, which uses an $\ell_1$-type penalty on group norms for the selection and shrinkage of variables. It is a generalization of the lasso that performs group-wise variable selection: it applies the constraint to groups of variables and estimates the coefficients group by group, so that the coefficients of a group become either all zero or all non-zero. In contrast, lasso and elastic net perform shrinkage variable by variable. The parameter vector $\beta$ is divided into groups $G_1, G_2, \dots, G_q$, where $\bigcup_{j=1}^{q} G_j = \{1, 2, \dots, p\}$. The vector $\beta$ is
$$\beta = \left( \beta_{G_1}, \beta_{G_2}, \dots, \beta_{G_q} \right), \qquad \beta_{G_j} = \left( \beta_r \; ; \; r \in G_j \right)$$
Suppose $(X, Y)$ is a dataset such that $X = \left( x_1, \dots, x_p \right)$ contains the predictor variables and $Y$ is the response variable. The group-lasso estimator is defined as follows:
$$\hat{\beta}^{\,glasso} = \underset{\beta}{\arg\min} \left\{ \frac{1}{n} \left\| Y - X\beta \right\|^2 + \lambda \sum_{j=1}^{q} m_j \left\| \beta_{G_j} \right\|_2 \right\} \qquad (7)$$
where $m_j$ is a coefficient that balances groups of different sizes; $m_j$ is chosen as $\sqrt{T_j}$, where $T_j$ is the cardinality of $G_j$, and $\lambda$ is the tuning parameter that controls the amount of regularization [35].
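As an illustration of the group-wise selection, the sketch below fits a group lasso with the gglasso package, which the paper lists among the R packages used in Section 6; the simulated data, the grouping vector, and all object names are illustrative assumptions.

```r
library(gglasso)

set.seed(1)
X     <- matrix(rnorm(200 * 20), ncol = 20)
y     <- as.numeric(X[, 1:4] %*% rep(1, 4) + rnorm(200))
group <- rep(1:5, each = 4)                       # five groups of four variables each

cv_gl  <- cv.gglasso(X, y, group = group, loss = "ls", pred.loss = "L2")
beta_g <- coef(cv_gl, s = "lambda.min")           # whole groups are zero or non-zero together
```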

4. Structure of Ensemble of Clustered and Penalized Random-Forest Model

In this section, we present a combination of machine-learning algorithms and statistical methods that reduces the number of RF trees and increases the prediction accuracy. We call this method ECAPRAF; it is formulated as an optimization problem that obtains the optimal values of the regression coefficients defined on the RF trees and reduces the number of trees within the clusters. The coefficients are obtained by solving the following equation:
$$\underset{\beta_k}{\mathrm{minimize}} \; \sum_{k=1}^{K} \sum_{u=1}^{d_k} \left( \bar{y}_k(z_u) - T_k(z_u)\beta_k \right)^2 + p_{\lambda}(\beta_t) \qquad (8)$$
where $p_{\lambda}(\beta_t)$ can be the lasso, elastic-net, or group-lasso penalty; the lasso case is given in the theorem below. $\beta_k$ is the vector of regression coefficients defined on the RF trees in the k-th cluster, $T_k(z_u)$ denotes the RF trees in the k-th cluster, and $\bar{y}_k(z_u)$ is the mean of the predicted values of the RF. The major difficulty in solving Equation (8) is that it has no closed-form solution, so numerical methods are used: the clusters are kept fixed, and Equation (8) is solved with respect to $\beta_k$. Based on the proposed model, the theorem below gives the estimating equation for the optimal $\beta_k$.
Theorem: 
Suppose $\mathrm{PRSS}(\beta_k)$ is the penalized residual sum of squares:
$$\mathrm{PRSS}(\beta_k) = \frac{1}{2} \sum_{k=1}^{K} \sum_{u=1}^{d_k} \left( \bar{y}_k(z_u) - T_k(z_u)\beta_k \right)^2 + p_{\lambda}(\beta_t) \qquad (9)$$
$$p_{\lambda}(\beta_t) = \sum_{k=1}^{K} \sum_{t=1}^{M_k} \lambda_k \left| \beta_t \right| \qquad (10)$$
where $p_{\lambda}(\beta_t)$ is the lasso penalty. The optimal value of $\beta_k$ that minimizes $\mathrm{PRSS}(\beta_k)$ is obtained by solving the following equation:
$$\sum_{k=1}^{K} \sum_{u=1}^{d_k} \frac{1}{M_k} \sum_{n=1}^{M_k} T_k(z_u)\, T_{jk}(z_u) + \sum_{k=1}^{K} \lambda_k\, \mathrm{sign}(\beta_{jk}) = \sum_{k=1}^{K} \sum_{u=1}^{d_k} T_k(z_u)\, T_{jk}(z_u)\, \beta_k \qquad (11)$$
Proof: 
$$\begin{aligned} \mathrm{PRSS}(\beta_k) &= \frac{1}{2} \sum_{k=1}^{K} \sum_{u=1}^{d_k} \left( \bar{y}_k(z_u) - T_k(z_u)\beta_k \right)^2 + \sum_{k=1}^{K} \sum_{t=1}^{M_k} \lambda_k \left| \beta_t \right| \\ &= \frac{1}{2} \sum_{k=1}^{K} \sum_{u=1}^{d_k} \left( \frac{1}{M_k} \sum_{n=1}^{M_k} T_k(z_u) - \left( 1, T_{1k}(z_u), \dots, T_{M_k}(z_u) \right) \begin{pmatrix} \beta_0 \\ \beta_{1k} \\ \vdots \\ \beta_{M_k} \end{pmatrix} \right)^2 + \sum_{k=1}^{K} \sum_{u=1}^{d_k} \lambda_k \left| \beta_t \right| \\ &= \frac{1}{2} \sum_{k=1}^{K} \sum_{u=1}^{d_k} \left( \frac{1}{M_k} \sum_{n=1}^{M_k} T_k(z_u) - \beta_0 - T_{1k}(z_u)\beta_{1k} - \dots - T_{M_k}(z_u)\beta_{M_k} \right)^2 + \sum_{k=1}^{K} \sum_{u=1}^{d_k} \lambda_k \left( \left| \beta_{1k} \right| + \dots + \left| \beta_{M_k} \right| \right) \end{aligned}$$
$$\frac{\partial\, \mathrm{PRSS}(\beta_k)}{\partial \beta_{jk}} = \sum_{k=1}^{K} \sum_{u=1}^{d_k} \left( \frac{1}{M_k} \sum_{n=1}^{M_k} T_k(z_u) - T_k(z_u)\beta_k \right) T_{jk}(z_u) + \sum_{k=1}^{K} \lambda_k\, \mathrm{sign}(\beta_{jk}), \qquad j = 1, \dots, M_k$$
If
$$\frac{\partial\, \mathrm{PRSS}(\beta_k)}{\partial \beta_{jk}} = 0,$$
then
$$\sum_{k=1}^{K} \sum_{u=1}^{d_k} \frac{1}{M_k} \sum_{n=1}^{M_k} T_k(z_u)\, T_{jk}(z_u) - \sum_{k=1}^{K} \sum_{u=1}^{d_k} T_k(z_u)\, T_{jk}(z_u)\, \beta_k + \sum_{k=1}^{K} \lambda_k\, \mathrm{sign}(\beta_{jk}) = 0, \qquad j = 1, \dots, M_k$$
$$\sum_{k=1}^{K} \sum_{u=1}^{d_k} \frac{1}{M_k} \sum_{n=1}^{M_k} T_k(z_u)\, T_{jk}(z_u) + \sum_{k=1}^{K} \lambda_k\, \mathrm{sign}(\beta_{jk}) = \sum_{k=1}^{K} \sum_{u=1}^{d_k} T_k(z_u)\, T_{jk}(z_u)\, \beta_k$$
$\beta_k$ does not have a closed form and can be obtained using numerical methods. □
Figure 3 shows the hybrid structure of the proposed model (ECAPRAF). As shown in the figure, the dataset $\{(x_1, y_1), \dots, (x_N, y_N)\}$ is clustered into K clusters $C_1, \dots, C_K$ to identify homogeneous subsets of the data. Within each cluster, the data are divided into a training set $D = \{(x_1, y_1), \dots, (x_k, y_k)\}$ and a test set $Z = \{z_1, z_2, \dots, z_d\}$, used for training and evaluation, respectively. The $M_k$ trees $T_{jk}(z_u)$, $j = 1, \dots, M_k$, $u = 1, \dots, d_k$, are trained on the training set with RF and then evaluated on the test set. The matrix $T_k(z_u) = \left( 1, T_{1k}(z_u), \dots, T_{M_k}(z_u) \right)$, generated from the outputs of the RF trees, represents the explanatory variables, and the predicted results obtained from the RF trees are $\bar{y}_k = \frac{1}{M_k} \sum_{n=1}^{M_k} T_{nk}(z_u)$, $k = 1, \dots, K$. This step produces the predicted trees used in the next stage: the constructed trees are used as the inputs of the lasso, elastic-net, and group-lasso regressions, with response variable $\bar{y}_k$, so that these trees are incorporated into the model automatically.
Suppose that, in the k-th cluster, these trees form a linear regression with independent variables $T_k$ and dependent variable $\bar{Y}_k$. The linear-regression model corresponding to the cluster is:
$$\bar{Y}_k = T_k(z_u)\beta_k + \varepsilon_k, \quad \varepsilon_k \sim N(0, \sigma_k^2), \quad k = 1, \dots, K \qquad (12)$$
where $\beta_k = \left( \beta_{1k}, \beta_{2k}, \dots, \beta_{M_k} \right)$ is the vector of regression coefficients. The purpose is to optimize and estimate the coefficients of the trees so that some of them are shrunk toward zero or removed from the model by the shrinkage methods; consequently, some coefficients become exactly zero and the corresponding trees are removed. Reducing the trees creates a more appropriate and efficient model than the traditional RF: in principle, it reduces the error and enhances the model performance. Finally, $m_k$ trees are selected in each cluster; the weak learners are corrected by the other, stronger learners, the error is reduced by averaging, and the total number of trees over all clusters becomes $m$.
To determine the value of the tuning parameter $\lambda_k$ in Equation (11), 10-fold cross-validation is implemented within each cluster to find the $\lambda_k$ that gives the minimum error. When no prior information on the value of k is available, the number of groups can be specified using formal statistical tests. One of them is the Gap statistic [22], defined as follows:
$$\mathrm{Gap}(K) = \frac{1}{B} \sum_{b=1}^{B} \log\left( W_{kb} \right) - \log\left( W_k \right) \qquad (13)$$
where B is the number of reference datasets, $W_{kb}$ is the within-group dispersion of the b-th reference dataset, and $W_k$ is the within-group dispersion of the observed data. The optimal number of clusters is selected as the smallest k satisfying the gap criterion. Several unsupervised clustering test statistics address this issue, including the Gap, Gamma, and Friedman indices; they are available in the NbClust package [23] and the cluster package in R.
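To make the whole procedure concrete, the following R sketch implements one possible reading of ECAPRAF: k-means clustering, a random forest per cluster, elastic-net shrinkage of the per-tree predictions (alpha = 0.5 is an illustrative choice), and a weighted aggregation of the cluster-level errors. The function name ecapraf and all object names are hypothetical and not the authors' code.

```r
library(randomForest)
library(glmnet)

# dat: data frame; y_col: name of the response column; k: number of clusters
ecapraf <- function(dat, y_col, k = 2, ntree = 500, alpha = 0.5, train_frac = 0.8) {
  X  <- as.matrix(dat[, setdiff(names(dat), y_col)])
  cl <- kmeans(X, centers = k, nstart = 25)$cluster          # part 1: clustering
  per_cluster <- lapply(seq_len(k), function(j) {
    d   <- dat[cl == j, ]
    idx <- sample(nrow(d), floor(train_frac * nrow(d)))
    tr  <- d[idx, ]
    te  <- d[-idx, ]
    rf  <- randomForest(reformulate(".", y_col), data = tr, ntree = ntree)  # part 2: RF
    Tz  <- predict(rf, newdata = te, predict.all = TRUE)$individual        # tree matrix T_k(z)
    ybar <- rowMeans(Tz)                                                   # RF mean prediction
    cv   <- cv.glmnet(Tz, ybar, alpha = alpha)                             # part 3: shrink trees
    kept <- sum(as.numeric(coef(cv, s = "lambda.min"))[-1] != 0)           # surviving trees
    yhat <- as.numeric(predict(cv, newx = Tz, s = "lambda.min"))           # part 4: ensemble
    list(n = nrow(te), kept = kept, mse = mean((te[[y_col]] - yhat)^2))
  })
  w <- vapply(per_cluster, `[[`, numeric(1), "n")             # cluster sizes as weights
  c(WMSE  = sum(w * vapply(per_cluster, `[[`, numeric(1), "mse")) / sum(w),
    trees = sum(vapply(per_cluster, `[[`, numeric(1), "kept")))
}
```

Under these assumptions, a call such as ecapraf(sim_data, y_col = "y", k = 2, ntree = 500) would return the weighted MSE over the clusters and the total number of trees retained by the penalty.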

5. Simulation Study

In this section, the performance of the proposed ECAPRAF is investigated through a simulation study and two real datasets. The prediction performance of ECAPRAF is compared with that of RF, and the variants ECAPRAF–Lasso, ECAPRAF–EN, and ECAPRAF–GL are compared with one another.

5.1. Simulation Study Design

In this subsection, a Monte Carlo simulation study is conducted to assess the performance of RF and the proposed model. All computational procedures are conducted by R software. This is performed in four parts aiming to improve the performance of RF. The steps include k-means clustering to identify homogeneous groups of data, the RF algorithm to predict trees, and shrinkage methods to reduce the number of trees. Finally, the ensembling of the remaining trees is performed to aggregate them. As a result, the error rate is reduced and the performance of the traditional RF algorithm is improved. In the end, the proposed hybrid algorithm and the RF algorithm are evaluated in terms of the number of selected trees and the error criteria.
In this study, we assume that the simulation dataset includes N = 500 random samples and p = 4 predictor variables for the linear model. The simulation dataset is partitioned into different clusters using k-means clustering for homogenization. The within-cluster data are grouped based on their similarity so that there is less dispersion and high correlation within clusters and less correlation between clusters. Within clusters, 80% of the dataset is selected for the training set, and the rest is chosen for the testing set. In the RF algorithm, Ntree = 300, 500, 800, and 1000 are considered for the total number of trees. Inside each cluster, 100, 240, 200, and 300 trees are considered for the first cluster, and 200, 260, 600, and 700 trees for the second cluster. The details of the hyper-parameters that are used in this framework are given in Table 1.
The linear-regression model is defined as Equation (14), taken from Wang's paper [6]. The variables $x_1$, $x_2$, $x_3$, and $x_4$ are generated from the standard normal distribution $N(0, 1)$ and follow the regression model below:
$$y = x_1 - 9x_2 - 3x_3 - 7x_4 - 4 + \varepsilon_1 \qquad (14)$$
where $\varepsilon_1$ follows the normal distribution $N(0, \sigma_1^2)$, and $\sigma_1$ is equal to $\frac{1}{3}$ of the standard deviation of $x_1 - 9x_2 - 3x_3 - 7x_4 - 4$.
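A minimal R sketch of this simulation design is given below; the coefficient vector b is a placeholder standing in for the exact linear model of Equation (14), while the sample size, number of predictors, their distribution, and the noise level follow the text.

```r
set.seed(1)
N <- 500; p <- 4
X <- matrix(rnorm(N * p), ncol = p, dimnames = list(NULL, paste0("x", 1:p)))
b <- c(1, -9, -3, -7)                      # placeholder coefficients for Equation (14)
signal <- as.numeric(X %*% b)
eps    <- rnorm(N, sd = sd(signal) / 3)    # noise sd equal to one third of the signal sd
y      <- signal + eps
sim    <- data.frame(X, y)
```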
In the first step, the dataset is partitioned into two clusters based on k-means clustering. The 3D scatter plot is drawn in Figure 4; it shows the number of clusters and the scattering of points in three dimensions, with each cluster shown in a different color. As can be seen, two clusters are obtained for the simulation dataset, highlighted in blue and red: the first, red-colored cluster contains 256 samples, and the second, blue-colored cluster contains 244 samples. The dataset consists of four features, from which three features, $x_1$, $x_2$, and $x_3$, are used to draw the diagram.
In the second step, within each cluster, the RF algorithm is used to predict the initial decision trees with high accuracy and low variance, which are the inputs of lasso, elastic net, and group lasso. The linear-regression model defined for the prediction trees of each cluster is as follows:
$$\bar{Y}_k = T_k(z_u)\beta_k + \varepsilon_k, \quad \varepsilon_k \sim N(0, 0.05), \quad k = 1, 2 \qquad (15)$$
where $T_k(z_u)$ denotes the RF trees, $\bar{y}_k = \frac{1}{M_k} \sum_{n=1}^{M_k} T_{nk}(z_u)$, $k = 1, \dots, K$, is the mean of the trees obtained from the RF algorithm, and $\beta_k$ is the vector of linear-regression coefficients. Each $\beta_k$ changes with the number of trees in its cluster; for 500 total trees, the first and second clusters contain 240 and 260 trees, respectively, and the corresponding coefficient vectors are
$$\beta_1 = \big( \underbrace{0.01, 0.01, 0.02, 0.02, \dots, 0.08, 0.08, 0.09, 0.09}_{s = 90},\ \underbrace{0, \dots, 0}_{p - s = 150} \big), \qquad \beta_2 = \big( \underbrace{0.01, 0.01, 0.02, 0.02, \dots, 0.08, 0.08, 0.09, 0.09}_{s = 90},\ \underbrace{0, \dots, 0}_{p - s = 170} \big)$$
In the next step of the proposed algorithm, the lasso, elastic-net, and group-lasso regressions are applied to shrink the tree outputs of each cluster; these methods are described in Section 3. The number of trees is thus selected automatically in each cluster without prior selection: some trees are removed, and the model is estimated based on the remaining trees. In the last step, the remaining trees are ensembled to reduce the error rate. The model obtained from this combination of machine-learning algorithms and statistical methods improves on both the traditional RF algorithm and the algorithm proposed in [6].

5.2. Simulation Results

In this subsection, the performance of our approaches, described in Section 5.1, is compared with RF. The results of the methods are calculated with 500 repetitions to evaluate the prediction accuracy using mean-squared error (MSE), root-mean-squared error (RMSE), mean absolute error (MAE), weighted mean-squared error (WMSE), weighted root-mean-squared error (WRMSE), and weighted mean absolute error (WMAE), which are used to assess the results as follows:
$$\mathrm{MSE} = \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{N}$$
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{N}}$$
$$\mathrm{MAE} = \frac{\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|}{N}$$
$$\mathrm{WMSE} = \frac{\sum_{i=1}^{N} w_i \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} w_i}$$
$$\mathrm{WRMSE} = \sqrt{\frac{\sum_{i=1}^{N} w_i \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} w_i}}$$
$$\mathrm{WMAE} = \frac{\sum_{i=1}^{N} w_i \left| y_i - \hat{y}_i \right|}{\sum_{i=1}^{N} w_i}$$
where $\hat{y}_i$ and $y_i$ are the predicted and true values of the i-th sample, and $w_i$ is the number of samples in the cluster to which the i-th sample belongs.
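A direct R transcription of the three weighted criteria is given below; the weight vector w is assumed to contain, for each observation, the size of the cluster it belongs to.

```r
# Weighted error criteria used to aggregate the cluster-level results
wmse  <- function(y, yhat, w) sum(w * (y - yhat)^2) / sum(w)
wrmse <- function(y, yhat, w) sqrt(wmse(y, yhat, w))
wmae  <- function(y, yhat, w) sum(w * abs(y - yhat)) / sum(w)
```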
In Figure 5a,c, which show the relationship between the MSE and $\log(\lambda)$, the red dots indicate the cross-validation error, and the upper and lower bars display the standard deviation for ECAPRAF–EN. The left and right vertical lines show $\lambda_{min}$ and $\lambda_{1se}$, respectively: $\lambda_{min}$ is the lambda with the lowest error, and $\lambda_{1se}$ is the lambda value based on the one-standard-error rule. The numbers along the top of each plot give the number of non-zero coefficients in each cluster.
Figure 5b,d show the ECAPRAF–EN regression coefficients in each cluster. Different coefficients have different effects on the response variable: the first variable entering the model is the most effective one, subsequently entered variables have smaller effects, and the coefficients of ineffective variables are eliminated. The same holds for lasso and group lasso.
According to Figure 5a,c, 58 trees out of 240 and 48 trees out of 260 are selected in the first and second clusters, respectively, so a total of 106 trees out of 500 are selected by ECAPRAF–EN. The left vertical dashed line corresponds to $\lambda_{min} = 0.6891$ with minimum MSE = 8.4355 in the first cluster and to $\lambda_{min} = 6.6574$ with minimum MSE = 8.6256 in the second cluster. Moreover, the bar plot of the prediction error for 500 trees is presented in Figure 6, which shows that ECAPRAF–EN has the lowest WRMSE.
According to Table 2, the results of WMSE, WRMSE, and WMAE are obtained from the simulation of a linear-regression model for 300, 500, 800, and 1000 trees, from which 99, 106, 113, and 118 trees are selected by the ECAPRAF–EN algorithm, respectively. As can be seen, the weighted MSE for the proposed ECAPRAF–EN model is equal to 8.5453, in which 106 trees out of 500 are selected. In other words, it is less than RF and other shrinkage methods. Additionally, according to the weighted MAE and RMSE, the performance of the proposed ECAPRAF–EN is better than other methods. Generally, the proposed algorithm is better than the RF algorithm. Concerning the number of trees, although the total number of trees selected by ECAPRAF–EN is more than by ECAPRAF–Lasso, the values of WMSE, WRMSE, and WMAE are the lowest compared to the others.
The simulation results of the accuracy measurements, i.e., MSE, RMSE, and MAE, are reported in Table 3 for each cluster. As can be seen, similar to the results of WRMSE in Table 2 and Figure 6, the RMSE value of ECAPRAF–EN in each cluster is the lowest value among the three algorithms. Elastic net also has the highest reduction compared to other shrinkage methods. Therefore, it can be concluded that both within clusters and for the sum of two clusters, the ECAPRAF–EN algorithm has the lowest error and the highest accuracy among the other proposed methods.
As shown in Table 3, the number of RF trees used within some clusters may be the same, but the number of trees reduced by shrinkage methods can be different. This leads to different prediction accuracies. The reason for this can be the amount of correlation and the homogeneity of the data within the clusters.
Generally, it can be said that the trees produced by the RF, due to their high variance and low bias, can be associated with high error and low accuracy [36]. As a result, using shrinkage methods can help improve the performance of RF, and this improvement varies across methods. For example, in lasso regression, if the number of trees is larger than the number of observations, then the number of selected trees cannot exceed the number of observations. The elastic-net regression does not have this defect; although it selects a larger number of trees, it has lower error and greater accuracy than lasso. Concerning group lasso, although group-wise selection removes whole groups of trees and can create a better model than the other methods, it may also remove trees that would have improved the model performance.

6. The Real Data Analysis

The performance of the random-forest, ECAPRAF–Lasso, ECAPRAF–EN, and ECAPRAF–GL algorithms is described through two real datasets: the Boston house-price and real-estate-valuation datasets. The basic information on the two datasets is shown in Table 4. The Boston house-price dataset contains 506 observations and 13 variables, and its response variable is the median price of owner-occupied houses. The data were first published by Harrison and Rubinfeld [37] and are publicly available through the MASS package in R (https://cran.r-project.org/package=MASS (accessed on 7 May 2021)). The other dataset contains real-estate valuations from the UCI machine-learning repository (accessed on 7 May 2021). The original owners of this dataset are Yeh and Hsu [38]; it consists of seven variables, from which five are selected as independent variables, namely, house age, distance to the nearest MRT station, number of convenience stores in the living circle on foot, latitude, and longitude, with the house price of the unit area as the dependent variable. As in the simulation section, the datasets are first clustered; then, in each cluster, 80% of the data are selected for the training set and the remainder for the testing set. In this study, the glmnet, gglasso, NbClust, and randomForest packages in R are used.
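For instance, the Boston data can be loaded directly from the MASS package and passed to the illustrative ecapraf() function sketched in Section 4, with medv (the median house value) as the response; this is a usage sketch, not the authors' script.

```r
data(Boston, package = "MASS")
set.seed(1)
res <- ecapraf(Boston, y_col = "medv", k = 2, ntree = 500)
res   # weighted MSE over the clusters and total number of retained trees
```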
In the real-estate-valuation and Boston house-price datasets, the data are divided into three and two clusters, respectively, and different numbers of trees are used in each cluster. In the RF algorithm, Ntree = 300, 500, 800, and 1000 are considered for the total number of trees over the clusters, with different numbers of trees in each cluster. The 3D scatter plots are drawn in Figure 7, which shows the number of clusters and the scattering of points in three dimensions, with each cluster shown in a different color. As can be seen in Figure 7a, two clusters are obtained for the Boston house-price dataset, highlighted in blue and red: the first, red-colored cluster contains 369 samples, and the other, blue-colored cluster contains 137 samples. The dataset consists of 13 features, from which 3 features, namely, the average number of rooms (rm), the pupil–teacher ratio by town (ptratio), and the weighted distances to employment centers (dis), are used to draw the diagram. For the real-estate-valuation dataset, three clusters colored blue, red, and green are identified, containing 41 samples in the first cluster, 93 in the second cluster, and 280 in the third cluster; the diagram is shown in Figure 7b.
The prediction results of the proposed technique (ECAPRAF) are shown for the real-estate-valuation dataset in Table 5 and for the Boston house-price dataset in Table 6, where the WMSE, WRMSE, and WMAE are calculated for different numbers of trees. For the Boston house-price dataset, the WRMSE, WMAE, and WMSE obtained for RF are 3.4435, 2.1903, and 11.8579, respectively, whereas the values obtained with the proposed ECAPRAF–EN are 3.2218, 2.1515, and 10.3801, respectively. As can be seen, the proposed method yields about a 12% reduction compared to RF. There is also a significant reduction for the real-estate-valuation dataset; the results are shown in Table 5. On these datasets, ECAPRAF presents more acceptable results than the traditional RF, and ECAPRAF–EN yields more efficient results than the other shrinkage methods. Although ECAPRAF–EN selects more trees than ECAPRAF–Lasso, it achieves lower WMSE, WRMSE, and WMAE. The other algorithms also perform better than RF. Therefore, it is concluded that the ECAPRAF–EN algorithm improves the performance of RF by reducing the number of trees and also outperforms the other algorithms. The performance of the proposed technique and RF in terms of WRMSE is shown in Figure 8.

7. Discussions

In this paper, the ECAPRAF model is proposed to increase the RF algorithm accuracy and reduce the computational time. In the proposed model, the k-means clustering algorithm clusters the data. RF extracts the prediction trees. The shrinkage methods such as lasso, elastic net, and group lasso are used to reduce the number of random-forest trees. The generated trees are input into lasso, group lasso, and EN algorithms for tree reduction. Finally, the remaining trees are ensembled.
Our combined model achieved higher accuracy than the published study [6], which used only lasso to reduce the number of RF trees. In the present model, after clustering, the MSE for the simulated linear model with 500 trees reached 8.56, compared with 21.64 in [6]. On the real datasets, i.e., Boston house price and real-estate valuation, the MSE reached 10.93 and 32.94 in our model, respectively, compared with 12.84 and 61.97 in [6]. Moreover, we used the elastic-net method instead of lasso, and EN performed better than lasso.
Using the RF trees as the input of the EN method to reduce the number of trees, ECAPRAF–EN achieved a 12% error reduction, the largest reduction among the compared methods. This indicates that the proposed model has higher accuracy and smaller error than the traditional methods and is therefore effective.
Although ECAPRAF–EN improved the prediction accuracy compared with the other methods, in future work we will investigate deep-learning methods and different ensemble methods to further improve model performance and prediction accuracy.

8. Conclusions and Future Work

In this research, we proposed a combination of machine-learning algorithms and statistical methods consisting of four parts. In the first part, k-means clustering was used to identify k homogeneous clusters, which homogenizes the data and assigns them to similar clusters. In the second part, the data within each cluster were used to train RF trees for prediction, producing weak trees with high accuracy. In the third part, shrinkage methods such as lasso, elastic net, and group lasso were used to decrease the number of RF trees; they increased efficiency, reduced error, and improved model performance. Finally, the remaining trees were ensembled to reduce variance and error. To evaluate the efficiency of the designed method, the ECAPRAF model was applied to both real and simulated data, and the results were analyzed using the MSE, RMSE, and MAE criteria for each cluster and the WMSE, WRMSE, and WMAE criteria for the weighted sum of clusters. The simulation results showed that, in all four settings, ECAPRAF–EN, ECAPRAF–Lasso, and ECAPRAF–GL reduced the error by about 12%, 11.5%, and 11% compared with RF, respectively; on the real datasets, the reductions were about 12%, 7%, and 11%, respectively. Among the three methods, ECAPRAF–EN provided better results while retaining more trees within the clusters: with 128 trees, it performed better than ECAPRAF–Lasso with 100 trees and ECAPRAF–GL with 105 trees. Overall, ECAPRAF–EN has the best performance among the shrinkage methods considered. Therefore, it can be concluded that the proposed model, which seeks to increase the accuracy and improve the performance of the traditional RF through clustering and shrinkage methods, is efficient.
As future work, we will use the CNN algorithm instead of random forest, replacing the random-forest trees with multiple parallel CNNs; in other words, we will enhance the CNN network by using shrinkage methods.

Author Contributions

Conceptualization, Z.F. and H.B.; methodology, Z.F.; software, Z.F.; validation, H.B. and M.-R.F.-D.; formal analysis, Z.F.; writing—original draft preparation, Z.F.; writing—review and editing, H.B., M.-R.F.-D., M.F.I. and W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Boston house price: http://lib.stat.cmu.edu/datasets/boston (accessed on 7 May 2021); Real Estate Valuation: https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set (accessed on 7 May 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef] [Green Version]
  2. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  3. Dietterich, T.G. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Mach. Learn. 2000, 40, 139–157. [Google Scholar] [CrossRef]
  4. Amit, Y.; Geman, D.; Wilder, K. Joint Induction of Shape Features and Tree Classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 1300–1305. [Google Scholar] [CrossRef] [Green Version]
  5. Khan, Z.; Gul, A.; Perperoglou, A.; Miftahuddin, M.; Mahmoud, O.; Adler, W.; Lausen, B. Ensemble of Optimal Trees, Random Forest and Random Projection Ensemble Classification; Springer: Berlin/Heidelberg, Germany, 2020; Volume 14, ISBN 1163401900. [Google Scholar]
  6. Wang, H.; Wang, G. Improving Random Forest Algorithm by Lasso Method. J. Stat. Comput. Simul. 2020, 91, 353–367. [Google Scholar] [CrossRef]
  7. Li, Q.; Song, Z. Ensemble-Learning-Based Prediction of Steel Bridge Deck Defect Condition. Appl. Sci. 2022, 12, 5442. [Google Scholar] [CrossRef]
  8. Alazba, A.; Aljamaan, H. Software Defect Prediction Using Stacking Generalization of Optimized Tree-Based Ensembles. Appl. Sci. 2022, 12, 4577. [Google Scholar] [CrossRef]
  9. Liu, Y.; Yan, X.; Zhang, C.; Liu, W. An Ensemble Convolutional Neural Networks for Bearing Fault Diagnosis Using Multi-Sensor Data. Sensors 2019, 19, 5300. [Google Scholar] [CrossRef] [Green Version]
  10. Hassan, M.Y.; Arman, H. Comparison of Six Machine-Learning Methods for Predicting the Tensile Strength (Brazilian) of Evaporitic Rocks. Appl. Sci. 2021, 11, 5207. [Google Scholar] [CrossRef]
  11. Ali, M.A.S.; Orban, R.; Ramasamy, R.R.; Muthusamy, S.; Subramani, S.; Sekar, K.; Rajeena, P.P.F.; Gomaa, I.A.E.; Abulaigh, L.; Elminaam, D.S.A.; et al. A Novel Method for Survival Prediction of Hepatocellular Carcinoma Using Feature-Selection Techniques. Appl. Sci. 2022, 12, 6427. [Google Scholar] [CrossRef]
  12. Kharoubi, R.; Oualkacha, K.; Mkhadri, A. The Cluster Correlation-Network Support Vector Machine for High-Dimensional Binary Classification. J. Stat. Comput. Simul. 2019, 89, 1020–1043. [Google Scholar] [CrossRef]
  13. Wang, L.; Zhu, J.; Zou, H. The Doubly Regularized Support Vector Machine. Stat. Sin. 2006, 16, 589–615. [Google Scholar]
  14. Wang, M.; Yue, L.; Cui, X.; Chen, C.; Zhou, H.; Ma, Q.; Yu, B. Prediction of Extracellular Matrix Proteins by Fusing Multiple Feature Information, Elastic Net, and Random Forest Algorithm. Mathematics 2020, 8, 169. [Google Scholar] [CrossRef] [Green Version]
  15. Becker, N.; Toedt, G.; Lichter, P.; Benner, A. Elastic SCAD as a Novel Penalization Method for SVM Classification Tasks in High-Dimensional Data. BMC Bioinform. 2011, 12, 138. [Google Scholar] [CrossRef] [Green Version]
  16. Chavent, M.; Genuer, R.; Saracco, J. Combining Clustering of Variables and Feature Selection Using Random Forests. Commun. Stat. Simul. Comput. 2018, 50, 426–445. [Google Scholar] [CrossRef] [Green Version]
  17. Yassin, S.S.; Pooja. Road Accident Prediction and Model Interpretation Using a Hybrid K-Means and Random Forest Algorithm Approach. SN Appl. Sci. 2020, 2, 1576. [Google Scholar] [CrossRef]
  18. Macqueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; University of California: Los Angeles, CA, USA, 1967; Volume 281, p. 97. [Google Scholar]
  19. Tutz, G.; Koch, D. Improved Nearest Neighbor Classifiers by Weighting and Selection of Predictors. Stat. Comput. 2016, 26, 1039–1057. [Google Scholar] [CrossRef]
  20. Bouveyron, C.; Brunet, C. Simultaneous Model-Based Clustering and Visualization in the Fisher Discriminative Subspace. Stat. Comput. 2012, 22, 301–324. [Google Scholar] [CrossRef] [Green Version]
  21. Farhadi, Z.; Arabi Belaghi, R.; Gurunlu Alma, O. Analysis of Penalized Regression Methods in a Simple Linear Model on the High-Dimensional Data. Am. J. Theor. Appl. Stat. 2019, 8, 185. [Google Scholar] [CrossRef] [Green Version]
  22. Tibshirani, R.; Walther, G.; Hastie, T. Estimating the Number of Clusters in a Data Set via the Gap Statistic. J. R. Stat. Soc. Ser. B Stat. Methodol. 2001, 63, 411–423. [Google Scholar] [CrossRef]
  23. Charrad, M.; Ghazzali, N.; Boiteau, V.; Niknafs, A. Nbclust: An R Package for Determining the Relevant Number of Clusters in a Data Set. J. Stat. Softw. 2014, 61, 1–36. [Google Scholar] [CrossRef]
  24. Aldino, A.A.; Darwis, D.; Prastowo, A.T.; Sujana, C. Implementation of K-Means Algorithm for Clustering Corn Planting Feasibility Area in South Lampung Regency. J. Phys. Conf. Ser. 2021, 1751, 012038. [Google Scholar] [CrossRef]
  25. Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef] [Green Version]
  26. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2013; ISBN 9781107298019. [Google Scholar]
  27. Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
  28. Liu, X.; Meng, X.; Wang, X. Carbon Emissions Prediction of Jiangsu Province Based on Lasso-BP Neural Network Combined Model. IOP Conf. Ser. Earth Environ. Sci. 2021, 769, 022017. [Google Scholar] [CrossRef]
  29. Böheim, R.; Stöllinger, P. Decomposition of the Gender Wage Gap Using the LASSO Estimator. Appl. Econ. Lett. 2021, 28, 817–828. [Google Scholar] [CrossRef]
  30. Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef] [Green Version]
  31. Van der Kooij, A.J. Regularization with Ridge Penalties, the Lasso, and the Elastic Net for Regression with Optimal Scaling Transformations. In Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations; Leiden University: Leiden, The Netherlands, 2007; pp. 65–90. ISBN 9789090219363. [Google Scholar]
  32. Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970, 12, 55–67. [Google Scholar] [CrossRef]
  33. Yuan, M.; Lin, Y. Model Selection and Estimation in Regression with Grouped Variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006, 68, 49–67. [Google Scholar] [CrossRef]
  34. Meier, L.; Van De Geer, S. The Group Lasso for Logistic Regression. J. R. Stat. Soc. Ser. B 2006, 70, 53–71. [Google Scholar] [CrossRef] [Green Version]
  35. Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations; Chapman & Hall: New York, NY, USA, 2015; ISBN 9781498712170. [Google Scholar]
  36. James, G.; Witten, D.; Tibshirani, R.; Hastie, T. An Introduction to Statistical Learning with Applications in R; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  37. Harrison, D.; Rubinfeld, D.L. Hedonic Housing Prices and the Demand for Clean Air. J. Environ. Econ. Manag. 1978, 5, 81–102. [Google Scholar] [CrossRef]
  38. Yeh, I.C.; Hsu, T.K. Building Real Estate Valuation Models with Comparative Approach through Case-Based Reasoning. Appl. Soft Comput. J. 2018, 65, 260–271. [Google Scholar] [CrossRef]
Figure 1. Flowchart of ECAPRAF. A dataset undergoes clustering in order to identify homogeneous subsets of data and assign them to similar groups. Then, trees are extracted as predictors by random forest and reduced by shrinkage methods. Therefore, elimination through the shrinkage methods as a means of reduction leads to a decrease in error and an increase in the model accuracy. In the last part, the remaining trees are ensembled to improve the performance of the model.
Figure 2. The basic structure of random forest.
Figure 3. Detailed structure of ECAPRAF Model. First, dataset is clustered into K classes to identify the homogeneous subset of data and assign them to similar groups. Inside each cluster, the data are trained to produce RF trees for a more accurate prediction. The trees are reduced by shrinkage methods in order to reduce computational time. Finally, all clusters are ensembled.
Figure 4. The 3D scatter plot shows the number of clusters and the relationship between the variables, with different colors, on three axes. The dataset consists of four variables, from which three variables, $x_1$, $x_2$, and $x_3$, are plotted.
Figure 5. (a,b) The simulation results of the algorithm in the first cluster. (c,d) The simulation results of the algorithm in the second cluster. In (a,c), the left and right dashed lines show $\lambda_{min}$ and $\lambda_{1se}$, respectively; $\lambda_{min}$ is the lambda with the lowest error, and $\lambda_{1se}$ is the lambda value based on the one-standard-error rule. The numbers along the upper part of the plot show the number of non-zero coefficients in each cluster for a given $\log(\lambda)$. (b,d) The number of effective variables in the model.
Figure 6. Bar plot showing the difference among the proposed methods based on 500 trees. The RMSE value of ECAPRAF–EN is the lowest value among the three algorithms. Elastic net also has the highest reduction compared to other shrinkage methods.
Figure 7. 3D scatter plots showing the number of clusters and the relationships between variables, with different colors on three axes. (a) The Boston house-price dataset consists of 13 features, of which three are plotted: the average number of rooms (rm), the pupil–teacher ratio by town (ptratio), and the weighted distance to employment centers (dis). (b) The real-estate-valuation dataset consists of 5 features, of which three are plotted: longitude, latitude, and house age.
Figure 8. Bar plot of the differences among the proposed methods based on 500 trees. ECAPRAF–EN has the lowest RMSE of the three algorithms, and elastic net yields the largest reduction among the shrinkage methods.
Table 1. Parameter setting for simulation study.
Hyper-Parameter | Description | Setting
mtry | Number of candidate variables drawn at each split | p/3 for regression
Sample | Number of observations | N = 500
Features | Predictor variables for the linear model | p = 4
Number of clusters | Clusters obtained with k-means clustering | K = 2
Number of total trees | Number of trees in the forest | 300, 500, 800, 1000
Number of cluster trees | Number of trees in the first cluster | 100, 240, 200, 300
Number of cluster trees | Number of trees in the second cluster | 200, 260, 600, 700
Iteration | Number of repetitions of the simulation | 500
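The settings in Table 1 translate into code roughly as follows. This sketch assumes scikit-learn and an arbitrary linear data-generating model; the coefficients and noise level are illustrative, not the paper's, and in the paper the total forest size is split unevenly between the two clusters (e.g., 240 and 260 trees for a 500-tree forest).

```python
# Sketch of the Table 1 simulation settings: N = 500, p = 4, K = 2 clusters,
# mtry = p/3, forests of 300/500/800/1000 trees, repeated over many iterations.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
N, p = 500, 4
X = rng.normal(size=(N, p))
# Assumed linear model for illustration only.
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(scale=3.0, size=N)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for n_trees in (300, 500, 800, 1000):
    for k in (0, 1):
        mask = labels == k
        rf = RandomForestRegressor(
            n_estimators=n_trees, max_features=1 / 3, random_state=0  # mtry = p/3
        ).fit(X[mask], y[mask])
```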
Table 2. The results of the proposed method for the linear model.
Ntree | Algorithm | WMSE | WRMSE | WMAE | Selected Trees
300 | Random Forest | 9.907040504 | 3.135776591 | 2.457571958 | 300
300 | ECAPRAF–Lasso | 8.667368271 | 2.944039448 | 2.307557453 | 84
300 | ECAPRAF–EN | 8.643681852 | 2.94001392 | 2.304445756 | 99
300 | ECAPRAF–GL | 8.769322968 | 2.961304268 | 2.325545926 | 135
500 | Random Forest | 9.690688581 | 3.102433591 | 2.435397318 | 500
500 | ECAPRAF–Lasso | 8.565587148 | 2.926702436 | 2.292483911 | 83
500 | ECAPRAF–EN | 8.545308173 | 2.923235908 | 2.288376565 | 106
500 | ECAPRAF–GL | 8.616568 | 2.935399121 | 2.30589166 | 175
800 | Random Forest | 9.855558666 | 3.127227327 | 2.44665839 | 800
800 | ECAPRAF–Lasso | 8.621290954 | 2.936203493 | 2.293232464 | 90
800 | ECAPRAF–EN | 8.581137457 | 2.929357857 | 2.288603167 | 113
800 | ECAPRAF–GL | 8.628389783 | 2.937412089 | 2.302874334 | 160
1000 | Random Forest | 9.694166557 | 3.104002527 | 2.428734774 | 1000
1000 | ECAPRAF–Lasso | 8.591221483 | 2.931078553 | 2.29209065 | 103
1000 | ECAPRAF–EN | 8.561964742 | 2.926083516 | 2.287156371 | 118
1000 | ECAPRAF–GL | 8.572411865 | 2.927868143 | 2.297112936 | 165
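The WMSE, WRMSE, and WMAE columns are weighted versions of the usual error criteria computed across the clusters. The helper below is a sketch that assumes each cluster is weighted by its share of the test observations; the exact weighting follows the definition given earlier in the paper.

```python
# Sketch of weighted error criteria (WMSE, WRMSE, WMAE) under the assumption
# that each cluster is weighted by its fraction of the test observations.
import numpy as np

def weighted_metrics(errors_by_cluster):
    """errors_by_cluster: list of (y_true, y_pred) array pairs, one per cluster."""
    n_total = sum(len(y_true) for y_true, _ in errors_by_cluster)
    wmse = wrmse = wmae = 0.0
    for y_true, y_pred in errors_by_cluster:
        w = len(y_true) / n_total          # assumed cluster weight
        resid = y_true - y_pred
        wmse += w * np.mean(resid ** 2)
        wrmse += w * np.sqrt(np.mean(resid ** 2))
        wmae += w * np.mean(np.abs(resid))
    return wmse, wrmse, wmae
```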
Table 3. The results of the proposed method for the linear model in the clusters.
Class | Ntree | Algorithm | MSE | RMSE | MAE | Selected Trees
Class 1 | 100 | ECAPRAF–Lasso | 8.771454854 | 2.941930802 | 2.340062232 | 38
Class 1 | 100 | ECAPRAF–EN | 8.755369012 | 2.939273653 | 2.338975803 | 49
Class 1 | 100 | ECAPRAF–GL | 8.834635058 | 2.952447098 | 2.353491403 | 50
Class 1 | 240 | ECAPRAF–Lasso | 8.48091468 | 2.896388688 | 2.311120159 | 48
Class 1 | 240 | ECAPRAF–EN | 8.435537518 | 2.88973851 | 2.306854126 | 58
Class 1 | 240 | ECAPRAF–GL | 8.460982026 | 2.893231877 | 2.309807383 | 85
Class 1 | 200 | ECAPRAF–Lasso | 8.6025055 | 2.91579362 | 2.321329903 | 40
Class 1 | 200 | ECAPRAF–EN | 8.569467143 | 2.911295916 | 2.313871952 | 54
Class 1 | 200 | ECAPRAF–GL | 8.573476967 | 2.911484803 | 2.316801795 | 65
Class 1 | 300 | ECAPRAF–Lasso | 8.570308727 | 2.907906984 | 2.315400928 | 49
Class 1 | 300 | ECAPRAF–EN | 8.487885223 | 2.894469963 | 2.310524005 | 55
Class 1 | 300 | ECAPRAF–GL | 8.543043047 | 2.902892836 | 2.313360543 | 60
Class 2 | 200 | ECAPRAF–Lasso | 8.568160746 | 2.907691807 | 2.276576336 | 46
Class 2 | 200 | ECAPRAF–EN | 8.537230028 | 2.902463664 | 2.271534304 | 50
Class 2 | 200 | ECAPRAF–GL | 8.707072382 | 2.932313251 | 2.298910393 | 85
Class 2 | 260 | ECAPRAF–Lasso | 8.646290595 | 2.919224788 | 2.275972475 | 35
Class 2 | 260 | ECAPRAF–EN | 8.625681531 | 2.915495283 | 2.270765139 | 48
Class 2 | 260 | ECAPRAF–GL | 8.789112679 | 2.943872273 | 2.300908247 | 90
Class 2 | 600 | ECAPRAF–Lasso | 8.63919584 | 2.916575153 | 2.270767945 | 50
Class 2 | 600 | ECAPRAF–EN | 8.588438862 | 2.907950701 | 2.264518857 | 59
Class 2 | 600 | ECAPRAF–GL | 8.684550424 | 2.92488069 | 2.285283869 | 95
Class 2 | 700 | ECAPRAF–Lasso | 8.611153953 | 2.910771216 | 2.26987304 | 54
Class 2 | 700 | ECAPRAF–EN | 8.579999482 | 2.905078274 | 2.264884094 | 63
Class 2 | 700 | ECAPRAF–GL | 8.652976321 | 2.919281462 | 2.281626936 | 105
Table 4. Datasets summary.
Data Set | Number of Samples | Number of Variables | Number of Training Samples | Number of Test Samples
Boston House Prices | 506 | 13 | 407 | 99
Real Estate Valuation | 414 | 5 | 332 | 82
Table 5. The results of the proposed method for the real-estate-valuation dataset.
Ntree | Algorithm | WMSE | WRMSE | WMAE | Selected Trees
300 | Random Forest | 107.8263312 | 10.38394584 | 5.227023752 | 300
300 | ECAPRAF–Lasso | 63.90547096 | 7.994089752 | 5.353364544 | 74
300 | ECAPRAF–EN | 60.54003401 | 7.780747651 | 5.318630768 | 105
300 | ECAPRAF–GL | 66.21407167 | 8.137202939 | 5.412917123 | 110
500 | Random Forest | 108.0136379 | 10.39296098 | 5.251413432 | 500
500 | ECAPRAF–Lasso | 32.94807873 | 5.740041701 | 4.322224372 | 67
500 | ECAPRAF–EN | 30.33592275 | 5.507805621 | 4.207588771 | 111
500 | ECAPRAF–GL | 31.62320614 | 5.623451444 | 4.211087924 | 135
800 | Random Forest | 107.9326224 | 10.38906263 | 5.248644659 | 800
800 | ECAPRAF–Lasso | 47.00931756 | 6.856334119 | 4.895265056 | 77
800 | ECAPRAF–EN | 45.70566097 | 6.7605962 | 4.874371821 | 137
800 | ECAPRAF–GL | 47.01065198 | 6.856431432 | 4.917456438 | 140
1000 | Random Forest | 107.5598427 | 10.37110614 | 5.241761999 | 1000
1000 | ECAPRAF–Lasso | 45.89325716 | 6.774456226 | 4.876710703 | 75
1000 | ECAPRAF–EN | 44.71192516 | 6.686697627 | 4.800701571 | 141
1000 | ECAPRAF–GL | 47.84737551 | 6.917179737 | 4.954152998 | 155
Table 6. The results of the proposed method for the Boston house-price dataset.
Ntree | Algorithm | WMSE | WRMSE | WMAE | Selected Trees
300 | Random Forest | 12.05356086 | 3.471823852 | 2.192973908 | 300
300 | ECAPRAF–Lasso | 11.2855362 | 3.359395213 | 2.262931613 | 84
300 | ECAPRAF–EN | 10.55174201 | 3.248344504 | 2.157154185 | 106
300 | ECAPRAF–GL | 11.57883518 | 3.402768752 | 2.226525129 | 105
500 | Random Forest | 11.85795023 | 3.443537459 | 2.190326265 | 500
500 | ECAPRAF–Lasso | 10.93821866 | 3.307297788 | 2.234521671 | 92
500 | ECAPRAF–EN | 10.38014764 | 3.221823651 | 2.140801457 | 128
500 | ECAPRAF–GL | 10.53635914 | 3.245975837 | 2.151589584 | 100
800 | Random Forest | 11.88962813 | 3.448134007 | 2.125427866 | 800
800 | ECAPRAF–Lasso | 11.41440075 | 3.378520497 | 2.277955818 | 100
800 | ECAPRAF–EN | 9.588927134 | 3.096599285 | 2.076992437 | 129
800 | ECAPRAF–GL | 10.97505102 | 3.312861454 | 2.231254518 | 105
1000 | Random Forest | 11.73102545 | 3.425058459 | 2.116374462 | 1000
1000 | ECAPRAF–Lasso | 8.927716935 | 2.987928536 | 2.041401696 | 100
1000 | ECAPRAF–EN | 8.67016921 | 2.944515106 | 2.021913684 | 128
1000 | ECAPRAF–GL | 9.407644139 | 3.067188312 | 2.055927983 | 140