Article

SFS-AGGL: Semi-Supervised Feature Selection Integrating Adaptive Graph with Global and Local Information

1 School of Software, Jiangxi Normal University, Nanchang 330022, China
2 College of Computer Science, Shenyang Aerospace University, Shenyang 110136, China
3 College of Information Science and Technology, Northeast Normal University, Changchun 130117, China
* Authors to whom correspondence should be addressed.
Information 2024, 15(1), 57; https://doi.org/10.3390/info15010057
Submission received: 1 November 2023 / Revised: 30 December 2023 / Accepted: 14 January 2024 / Published: 17 January 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

As the feature dimension of data continues to grow, the task of selecting an optimal subset of features from a pool of limited labeled data and extensive unlabeled data becomes increasingly challenging. In recent years, some semi-supervised feature selection (SSFS) methods have been proposed to select a subset of features, but they still have drawbacks that limit their performance; for example, many SSFS methods underutilize the structural distribution information available within labeled and unlabeled data. To address this issue, we propose a semi-supervised feature selection method based on an adaptive graph with global and local constraints (SFS-AGGL) in this paper. Specifically, we first designed an adaptive graph learning mechanism that considers both the global and local information of samples to effectively learn and retain the geometric structural information of the original dataset. Secondly, we constructed a label propagation technique integrated with the adaptive graph learning in SFS-AGGL to fully utilize the structural distribution information of both labeled and unlabeled data. The proposed SFS-AGGL method is validated through classification and clustering tasks across various datasets. The experimental results demonstrate its superiority over existing benchmark methods, particularly in terms of clustering performance.

1. Introduction

High-dimensional data can describe real-world objects more realistically and effectively. However, such data might include a vast amount of redundant and irrelevant information. Processing these data directly not only consumes a large amount of storage space and computational resources but also degrades the performance of existing models [1]. Therefore, it is necessary to mine the potential relationships within the data to select and learn useful feature information.
Feature representation learning (FRL) is one of the most effective methods of learning useful feature information. Among the existing FRL methods, feature extraction (FE) [2] and feature selection (FS) [3] are two representative methods. FE aims to map the original high-dimensional feature space to a low-dimensional subspace according to some pre-defined criteria [4]. FS selects an optimal feature subset from the original feature set based on evaluation metrics [5]. In comparison, FS is more interpretable than FE since it can remove irrelevant and redundant features from the original features and retain a small number of relevant features. Therefore, FS is widely used in image classification, bioinformatics, face recognition, medical image analysis, natural language processing, and other fields [6].
FS methods can be divided into unsupervised feature selection (UFS), supervised feature selection (SFS), and semi-supervised feature selection (SSFS). UFS methods achieve feature selection by using only unlabeled data; they have received widespread attention since they do not require any labeled data. However, the lack of label-guided learning in UFS methods leads to poor performance in practical application tasks [7]. Thus, SFS methods have been devised to leverage the label information of the samples to guide the process of FS, enhancing the distinctiveness of the selected features and consequently improving the performance of classification and clustering [8]. However, obtaining ample labeled data is very challenging and time consuming in practical situations. For this reason, many SSFS methods have been proposed in the past decades. SSFS methods employ semi-supervised learning (SSL) to leverage the information of limited labeled data and a substantial volume of unlabeled data, enhancing the feature selection ability of the model [9]. The existing SSFS methods can be classified into filter, wrapper, and embedded methods [10]. Filter methods first evaluate each feature based on the principles of statistics or information theory and then perform FS according to the calculated weights. A major benefit of filter methods is that they are applicable to large-scale datasets thanks to their high speed and computational efficiency. However, filter methods may ignore the redundant information generated by combinations of multiple features [11]. Thus, some wrapper approaches have been proposed to exploit the interrelationships of features and mine the best combination of features. However, these approaches have high computational complexity, which makes them unsuitable for processing large-scale data [12]. In contrast to the above-mentioned methods, embedded methods combine FS and model training: FS is automatically executed during the process of model training, which allows embedded methods to improve the efficiency of FS by reducing runtime [13]. Therefore, they have become mainstream and are widely used in various scenarios.
In recent years, several semi-supervised embedded feature selection (SSEFS) methods have emerged. For example, Zhao et al. [14] introduced an SSFS method using both labeled and unlabeled data. Sparse regularization has been recognized as an effective strategy for selecting useful features and reducing the dimensionality of feature representations [15]. Chen et al. [16] introduced an efficient semi-supervised feature selection (ESFS) method. ESFS first combines SSL and sparse regularization to obtain feature subsets. Then, it uses probability matrices of unlabeled data to measure the relevance of features to the class, aiming to identify the globally optimal feature subset. Least squares regression (LSR), which is supported by a complete statistical theory, can handle noisy data effectively and thus improve computational efficiency [17]. Therefore, Chang et al. [18] proposed a convex sparse feature selection (CSFS) method based on LSR, which employs convex optimization theory to fit samples and predict labels, selecting the most critical features using constraint terms. Chen et al. [19] contended that LSR-based feature selection lacks interpretability and struggles to identify a global sparse solution. Hence, they proposed an embedded SSFS method based on rescaled linear regression, which exploits the L21 norm to obtain both global and sparse solutions. Moreover, they also introduced a sparse regularization with an implicit L2p norm to obtain sparser and more interpretable solutions [20]. This approach effectively constrains the regression coefficients, achieving feature ranking. In addition, Liu et al. [21] combined sparse features and considered the correlation of samples in the original high- and low-dimensional spaces to improve the performance of feature learning. Despite the good results achieved by the sparse model-based methods, there are still some problems.
The first problem is that most of the methods do not construct graphs to better preserve the geometric structural information of the data during the FS process. Initially, KNN was adopted by some FS methods to construct graphs based on Euclidean distances [22,23,24,25]. To minimize the influence of the redundant features and noise in the original high-dimensional data on the graph construction process, Chen et al. [26] employed local discriminant analysis (LDA) to map the data from the high-dimensional space to a low-dimensional space. Subsequently, numerous graph construction methods based on data correlation have been presented, including the L1 graph [27], low-rank representation (LRR) [28], local structure learning [29], and sparse subspace clustering (SSC) [30], to construct high-quality graphs. These graph construction methods have been integrated into FS models, yielding a large number of improved feature selection algorithms [31,32,33,34,35,36,37]. However, the processes of graph construction and FS in the above-mentioned methods are independent of each other, so the influence of graph construction on the FS process is limited. To this end, some methods have been constructed to unify adaptive graph learning (AGL) and FS into a single framework [38,39,40,41].
The second problem is that the spatial distribution of the sample label information is not sufficiently considered, resulting in the weak discriminative ability of the selected features, which further leads to poor classification or clustering performance. To alleviate this issue, label propagation (LP) has been incorporated into the FS methods [42,43,44]. However, since LP is also a graph-learning-based algorithm, the quality of the learned graph affects the performance to some extent. Therefore, numerous methods have emerged to merge AGL and LP [45,46,47,48]. However, these methods still have the following limitations: (1) the process of AGL is based on the original data; (2) the process of adaptive graph construction only considers the local structure or the global structure. Therefore, these methods are inevitably affected by high-dimensional features or noisy data.
To address the above-mentioned issues, this study develops a novel SSFS framework, SFS-AGGL, which integrates FS, AGL, and LP to capture both global and local structural information of the data for selecting an optimal feature subset with maximum discrimination and minimum redundancy. In AGL, global and local constraints are imposed on the reconstruction coefficients obtained from the self-representation of the selected low-dimensional features. Meanwhile, the similarity matrix obtained by AGL is integrated into LP, enhancing label prediction performance. To improve the discriminative ability of the selected features, the predicted label matrix is introduced into the sparse feature selection (SFS) process. SFS is performed through the mutual promotion of the three models. The framework of the proposed SFS-AGGL is shown in Figure 1.
The primary contributions of this paper are as follows:
(1) An efficient SSFS framework is proposed by combining the advantages of FS, adaptive learning, and LP.
(2) An adaptive learning strategy based on low-dimensional features is designed to counteract the influence of high-dimensional features or noisy data. Moreover, global and local constraints are introduced.
(3) An LP based on an adaptive similarity matrix is introduced to enhance label prediction accuracy.
(4) Comprehensive experiments conducted on multiple real datasets demonstrate that the proposed SFS-AGGL method surpasses existing representative methods in classification and clustering tasks.
The rest of this paper is organized as follows: Section 2 describes some related work; Section 3 outlines the details of the proposed method and the iterative minimization strategy employed to optimize the objective function; Section 4 introduces the experimental setup and provides a comprehensive analysis of the obtained results, including comparisons with eight state-of-the-art methods on five real datasets; and Section 5 provides a summary of our work in this paper.

2. Related Work

In this section, we first provide some commonly used notations. Then, sparse representation and graph construction methods are introduced. Finally, the label propagation algorithm and graph-based semi-supervised sparse feature selection are briefly reviewed.

2.1. Notations

Let $X = [X_l, X_u] = [x_1, \ldots, x_l, x_{l+1}, \ldots, x_{l+u}] \in \mathbb{R}^{m \times n}$ denote the training samples, where $x_i \in \mathbb{R}^m$ denotes the $i$-th sample. $Y = [Y_l \; Y_u]^T \in \mathbb{R}^{c \times n}$ is the label matrix, and $Y_l$ denotes the true labels of the labeled samples. If sample $x_i$ belongs to class $j$, then its corresponding class label is $Y_{ij} = 1$; otherwise, $Y_{ij} = 0$. $Y_u$ denotes the true labels of the unlabeled samples. Since $Y_u$ is unknown during the training process, it is set to a zero matrix during training [49]. The main symbols used in this paper are presented in Table 1.
Common matrix norms include L1, L2, F, and L21 norms. Their detailed definitions are as follows:
$\|B\|_1 = \sum_{i=1}^{m}\sum_{j=1}^{n}|b_{ij}|$
$\|B\|_2 = \sqrt{\sum_{i=1}^{m} b_i^2}$
$\|B\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} b_{ij}^2}$
$\|B\|_{2,1} = \sum_{i=1}^{m}\|B_i\|_2 = \sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n} b_{ij}^2}$
where $B_i$ is the $i$-th row vector of the matrix $B$. According to matrix computation theory, $\|B\|_{2,1} = \operatorname{tr}(B^T U B)$, where $U \in \mathbb{R}^{m \times m}$ is a diagonal matrix with diagonal elements $u_{ii} = 1/\|B_i\|_2$.
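For readers who prefer a computational reference, the short NumPy sketch below (our illustration, not part of the original paper; the matrix B is an arbitrary example) evaluates the four norms exactly as defined above.

```python
import numpy as np

def matrix_norms(B):
    """L1, L2 (treating B as one long vector), Frobenius, and L2,1 norms as defined above."""
    l1 = np.sum(np.abs(B))                          # sum of absolute entries
    l2 = np.sqrt(np.sum(B ** 2))                    # vectorized L2 norm
    fro = np.linalg.norm(B, 'fro')                  # Frobenius norm
    l21 = np.sum(np.sqrt(np.sum(B ** 2, axis=1)))   # sum of row-wise L2 norms
    return l1, l2, fro, l21

B = np.array([[1.0, -2.0], [0.0, 3.0]])
print(matrix_norms(B))  # (6.0, 3.74..., 3.74..., 5.23...)
```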

2.2. Sparse Representation

Sparse representation is a method that was first developed in signal processing. The core idea of sparse representation is to find a target dictionary with which to describe the signal. To be specific, the original signal can be decomposed into a linear combination of elements in the dictionary. Only a few non-zero elements are used to represent the signal information, while the rest can be ignored. Given a sample $X \in \mathbb{R}^n$ and a target dictionary $D$, it is desired to find a coefficient vector $\alpha$ such that the signal $X$ can be represented as a linear combination of the basic elements of the target dictionary $D$:
$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t. } X = D\alpha.$
where $\alpha \in \mathbb{R}^d$ is a one-dimensional vector [50] and $\|\alpha\|_0$ is the L0 norm of $\alpha$. Due to the non-convexity and discontinuity of the L0 norm, the L1 norm is usually used to replace the L0 norm to obtain an approximate solution, as shown in the following formula:
$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t. } X = D\alpha.$
Compared with the L1 norm, the continuous derivability property of the L2 norm can make the optimization algorithm more intuitive. Hence, the L2 norm is commonly used to control overfitting, which can make the weight parameters of the model smoother and avoid overly complex models, as shown in the following formula:
$\min_{\alpha} \|\alpha\|_2 \quad \text{s.t. } X = D\alpha.$
However, the disadvantage of the L2 norm is that the model parameters will be close to 0, but most of them cannot be 0. Therefore, the L21 norm, which is between the L1 and L2 norms, is proposed as an effective scheme, as shown in Equation (8):
$\min_{\alpha} \|\alpha\|_{2,1} \quad \text{s.t. } X = D\alpha.$
The advantage of the L21 norm is that it can drive the elements of entire rows to zero, thus achieving a sparsity effect similar to that of the L1 norm while being more robust.
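As a small illustration (ours, not from the paper), the penalized form of the L1 problem above can be approximated with an off-the-shelf Lasso solver; the dictionary D and signal x below are random placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 200))                    # dictionary with 200 atoms
x = D[:, [3, 17, 90]] @ np.array([1.5, -2.0, 0.7])    # signal built from 3 atoms

# Penalized (Lagrangian) form of the L1-constrained problem:
# min_a  1/(2n) * ||x - D a||^2 + alpha * ||a||_1
lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(D, x)
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))
```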

2.3. Constructing Graph Methods

The KNN graph is a widely used method for constructing a similarity matrix. $S_{ij}$ denotes the similarity between samples $x_i$ and $x_j$, which is defined as:
$S_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|_2^2}{2\delta^2}\right) & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i), \\ 0 & \text{otherwise}, \end{cases}$
where N k ( x j ) is a set that contains k nearest neighbor samples of the sample x j and δ is a parameter.
From Equation (9), it can be seen that as the samples get closer, their similarity also increases. In addition, there are some similar methods, such as the ϵ-neighborhood method [51] and the fully connected method [52], which can also be utilized to construct graphs.
Unlike the KNN graph, the L1 graph is an adaptive graph construction method that aims to reconstruct each sample by finding the best sparse linear combination of the other samples. The objective function of the L1 graph can be described as follows:
$\min_{\alpha_i} \|\alpha_i\|_1 \quad \text{s.t. } x_i = X\alpha_i, \; \alpha_{ii} = 0.$
Then, the weight matrix formed by the L1 graph is expressed as S = [ α 1 , α 2 , , α N ] . Compared with KNN graphs, L1 graphs can adaptively select the nearest samples for each sample.
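The following sketch (our illustration, not code from the paper) builds the heat-kernel KNN similarity matrix of Equation (9); here samples are stored as rows for convenience, and k and delta are free parameters.

```python
import numpy as np

def knn_graph(X, k=5, delta=1.0):
    """X: n x m data matrix (rows are samples). Returns an n x n symmetric
    heat-kernel similarity matrix restricted to the k nearest neighbours."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]                 # skip the sample itself
        S[i, nn] = np.exp(-d2[i, nn] / (2 * delta ** 2))
    return np.maximum(S, S.T)                           # the "or" condition in Eq. (9)

print(knn_graph(np.random.rand(8, 3), k=3).shape)       # (8, 8)
```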

2.4. Label Propagation Algorithm

The label propagation (LP) algorithm is a graph-based semi-supervised classification method that can effectively classify unknown samples using a small number of labeled samples. In the LP algorithm, similar samples should have similar labels. Therefore, the objective function of LP can be expressed as:
$\min_{F \geq 0} \sum_{i=1}^{N}\sum_{j=1}^{N}\|f_i - f_j\|_2^2 s_{ij} + \sum_{i=1}^{N}\|f_i - y_i\|_2^2 u_{ii} = \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big)$
where $s_{ij}$ can be computed by Equation (9) or Equation (10). $U \in \mathbb{R}^{n \times n}$ is a diagonal matrix that effectively utilizes the category information of all samples in SSL. The diagonal elements of this matrix are defined as follows:
$u_{ii} = \begin{cases} \infty & \text{if } x_i \text{ is labeled}, \\ 0 & \text{otherwise}, \end{cases}$
where the symbol $\infty$ represents a relatively large constant. The first term in Equation (11) is based on the similarity of the data, which assigns similar labels to neighboring samples to keep the graph as smooth as possible. The second term aims to minimize the difference between the matrix F and the label Y; i.e., the sample labels predicted by the trained model should be as consistent as possible with the true labels.
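For concreteness, the sketch below (ours, under the stated assumptions) solves Equation (11) in closed form, replacing the large constant by a finite value and adding a tiny ridge for numerical stability; samples are indexed along the rows of F.

```python
import numpy as np

def label_propagation(S, Y, labeled_mask, big=1e6):
    """S: n x n similarity matrix, Y: n x c one-hot labels (zero rows for the
    unlabeled samples), labeled_mask: boolean array of length n.
    Minimizes tr(F^T L F) + tr((F-Y)^T U (F-Y))."""
    n = S.shape[0]
    L = np.diag(S.sum(axis=1)) - S                   # graph Laplacian L = D - S
    U = np.diag(np.where(labeled_mask, big, 0.0))    # large weight on labeled samples
    # gradient: 2*L*F + 2*U*(F - Y) = 0  ->  F = (L + U)^{-1} U Y
    F = np.linalg.solve(L + U + 1e-9 * np.eye(n), U @ Y)
    return F.argmax(axis=1)                          # predicted class per sample
```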

2.5. The Graph-Based Semi-Supervised Sparse Feature Selection

Sparse learning is widely used in machine learning due to its superior feature extraction capabilities. In this context, sparse regularization terms are used to penalize the projection matrix with the aim of selecting features with high sparsity and high discriminative properties. The following equation is commonly used for sparse feature selection:
$\min_{W} \operatorname{Loss}(X, W, Y) + \theta R(W)$
where $\operatorname{Loss}(X, W, Y)$ is a regression (fitting) term, $R(W)$ is a sparse regularization term, and $\theta \geq 0$ is a regularization weight that balances the two terms.
As we know, selecting features only using the information of the labeled sample is inaccurate and unreliable since the labeled samples are insufficient in SSL. Therefore, it is also necessary to make full use of the information of unlabeled samples to improve the performance. The following model can achieve feature selection by introducing an LP algorithm into the semi-supervised sparse model.
$\min_{W, F} \operatorname{Loss}_1(X, W, F) + \theta R(W) + \alpha \operatorname{Loss}_2(F, Y)$
where L o s s 2 ( F , Y ) is the objective function of the LP algorithm shown in Equation (11).
It can be seen that when constructing the above model, the quality of the constructed similarity matrix directly determines the performance. To alleviate this issue, the following graph-based semi-supervised sparse feature selection model has been developed.
$\min_{W, F \geq 0, S \geq 0} \operatorname{Loss}_1(X, W, Y) + \theta R(W) + \alpha \operatorname{Loss}_2(F, Y, S) + \beta \operatorname{Loss}_3(X, S)$
where $\alpha, \beta, \theta \geq 0$ are model parameters and $\operatorname{Loss}_3(X, S)$ is the objective function of graph learning. Several adaptively constructed graph methods have been designed [38,39,40,41,48,53].

3. The Proposed Method

In this part, a detailed introduction of the SFS model is first presented. Second, a new AGL mechanism is introduced to make full use of the global and local information between the samples, which can acquire the geometric structural information of the original data well. Next, the similarity matrix learned by the AGL mechanism is integrated into the LP algorithm, which enhances label prediction performance and allows the model to classify and cluster unlabeled samples more accurately. Finally, the SFS, AGL, and LP models are fused in a unified framework to propose a novel SFS-AGGL method. Moreover, a new iterative updated algorithm is introduced to optimize the proposed model, and its convergence is confirmed through both theoretical and experimental testing.

3.1. Methodology Model

3.1.1. SFS Model

The L21 sparsity constraint is applied to achieve the process of FS. In combination with LSR, a basic SFS model can be obtained as follows:
$\min_{W} \|X^T W - Y\|_2^2 + \theta \|W\|_{2,1} \quad \text{s.t. } W \geq 0,$
where $W \in \mathbb{R}^{d \times c}$ denotes the feature projection matrix and $\theta$ is a regularization parameter.
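To give a feel for Equation (17), here is a small iteratively reweighted least-squares sketch (our illustration; it drops the non-negativity constraint on W for simplicity and uses the standard subgradient of the L2,1 norm).

```python
import numpy as np

def sparse_fs_irls(X, Y, theta=1.0, n_iter=50, eps=1e-8):
    """Approximate solver for min_W ||X^T W - Y||_F^2 + theta * ||W||_{2,1}.
    X: m x n (columns are samples), Y: n x c label matrix."""
    m = X.shape[0]
    W = np.linalg.solve(X @ X.T + theta * np.eye(m), X @ Y)   # ridge initialisation
    for _ in range(n_iter):
        d = 1.0 / (np.linalg.norm(W, axis=1) + eps)           # row-wise reweighting
        W = np.linalg.solve(X @ X.T + 0.5 * theta * np.diag(d), X @ Y)
    return W

# Features can then be ranked by np.linalg.norm(W, axis=1): rows of W with
# larger norms correspond to more important features.
```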

3.1.2. Global and Local Adaptive Graph Learning (AGGL) Model

Although the sparse model-based approach has achieved good results in FS, there are still some problems; e.g., the above-mentioned sparse model focuses only on the sample–label relationship and ignores the geometric structural information among the samples. To better preserve the original data’s geometric structural information, the method of adaptively constructing the nearest neighbor graph is usually adopted. However, the nearest neighbor information in the original feature space may be disturbed by redundant and noisy features. Previous research has shown that feature projection can effectively mitigate the negative impact of redundant and noisy features [54]. Therefore, when learning the nearest neighbor graph, the similarity matrix should be constructed through adaptive updates of sample similarities and their neighboring samples in the projected feature space. Hence, in this paper, the similarity of samples in both the original high-dimensional space and the low-dimensional space is utilized to describe the local distribution structure more accurately, thus enhancing the effectiveness of the graph learning task. Specifically, we use a coefficient reconstruction method to construct the graph, leading to the following model:
$\min_{W, S} \|W^T X - W^T X S\|_2^2 \quad \text{s.t. } W \geq 0, S \geq 0,$
where $S \in \mathbb{R}^{n \times n}$ and $s_i = [s_{i1}, s_{i2}, \ldots, s_{in}]^T \in \mathbb{R}^{n \times 1}$ denotes the reconstruction coefficient vector of sample $x_i$.
The maintenance of global and local sample information is crucial for sample reconstruction. That is, the similarity between the sample that needs to be reconstructed and its surrounding samples should be maintained in the process of sample reconstruction. To achieve this goal, we have incorporated global and local constraints into the sample reconstruction process. This ensures that the sample points are better reconstructed by the most adjacent sample points, thereby improving the quality of the construction graph. Specifically, we have combined global and local constraints with sparse learning to reconstruct samples, as shown in the following formula:
$\|S \odot E\|_1$
where E = [ e i j ] R n × n and each element e i j in E is defined as:
$e_{ij} = \exp\left(\frac{\|x_i - x_j\|^2}{\sigma^2}\right)$
By combining Equations (17) and (18), the following adaptive graph construction model with global and local constraints is obtained:
$\min_{W, S} \|W^T X - W^T X S\|_2^2 + \lambda \|S \odot E\|_1 \quad \text{s.t. } W \geq 0, S \geq 0,$
where λ > 0 is the balance coefficient, which aims to balance the effects of the coefficient reconstruction term and the global and local constraint terms. By constructing the above model, we can effectively maintain the global and local information of the sample, thereby enhancing the similarity matrix of the graph.

3.1.3. Objective Function

As can be seen in Equation (17), the SFS model only utilizes the labeling information of the data. It ignores the spatial distribution of the labels, making it difficult to select the ideal subset of features. It has been shown that the structural distribution information embedded in unlabeled data is very important for FS when there is less labeling information [55]. For this reason, we have introduced the LP algorithm. Meanwhile, to make the LP process more efficient, we have introduced the adaptive graph coefficient matrix obtained by Equation (20) into LP. Therefore, a new SFS-AGGL algorithm is proposed by integrating SFS, AGGL, and LP into a unified learning framework. SFS-AGGL can account for both global and local sample information, and it is robust for FS. The objective function of SFS-AGGL is:
$\min_{W, F, S} \varepsilon(W, F, S) = \beta \|W^T X - W^T X S\|_2^2 + \lambda \|S \odot E\|_1 + \alpha \sum_{i,j=1}^{n}\|f_i - f_j\|_2^2 S_{ij} + \sum_{i=1}^{n}\|f_i - y_i\|_2^2 u_{ii} + \|X^T W - F\|_2^2 + \theta \|W\|_{2,1} \quad \text{s.t. } W \geq 0, F \geq 0, S \geq 0,$
where $\alpha, \beta, \theta, \lambda > 0$ are balance parameters to be tuned in the experiments, and $\odot$ denotes the element-wise (Hadamard) product of matrices.
As shown in Equation (21), we first efficiently obtained the constructive coefficients by imposing global and local constraints while self-representing the low-dimensional features. Therefore, it can avoid possible redundant information to affect the learning performance due to predefined matrices not being introduced. Second, we introduced the similarity matrix obtained by AGL into the LP process to improve the accuracy of label prediction. In addition, to enhance the discriminative performance of the selected features, we introduced a predictive labeling matrix into the SFS process and completed the FS by mutual reinforcement of the three models: SFS, AGGL, and LP.
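To make the roles of the individual terms of Equation (21) concrete, the sketch below (our reading, with α weighting only the graph-smoothness term, as in the later derivation, and with samples indexed along the rows of F and Y so that the diagonal weight U sits between the transposed factors) evaluates the objective for given W, F, and S.

```python
import numpy as np

def locality_matrix(X, sigma=1.0):
    """E from Eq. (19); X is m x n with samples stored as columns, as in the paper."""
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    return np.exp(d2 / sigma ** 2)                  # larger penalty for distant pairs

def sfs_aggl_objective(X, Y, W, F, S, U, E, alpha, beta, theta, lam):
    """Value of Eq. (21): AGGL term + locality penalty + LP terms
    + regression term + L2,1 sparsity."""
    L = np.diag(S.sum(axis=1)) - S                                  # Laplacian of S
    aggl = beta * np.linalg.norm(W.T @ X - W.T @ X @ S) ** 2
    local = lam * np.sum(np.abs(S * E))                             # ||S (Hadamard) E||_1
    lp = alpha * np.trace(F.T @ L @ F) + np.trace((F - Y).T @ U @ (F - Y))
    reg = np.linalg.norm(X.T @ W - F) ** 2
    sparse = theta * np.sum(np.sqrt(np.sum(W ** 2, axis=1)))        # ||W||_{2,1}
    return aggl + local + lp + reg + sparse
```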

3.2. Model Optimization

The objective function of the SFS-AGGL method involves three variables, i.e., the feature projection matrix W, the prediction label matrix F, and the similarity matrix S. Since the objective function is not jointly convex in the three variables, it cannot be optimized directly. However, it is convex with respect to each variable when the other two are fixed. Therefore, we can solve it step-by-step by performing convex optimization on each variable separately. The specific process of solving the objective function is as follows:
(1) Fix the variables F and S, and update the variable W
Simplifying Equation (21) by removing the terms unrelated to the variable W, the following optimization function is obtained:
$\min \varepsilon(W) = \|X^T W - F\|_2^2 + \beta \|W^T X - W^T X S\|_2^2 + \theta \|W\|_{2,1}$
From the definition of matrix trace, Equation (23) can be derived from Equation (22) by using a simple algebraic transformation as follows:
$\min \varepsilon(W) = \operatorname{tr}\big((X^T W - F)(X^T W - F)^T\big) + \beta \operatorname{tr}\big((W^T X - W^T X S)(W^T X - W^T X S)^T\big) + \theta \operatorname{tr}(W^T H W) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T\big) + \beta \operatorname{tr}\big(W^T X X^T W - 2 W^T X S^T X^T W + W^T X S S^T X^T W\big) + \theta \operatorname{tr}(W^T H W)$
To solve Equation (23), a Lagrange multiplier and the corresponding Lagrange functions are introduced, which can be constructed as follows:
$\varepsilon(W, \vartheta) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T + \beta W^T X X^T W - 2\beta W^T X S^T X^T W + \beta W^T X S S^T X^T W + \theta W^T H W\big) + \operatorname{tr}(\vartheta W)$
Next, the partial derivative regarding the variable W is computed and then set to 0 as follows:
$\frac{\partial \varepsilon(W, \vartheta)}{\partial W} = 2 X X^T W - 2 X F + 2\beta X X^T W - 4\beta X S^T X^T W + 2\beta X S S^T X^T W + 2\theta H W + \vartheta = 0$
Meanwhile, by combining the Karush–Kuhn–Tucker (KKT) condition ( ϑ i j W i j = 0 ) , we can obtain Equation (26) as follows:
$\big(2 X X^T W - 2 X F + 2\beta X X^T W - 4\beta X S^T X^T W + 2\beta X S S^T X^T W + 2\theta H W\big)_{ij} W_{ij} = 0$
Therefore, an updated rule for the variable W can be obtained:
$W_{ij} = W_{ij} \dfrac{[X F + 2\beta X S^T X^T W]_{ij}}{[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}}$
(2) Fix the variables W and S, and update the variable F
We first remove the terms unrelated to the variable F from Equation (21), and the optimization function on the variable F is acquired as:
$\min \varepsilon(F) = \|X^T W - F\|_2^2 + \alpha \sum_{i,j=1}^{n}\|f_i - f_j\|_2^2 S_{ij} + \sum_{i=1}^{n}\|f_i - y_i\|_2^2 u_{ii}$
According to the definition of matrix trace, we can use a simple algebraic transformation to obtain Equation (29) as follows:
$\min \varepsilon(F) = \operatorname{tr}\big((X^T W - F)(X^T W - F)^T\big) + \alpha \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T\big) + \alpha \operatorname{tr}(F^T L F) + \operatorname{tr}\big(F U F^T - 2 F U Y^T + Y U Y^T\big) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T + \alpha F^T L F + F U F^T - 2 F U Y^T + Y U Y^T\big)$
Next, we have introduced a Lagrange multiplier to optimize Equation (29), and the corresponding Lagrange function can be defined as follows:
$\varepsilon(F, \mu) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T + \alpha F^T L F + F U F^T - 2 F U Y^T + Y U Y^T\big) + \operatorname{tr}(\mu F)$
Then, we have calculated the partial derivative with respect to the variable F and set it to 0 as follows:
$\frac{\partial \varepsilon(F, \mu)}{\partial F} = -2 X^T W + 2 F + 2\alpha L F + 2 F U - 2 Y U^T + \mu = 0$
Following the KKT condition ( μ i j F i j = 0 ) , we can derive Equation (32) as shown:
$\big(-2 X^T W + 2 F + 2\alpha L F + 2 F U - 2 Y U^T\big)_{ij} F_{ij} = 0$
Finally, we have provided an iterative updated rule for the variable F as follows:
$F_{ij} = F_{ij} \dfrac{[X^T W + Y U^T]_{ij}}{[F + \alpha L F + F U]_{ij}}$
(3) Fix the variables W and F, and update the variable S
Likewise, by removing the terms unrelated to the variable S, the optimization function becomes the following form:
$\min \varepsilon(S) = \alpha \sum_{i,j=1}^{n}\|f_i - f_j\|_2^2 S_{ij} + \beta \|W^T X - W^T X S\|_2^2 + \lambda \|S \odot E\|_1$
Equation (34) can be reduced to Equation (35) as follows:
$\min \varepsilon(S) = \alpha \operatorname{tr}(F^T L F) + \beta \operatorname{tr}\big((W^T X - W^T X S)(W^T X - W^T X S)^T\big) + \lambda \|S \odot E\|_1 = \alpha \operatorname{tr}(F^T L F) + \beta \operatorname{tr}\big(W^T X X^T W - 2 W^T X S^T X^T W + W^T X S S^T X^T W\big) + \lambda \|S \odot E\|_1 = \operatorname{tr}\big(\alpha F^T D F - \alpha F^T S F + \beta W^T X X^T W - 2\beta W^T X S^T X^T W + \beta W^T X S S^T X^T W\big) + \lambda \|S \odot E\|_1$
Here, a Lagrange multiplier is utilized to determine the optimal solution of Equation (35), and the related Lagrange function is formulated as:
$\varepsilon(S, \xi) = \operatorname{tr}\big(\alpha F^T D F - \alpha F^T S F + \beta W^T X X^T W - 2\beta W^T X S^T X^T W + \beta W^T X S S^T X^T W\big) + \lambda \|S \odot E\|_1 + \operatorname{tr}(\xi S)$
The partial derivative regarding the variable S is then set to 0 as follows:
$\frac{\partial \varepsilon(S, \xi)}{\partial S} = -\alpha F F^T - 2\beta X^T W W^T X + 2\beta X^T W W^T X S + \lambda E + \xi = 0$
Since the KKT condition ( ξ i j S i j = 0 ) exists, we can obtain Equation (38) as follows:
$\big(-\alpha F F^T - 2\beta X^T W W^T X + 2\beta X^T W W^T X S + \lambda E\big)_{ij} S_{ij} = 0$
Therefore, an expression of the following form for the variable S can be obtained:
$S_{ij} = S_{ij} \dfrac{[\alpha F F^T + 2\beta X^T W W^T X]_{ij}}{[2\beta X^T W W^T X S + \lambda E]_{ij}}$

3.3. Algorithm Description

Algorithm 1 describes the SFS-AGGL method in detail, while Figure 2 depicts its flowchart. Moreover, the SFS-AGGL algorithm stops iterating when the alteration of the objective function value between consecutive iterations is below a threshold or the maximum number of iterations is reached.
Algorithm 1: SFS-AGGL
Input: Sample matrix $X = [X_L, X_U] \in \mathbb{R}^{d \times n}$
    Label matrix $Y = [Y_l; Y_u]^T \in \mathbb{R}^{n \times c}$
    Parameters $\alpha \geq 0$, $\beta \geq 0$, $\theta \geq 0$, $\lambda \geq 0$
Output: Feature Projection Matrix W
      Predictive Labeling Matrix F
      Similarity Matrix S
1: Initialization: initialize the non-negative matrices $W^0$, $F^0$, $S^0$; set $iter = 0$;
2: Compute the matrix $U$ according to Equation (12);
3: Repeat
4:   Update $W^{iter}$ according to Equation (27): $W \leftarrow W \odot \dfrac{X F + 2\beta X S^T X^T W}{X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W}$ (element-wise division);
5:   Update $F^{iter}$ according to Equation (33): $F \leftarrow F \odot \dfrac{X^T W + Y U^T}{F + \alpha L F + F U}$;
6:   Update $S^{iter}$ according to Equation (39): $S \leftarrow S \odot \dfrac{\alpha F F^T + 2\beta X^T W W^T X}{2\beta X^T W W^T X S + \lambda E}$;
7:   Update $E$ according to Equation (19);
8:   Update $iter = iter + 1$;
9: Until convergence
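For readers who want to trace Algorithm 1 step by step, the following NumPy sketch (ours; a minimal re-implementation of the multiplicative updates (27), (33), and (39) under the row-sample convention for F and Y, so the diagonal weight U multiplies from the left where the paper writes the transposed form) shows one possible realisation. It assumes non-negative input data, as in the image datasets used later, and adds a small epsilon to avoid division by zero.

```python
import numpy as np

def sfs_aggl(X, Y, U, E, alpha, beta, theta, lam, n_iter=200, eps=1e-12):
    """X: m x n (columns are samples), Y: n x c soft labels, U, E: n x n.
    Returns W (m x c), F (n x c), S (n x n)."""
    m, n = X.shape
    c = Y.shape[1]
    rng = np.random.default_rng(0)
    W, F, S = rng.random((m, c)), rng.random((n, c)), rng.random((n, n))
    for _ in range(n_iter):
        H = np.diag(1.0 / (np.sqrt(np.sum(W ** 2, axis=1)) + eps))  # subgradient of ||W||_{2,1}
        L = np.diag(S.sum(axis=1)) - S                              # Laplacian of the current S
        XXt = X @ X.T
        # Update W (Eq. (27))
        numW = X @ F + 2 * beta * X @ S.T @ X.T @ W
        denW = XXt @ W + beta * XXt @ W + beta * X @ S @ S.T @ X.T @ W + theta * H @ W
        W *= numW / (denW + eps)
        # Update F (Eq. (33)); U acts from the left under the row-sample convention
        numF = X.T @ W + U @ Y
        denF = F + alpha * L @ F + U @ F
        F *= numF / (denF + eps)
        # Update S (Eq. (39))
        G = X.T @ W @ W.T @ X
        numS = alpha * F @ F.T + 2 * beta * G
        denS = 2 * beta * G @ S + lam * E
        S *= numS / (denS + eps)
        # E depends only on X (Eq. (19)), so it stays fixed in this sketch
    return W, F, S
```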

3.4. Computational Complexity and Convergence Analysis

3.4.1. Computational Complexity Analysis

Based on Algorithm 1, the computational complexity of the SFS-AGGL algorithm comprises two parts. The first part is the computation of the diagonal auxiliary matrix U in step 2, and the second part is the updating of the three matrices (W, F, and S) during each iteration and the computation of the local matrix E in step 7. The computational cost of each component is given in Table 2. Therefore, the total complexity of the SFS-AGGL algorithm is $O\big(\max(kn^2, cn^2) + iter \times \max(cmn, cn^2)\big)$, where iter is the iteration count. Furthermore, the computational complexities of other related FS methods are also presented in Table 3.

3.4.2. Proof of Convergence

Definition 1. 
If the functions $\varphi(q, q')$ and $\psi(q)$ satisfy the two conditions shown in Equation (40), then $\varphi(q, q')$ is an auxiliary function of $\psi(q)$:
$\varphi(q, q') \geq \psi(q), \quad \varphi(q, q) = \psi(q).$
Lemma 1. 
If Definition 1 holds, then $\psi(q)$ is non-increasing under the update rule in Equation (41):
$q^{(iter+1)} = \arg\min_{q} \varphi(q, q^{(iter)})$
Proof. 
$\psi(q^{(iter+1)}) \leq \varphi(q^{(iter+1)}, q^{(iter)}) \leq \varphi(q^{(iter)}, q^{(iter)}) = \psi(q^{(iter)})$
It is therefore only necessary to show that the objective function is non-increasing in the variables W, F, and S under the update rules, as shown in Equation (42). For this purpose, we have computed and presented the first- and second-order derivatives of each formula in Table 4. □
Lemma 2. 
$\varphi(W_{ij}, W_{ij}^{(iter)}) = \psi_{ij}(W_{ij}^{(iter)}) + \psi'_{ij}(W_{ij}^{(iter)})(W_{ij} - W_{ij}^{(iter)}) + \frac{[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}}{W_{ij}^{(iter)}}(W_{ij} - W_{ij}^{(iter)})^2$
$\varphi(F_{ij}, F_{ij}^{(iter)}) = \psi_{ij}(F_{ij}^{(iter)}) + \psi'_{ij}(F_{ij}^{(iter)})(F_{ij} - F_{ij}^{(iter)}) + \frac{[F + \alpha L F + F U]_{ij}}{F_{ij}^{(iter)}}(F_{ij} - F_{ij}^{(iter)})^2$
$\varphi(S_{ij}, S_{ij}^{(iter)}) = \psi_{ij}(S_{ij}^{(iter)}) + \psi'_{ij}(S_{ij}^{(iter)})(S_{ij} - S_{ij}^{(iter)}) + \frac{[2\beta X^T W W^T X S + \lambda E]_{ij}}{S_{ij}^{(iter)}}(S_{ij} - S_{ij}^{(iter)})^2$
Equations (43)–(45) are auxiliary functions of the corresponding $\psi_{ij}$.
Proof. 
The second-order Taylor series expansion of $\psi_{ij}(W_{ij})$ around $W_{ij}^{(iter)}$ is:
$\psi_{ij}(W_{ij}) = \psi_{ij}(W_{ij}^{(iter)}) + \psi'_{ij}(W_{ij}^{(iter)})(W_{ij} - W_{ij}^{(iter)}) + \frac{1}{2}\psi''_{ij}(W_{ij}^{(iter)})(W_{ij} - W_{ij}^{(iter)})^2 = \psi_{ij}(W_{ij}^{(iter)}) + \psi'_{ij}(W_{ij}^{(iter)})(W_{ij} - W_{ij}^{(iter)}) + \big(X X^T + \beta X X^T - 2\beta X S X^T + \beta X S S^T X^T + \theta H^T\big)_{ii}(W_{ij} - W_{ij}^{(iter)})^2$
$\varphi(W_{ij}, W_{ij}^{(iter)}) \geq \psi_{ij}(W_{ij})$ is equivalent to:
$\frac{[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}}{W_{ij}^{(iter)}} \geq \big(X X^T + \beta X X^T - 2\beta X S X^T + \beta X S S^T X^T + \theta H^T\big)_{ii}$
By comparing Equations (43) and (46), this inequality holds because of Equations (48)–(50) below; therefore $\varphi(W_{ij}, W_{ij}^{(iter)}) \geq \psi_{ij}(W_{ij})$ holds, and $\varphi(W_{ij}, W_{ij}) = \psi_{ij}(W_{ij})$ also holds.
$[X X^T W]_{ij} = \sum_{k=1}^{r}(X X^T)_{ik} W_{kj}^{(iter)} \geq (X X^T)_{ii} W_{ij}^{(iter)}$
$[X S S^T X^T W]_{ij} = \sum_{k=1}^{r}(X S S^T X^T)_{ik} W_{kj}^{(iter)} \geq (X S S^T X^T)_{ii} W_{ij}^{(iter)}$
$[H W]_{ij} = \sum_{k=1}^{r} H_{ik} W_{kj}^{(iter)} \geq H_{ii} W_{ij}^{(iter)}$
Similarly, it is possible to prove Equations (44) and (45). Finally, based on Lemma 1, the update schemes for the variables W, F, S are derived in this paper as shown in Equations (51)–(53).
Theorem 1. 
For $W \geq 0$, $F \geq 0$, and $S \geq 0$, the objective function is non-increasing under the iterative update formulas (27), (33), and (39).
Proof. 
 
Bringing Equation (43) into Equation (41):
$W_{ij}^{(iter+1)} = \arg\min_{W_{ij}} \varphi(W_{ij}, W_{ij}^{(iter)}) = W_{ij}^{(iter)} - W_{ij}^{(iter)}\frac{\psi'_{ij}(W_{ij}^{(iter)})}{2[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}} = W_{ij}^{(iter)}\frac{[X F + 2\beta X S^T X^T W]_{ij}}{[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}}$
Bringing Equation (44) into Equation (41):
$F_{ij}^{(iter+1)} = \arg\min_{F_{ij}} \varphi(F_{ij}, F_{ij}^{(iter)}) = F_{ij}^{(iter)} - F_{ij}^{(iter)}\frac{\psi'_{ij}(F_{ij}^{(iter)})}{2[F + \alpha L F + F U]_{ij}} = F_{ij}^{(iter)}\frac{[X^T W + Y U^T]_{ij}}{[F + \alpha L F + F U]_{ij}}$
Bringing Equation (45) into Equation (41):
$S_{ij}^{(iter+1)} = \arg\min_{S_{ij}} \varphi(S_{ij}, S_{ij}^{(iter)}) = S_{ij}^{(iter)} - S_{ij}^{(iter)}\frac{\psi'_{ij}(S_{ij}^{(iter)})}{2[2\beta X^T W W^T X S + \lambda E]_{ij}} = S_{ij}^{(iter)}\frac{[\alpha F F^T + 2\beta X^T W W^T X]_{ij}}{[2\beta X^T W W^T X S + \lambda E]_{ij}}$
Since Equations (43)–(45) are auxiliary functions of $\psi_{ij}$, $\psi_{ij}$ is non-increasing under the respective update rules in Equations (51)–(53).
Next, the upcoming focus will be on demonstrating the convergence of iteration-based Algorithm 1.
For any non-zero vectors u R m and v R m , the following inequalities exist:
$\|u\|_2 - \frac{\|u\|_2^2}{2\|v\|_2} \leq \|v\|_2 - \frac{\|v\|_2^2}{2\|v\|_2}$
The proof of Equation (54) can be found in the literature [55]. □
Theorem 2. 
Referring to Algorithm 1, Equation (21) decreases in each iteration until it converges.
Proof. 
Let $H^{(iter)}$ denote the matrix H computed at the iter-th iteration. Then, updating $W^{(iter+1)}$, $F^{(iter+1)}$, and $S^{(iter+1)}$ yields the following inequality:
$\phi(W^{(iter+1)}, F^{(iter+1)}, S^{(iter+1)}, H^{(iter)}) \leq \phi(W^{(iter)}, F^{(iter)}, S^{(iter)}, H^{(iter)})$
According to Equation (55), we obtain:
For brevity, let $\Psi(W, F, S) = \operatorname{tr}\big((X^T W - F)(X^T W - F)^T\big) + \alpha \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) + \beta \operatorname{tr}\big((W^T X - W^T X S)(W^T X - W^T X S)^T\big) + \lambda \|S \odot E\|_1$ denote all terms of Equation (21) except the $\theta\|W\|_{2,1}$ term, and write $\Psi^{(t)} = \Psi(W^{(t)}, F^{(t)}, S^{(t)})$. Then Equation (55) can be written as:
$\Psi^{(iter+1)} + \theta \operatorname{tr}\big((W^{(iter+1)})^T H^{(iter)} W^{(iter+1)}\big) \leq \Psi^{(iter)} + \theta \operatorname{tr}\big((W^{(iter)})^T H^{(iter)} W^{(iter)}\big)$
Again, based on the definition of matrix H i t e r , Equation (56) can be rewritten as:
$\Psi^{(iter+1)} + \theta \sum_{i=1}^{m}\frac{\|(W^{(iter+1)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2} \leq \Psi^{(iter)} + \theta \sum_{i=1}^{m}\frac{\|(W^{(iter)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2}$
Thus, there is the following inequality:
$\Psi^{(iter+1)} + \theta \sum_{i=1}^{m}\|(W^{(iter+1)})^i\|_2 - \theta\left(\sum_{i=1}^{m}\|(W^{(iter+1)})^i\|_2 - \sum_{i=1}^{m}\frac{\|(W^{(iter+1)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2}\right) \leq \Psi^{(iter)} + \theta \sum_{i=1}^{m}\|(W^{(iter)})^i\|_2 - \theta\left(\sum_{i=1}^{m}\|(W^{(iter)})^i\|_2 - \sum_{i=1}^{m}\frac{\|(W^{(iter)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2}\right)$
According to Equation (54), we have:
$\sum_{i=1}^{m}\|(W^{(iter+1)})^i\|_2 - \sum_{i=1}^{m}\frac{\|(W^{(iter+1)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2} \leq \sum_{i=1}^{m}\|(W^{(iter)})^i\|_2 - \sum_{i=1}^{m}\frac{\|(W^{(iter)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2}$
Considering Equations (55)–(59) together, the following results can be obtained:
$\Psi^{(iter+1)} + \theta \sum_{i=1}^{m}\|(W^{(iter+1)})^i\|_2 \leq \Psi^{(iter)} + \theta \sum_{i=1}^{m}\|(W^{(iter)})^i\|_2$
The inequality in Equation (60) shows that the value of the objective function does not increase from one iteration to the next, indicating the optimization algorithm’s progress toward a more optimal solution at each step. In addition, since the objective function is bounded from below, our proposed optimization algorithm will converge. We also conducted numerical experiments to further verify the effectiveness of the optimization algorithm, and the experimental results demonstrate that the objective function value consistently decreases as the number of iterations increases.

4. Experiment and Analysis

In this section, the effectiveness of the proposed method is validated on classification and clustering tasks, respectively. We first used five image classification datasets to test the classification performance of the proposed method and then employed two image datasets and two subsets of UCI data to verify the clustering performance of the proposed method. In the experiment, we compared our proposed method with some contemporary UFS and SSFS methods, including two UFS methods (SPNFSR [56] and NNSAFS [57]) and six SSFS methods (RLSR [19], FDEFS [50], GS3FS [43], S2LFS [44], AGLRM [47], and ASLCGLFS [48]).

4.1. Description of the Comparison Methods

In order to verify the effectiveness of our method and to comprehensively evaluate its strengths and weaknesses, we compared it with several classical and recent benchmark methods for unsupervised and semi-supervised FS that are closely related to our approach. Our method can be regarded as an improvement and extension of these existing methods.
(1) SPNFSR is a UFS algorithm that uses a low-rank representation graph for maintaining feature structures, and it achieves FS by using the L21 norm and non-negative constraints on the reconstruction coefficient matrix. The objective function of the SPNFSR method can be defined as follows:
$\min_{W} \|X - X W\|_{2,1} + \alpha \operatorname{tr}(W^T M W) + \beta \|W\|_{2,1} \quad \text{s.t. } W \geq 0,$
where the matrix M is obtained by solving the low-rank representation. In the SPNFSR method, the processes of graph construction and feature selection are performed independently, so the quality of the matrix M will directly affect the performance of feature selection.
(2) NNSAFS is a UFS algorithm that employs adaptive rank constraints and non-negative spectral feature learning. It employs sparse regression and feature mapping to mine the local structural information of the feature space to improve the adaptability of manifold learning. The objective function of NNSAFS can be defined as follows:
$\min \|X^T W - F\|_2^2 + \alpha_1 \|W\|_1 + \alpha_2 \operatorname{Tr}(W^T L_W W) + \lambda \operatorname{Tr}(F^T L_S F) + \beta \sum_{ij}(s_{ij}\log s_{ij}) \quad \text{s.t. } W \geq 0, \; F^T F = I, \; \sum_{j=1}^{n} s_{ij} = 1, \; s_{ij} > 0,$
where $\sum_{ij}(s_{ij}\log s_{ij})$ is an entropy regularization term used to estimate the uniformity of the matrix S. Compared with the SPNFSR method, NNSAFS integrates graph learning and feature selection into one framework to overcome the shortcoming of SPNFSR. Moreover, the local structural information of the learned features is also considered. However, since NNSAFS and SPNFSR are unsupervised methods and do not consider the label information of the data, they cannot select features with good discriminability.
(3) RLSR is an SSFS method, which identifies key features by learning the global and sparse solutions of the feature projection matrix. It also redefines regression coefficients with a deflation factor, as shown in Equation (63):
$\min_{W, b, Y_U} \|X^T W + \mathbf{1} b^T - Y\|_F^2 + \gamma \|W\|_{2,1}^2 \quad \text{s.t. } Y_U \geq 0, \; Y_U \mathbf{1} = \mathbf{1}.$
Different from the SPNFSR and NNSAFS methods, RLSR is a semi-supervised selection method that can use both labeled and unlabeled samples to improve the discriminability of features. Moreover, it also uses the L21 norm instead of the L1 norm to reduce the redundancy of selected features.
(4) FDEFS is a supervised or semi-supervised FS method that combines margin discriminant embedding, manifold embedding, and sparse regression to achieve feature selection.
$\min \mu\big(\|W\|_{2,1} + \gamma \|X^T W + \mathbf{1}_n b^T - F\|_2^2\big) + \operatorname{tr}(Z^T L_1 Z) \quad \text{s.t. } L_1 = L + \lambda \tilde{M}_l,$
where M l ˜ is a square matrix, and the detailed calculation procedure is provided in [50]. FDEFS can be regarded as an extension of RLSR by combining discriminant embedding terms and manifold embedding terms to enhance the discriminability of selected features.
(5) GS3FS is a robust graph-based SSFS method that selects relevant and sparse features through manifold learning and the L2p norm imposed on the regularization and loss functions.
$\min \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) + \|X^T W + \mathbf{1}_n b^T - F\|_{2,p}^p + \lambda \|W\|_{2,p}^p \quad \text{s.t. } F \geq 0, \; p \in (0, 1].$
Compared with the FDEFS method, GS3FS first integrates the LP into FDEFS. Moreover, GS3FS uses the L2p norm instead of the L21 norm to highlight the robustness of the selected features.
(6) S2LFS is a novel SSFS that can select different subsets for different categories rather than selecting one subset for all categories.
$\min \sum_{k=1}^{c}\|g_k - X^T w_k\|_2 + \lambda \sum_{k=1}^{c} w_k^T \operatorname{diag}(z_k)^{-1} w_k + \beta\big(\operatorname{tr}(G^T L G) + \operatorname{tr}((G - Y)^T U (G - Y))\big) \quad \text{s.t. } G \geq 0, \; G^T G = I_c, \; z_k \geq 0, \; z_k^T \mathbf{1}_d = 1,$
where z k is an indicator vector representing whether a feature is chosen or not for the k-th class, and w k is the prediction function for the k-th class based on the selected features.
(7) AGLRM uses AGL techniques to enhance similarity matrix construction and mitigate the adverse impact of redundant features by minimizing redundant terms.
$\min \gamma \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) + \alpha \|S\|_F^2 + \operatorname{tr}(W^T X L X^T W) + \theta \operatorname{tr}(W^T A W) + \|X^T W + \mathbf{1}_n b^T - F\|_F^2 + \lambda \|W\|_{2,1} \quad \text{s.t. } 0 \leq S_{ij} \leq 1, \; S_i \mathbf{1}_n = 1,$
where A is a matrix of correlation coefficients for evaluating feature correlations.
Although the performance of the AGLRM method is superior to other methods, it still has shortcomings. First, the weight matrix of the graph is constrained by the L2 norm, which results in the graph lacking a sparse structure. Second, global constraints are not considered in the graph learning process, which leads to neglect of the distribution of the data and failure to explore more effective feature similarity metrics, thus affecting the performance of the method.
(8) ASLCGLFS improves similarity matrix quality by integrating label information into AGL. Additionally, it considers both local and global structures of the samples, thereby reducing redundancy in the selected features.
$\min \|X^T W - F\|_F^2 + \sum_{i,j}^{n}\|W^T(X_i - X_j)\|_2^2 S_{ij} + \alpha \|S - A\|_F^2 + \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) + \|W^T X - W^T X Z\|_F^2 + \beta \|Z\|_{2,1} + \lambda \|W\|_{2,1} \quad \text{s.t. } 0 \leq S_{ij} \leq 1, \; S_i^T \mathbf{1}_n = 1, \; \alpha, \beta, \lambda \geq 0.$
As an improvement of AGLRM, ASLCGLFS considers global information. However, the introduction of a predefined similarity matrix may bring in redundant information, which affects the learning performance. Therefore, instead of introducing predefined matrices, we consider imposing new global and local constraints to learn global and local information and to reduce redundancy, in order to improve the performance of feature selection.

4.2. Classification Experiments

4.2.1. Classification Datasets

Five publicly available image datasets were used in the classification experiment, which includes four face classification datasets (AR [58], CMU PIE [59], Extended YaleB [60], ORL [61]) and one object classification dataset (COIL20 [62]). Table 5 presents the detailed information of these datasets, in which P1 and P2 indicate training and test samples per category, respectively.
The AR dataset is a widely used standard database consisting of more than 4000 color facial images. These images are from 126 faces, including 56 females and 70 males. The images in this dataset have variable expressions, lighting changes, and external occlusions. Figure 3a shows some images from this database.
The CMU PIE dataset consists of 41,368 grayscale facial images of 68 individuals. These images cover subjects of different ages, genders, and skin tones with different postural conditions, lighting environments, and expressions. Figure 3b shows some examples in this dataset.
The Extended YaleB dataset was collected from 38 subjects; 64 photos were selected for each subject under different poses, different lighting environments, and 5 different shooting angles. This dataset has a total of 2414 face images. Figure 3c shows some images from the Extended YaleB dataset.
The ORL dataset contains 400 images of faces from 40 volunteers. Each volunteer provided 10 images with different facial postures, facial expressions, and facial ornaments obscured, such as serious or smiling, eyes up or squinting, and wearing or not wearing accessories. Some of the examples from this dataset can be observed in Figure 3d.
The COIL20 dataset comprises 1440 images featuring 20 different subjects. A total of 72 images were taken for each subject at 5-degree intervals. Some of the images from COIL20 are shown in Figure 3e.
It should be mentioned that in most existing work, these face databases (AR, CMU PIE, Extended YaleB, and ORL) are commonly used to evaluate the performance of the method because of the following aspects: (1) each database has different numbers of original data and categories; (2) each database contains different types of face variations; (3) each database has different conditions and environments for image acquisition. By using these classical facial datasets to evaluate our proposed method, we can ensure that our experimental results are adequately comparable to previous findings, thus better assessing the novelty and effectiveness of our proposed method in the field of face recognition.

4.2.2. Evaluation Metric

The accuracy rate [63] is employed to measure the performance of SFS-AGGL on the classification task, which is represented as:
$ACC = \frac{TP + TN}{TP + FP + FN + TN} \times 100\%$
where TP and TN represent the numbers of correctly identified positive and negative samples. Additionally, false positive (FP) and false negative (FN) signify the misclassification of negative samples as positive and positive samples as negative, respectively. A higher accuracy rate value indicates improved classification performance.

4.2.3. Experimental Setup for Classification Task

In this experiment, P1 samples are randomly selected from each class for training, and the remaining P2 samples are used for testing. Then, an FS model is used to select a limited number of relevant features from the training data, and the model’s effectiveness is assessed by applying KNN to the testing samples using only the selected subset of features. For the sake of experimental fairness and reliability, each experiment is conducted 10 times using different training data, and the final experimental results are reported as the average classification accuracy and standard deviation. In addition, to select the optimal parameters, we used the grid search method to find the optimal values of the parameters α, β, θ, and λ in the range {0.001, 0.01, 0.1, 1, 10, 100, 1000} and the optimal number of iterations in {100, 200, 300, 400, 500, 600}. The dimensions of the selected features vary from 50 to 500 in increments of 50.
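A compact sketch of the evaluation step described above (our illustration; in the real experiments W would come from SFS-AGGL trained on each random split, with the four parameters chosen by grid search, whereas here random data and a random W are used just to show the mechanics of ranking features by the row norms of W and classifying with KNN):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy_with_selected_features(X_tr, y_tr, X_te, y_te, W, n_feat=200, k=1):
    """Rank features by the L2 norm of the corresponding rows of W (a common
    convention for L2,1-based selectors), keep the top n_feat, classify with KNN."""
    idx = np.argsort(-np.linalg.norm(W, axis=1))[:n_feat]
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_tr[:, idx], y_tr)
    return clf.score(X_te[:, idx], y_te)

rng = np.random.default_rng(0)
X_tr, X_te = rng.random((60, 500)), rng.random((40, 500))
y_tr, y_te = rng.integers(0, 5, 60), rng.integers(0, 5, 40)
W = rng.random((500, 5))                        # placeholder projection matrix
print(knn_accuracy_with_selected_features(X_tr, y_tr, X_te, y_te, W, n_feat=100))
```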

4.2.4. Analysis of Classification Results

(1) Parameter sensitivity analysis of classification
The effects of feature dimension (d), number of iterations (iter), and four balance parameters (α, β, θ, λ) on the performance of SFS-AGGL in the classification task are investigated. To assess SFS-AGGL’s performance across varied experimental scenarios, the number of feature dimensions, iteration times, and the values of four balancing parameters were adjusted.
First, we have demonstrated the influence of different iteration times on the performance of SFS-AGGL, with the remaining parameters set to their optimal values. As shown in Figure 4, the classification accuracy varies with the iterations, showing an increasing trend. However, the classification accuracy will decrease or remain stable with an increasing iteration after reaching its peak. This demonstrates that SFS-AGGL can reduce the impact of noise and redundant features and effectively overcome overfitting problems.
Second, the performance of different methods in different feature dimensions is shown in Figure 5. From Figure 5, we can find that the accuracy obtained by all methods is relatively lower when the feature dimensions are smaller. On the contrary, the performance of all methods gradually improves as the number of selected features increases. In most cases, the proposed SFS-AGGL outperforms the comparison methods, which indicates the stronger discriminative ability of the features selected by SFS-AGGL. However, the performance of some methods decreases as the number of selected features increases. This may be due to the presence of redundant or noisy information features in higher dimensions. Nevertheless, SFS-AGGL still surpasses the comparison methods in classification accuracy. The experimental results further validate the enhanced robustness of the features chosen by the SFS-AGGL method.
Third, the performance of the proposed SFS-AGGL with different values of the four balancing parameters α, β, θ, and λ on different datasets is tested. The classification results for each balance parameter are depicted in Figure 6. From Figure 6, the following conclusions can be drawn:
(1) The parameter α is used to control LP. The performance of SFS-AGGL is very sensitive to parameter α on different datasets.
(2) The parameter β affects the performance of AGL. SFS-AGGL achieves the best performance when β is set to 0.01 for the AR dataset and β is set to 0.1 for other datasets. In addition, the classification accuracy of SFS-AGGL on the ORL dataset is insensitive to different values of β. In contrast, the classification performance is very sensitive to the parameter β on other datasets. Therefore, β should be set to a smaller value to obtain better classification results.
(3) The parameter θ determines the significance of the sparse feature projection terms. The performance of SFS-AGGL is insensitive to parameter θ on the ORL, COIL20, and AR datasets, but it is very sensitive on the Extended YaleB and CMU PIE datasets.
(4) The parameter λ determines the importance of global and local constraint terms. SFS-AGGL achieves high accuracy on each dataset when the value of λ is small. However, the performance of SFS-AGGL decreases with increasing λ for the CMU PIE, Extended YaleB, and AR datasets. This indicates that there is significant variation among intraclass samples in these datasets. Therefore, λ should be set to a smaller value in the case of large differences between intraclass samples.
In summary, different values of the balancing parameters will have different effects on different datasets. The optimal parameter combinations for each dataset are listed in Table 6.
(2) Comparative analysis of classification performance
First, this section validates the classification performance of SFS-AGGL compared to other methods on the five image datasets. Table 7 presents the optimal average classification accuracy and the corresponding standard deviations for the different methods. The results in Table 7 show that: (1) SSFS methods outperform the UFS methods, which indicates that the guidance of a small number of labels is crucial to improving the performance; (2) the joint FS algorithms achieve better performance than the RLSR method, which indicates that the correlation information among features is important for improving the FS performance; (3) the semi-supervised methods RLSR and FDEFS are inferior to the other semi-supervised methods, which demonstrates that introducing the LP algorithm into semi-supervised methods is favorable for selecting discriminative features; (4) the proposed SFS-AGGL method outperforms the ASLCGLFS method, notably since it integrates global and local constraints into AGL. Therefore, it is beneficial to fully consider LP and AGL in the SSFS approach to improve performance.
Then, to demonstrate the superiority of SFS-AGGL, we employed one-tailed t-tests to determine whether SFS-AGGL significantly outperformed the comparison methods. The null hypothesis assumed that the results achieved by SFS-AGGL were equal to those of the comparison method, while the alternative hypothesis assumed that they were greater. For instance, in comparing SFS-AGGL with RLSR (SFS-AGGL vs. RLSR), the hypotheses are defined as H0: SFS-AGGL = RLSR and H1: SFS-AGGL > RLSR, where SFS-AGGL and RLSR represent the average classification results obtained by SFS-AGGL and RLSR on different datasets, respectively. The experiment sets a statistical significance level of 0.05, and Table 8 presents the p values of pairwise one-tailed t-tests on different datasets.
From Table 8, it can be seen that the performance of all methods is comparable on ORL and COIL datasets since these two datasets are relatively simple compared with other datasets, but the accuracy of our method is still slightly higher than that of other methods. Moreover, for AR, CMU PIE, and Extended YaleB databases, our method was able to significantly outperform the other comparative methods, indicating that our method is more advantageous in dealing with complex datasets.

4.3. Clustering Experiments

This section validates the effectiveness of the SFS-AGGL method for clustering tasks. For this purpose, we used the face dataset ORL and the object dataset COIL20, as well as two UCI datasets (Libras Movement and Landsat [64]) in the experiment.

4.3.1. Clustering Datasets

The Libras Movement dataset contains 15 gestures with a total of 360 samples and 89 attributes, while the Landsat dataset contains multispectral images of six different geographic regions with a total of 296 samples and 36 attributes. The details of all clustering datasets used are shown in Table 9.

4.3.2. Evaluation Metrics

Multiple metrics, such as ACC, NMI, purity, ARI, F-score, precision, and recall [65], are applied to evaluate the clustering performance.
ACC represents clustering accuracy, which is defined as:
$$ \mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\big(y_i, \mathrm{map}(\bar{y}_i)\big)}{n} $$
where $\delta(x, y) = 1$ if $x = y$ and $0$ otherwise, $n$ is the total number of samples, $y_i$ and $\bar{y}_i$ denote the ground-truth label and clustering label of the $i$-th sample, respectively, and $\mathrm{map}(\cdot)$ is a function that maps the learned clustering labels to align with the ground-truth labels.
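The definition above leaves the choice of map(·) open; a common implementation, assumed in the sketch below, matches clusters to ground-truth classes with the Hungarian (linear sum assignment) algorithm on the cluster-class contingency matrix. The function name clustering_accuracy is ours, not from the paper.

```python
# Sketch of ACC with map(.) realized via the Hungarian algorithm (an assumption;
# the paper does not specify the mapping procedure).
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency matrix: rows are predicted clusters, columns are true classes.
    count = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, k in enumerate(clusters):
        for j, c in enumerate(classes):
            count[i, j] = np.sum((y_pred == k) & (y_true == c))
    # Choose the cluster-to-class assignment that maximizes matched samples.
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / len(y_true)

print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0
```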
NMI is the normalized mutual information for clustering, which is defined as:
$$ \mathrm{NMI} = \frac{\mathrm{MI}(U, V)}{\sqrt{H(U)\,H(V)}} $$
where $\mathrm{MI}(U, V)$ denotes the mutual information between the two label assignments $U$ and $V$, and $H(U)$ and $H(V)$ denote their entropies. The normalization ensures fair comparisons between sets of different sizes.
ARI is the adjusted Rand index, which is defined as:
$$ \mathrm{ARI} = \frac{\mathrm{RI} - \mathrm{Expected\_RI}}{\mathrm{RI\_max} - \mathrm{Expected\_RI}} $$
where RI (Rand index) denotes the proportion of sample pairs that are clustered consistently with the ground truth; Expected_RI denotes the Rand index expected under random clustering; and RI_max denotes the maximum possible Rand index. This adjustment accounts for randomness, so the ARI ranges between −1 and 1, where a value closer to 1 indicates better clustering performance.
Purity measures the proportion of samples that fall into the dominant true category of their assigned cluster:
$$ \mathrm{Purity} = \frac{1}{N} \sum_{k} \max_{j} \left| C_k \cap G_j \right| $$
where $C_k$ denotes the $k$-th cluster, $G_j$ denotes the $j$-th true category, and $N$ denotes the total number of samples.
Precision reflects the ratio of correctly clustered positive samples to all samples identified as positive.
$$ \mathrm{Precision} = \frac{TP}{TP + FP} $$
Recall indicates the proportion of correctly clustered positive samples relative to all actual positive samples.
$$ \mathrm{Recall} = \frac{TP}{TP + FN} $$
F-score is the harmonic mean of precision and recall, providing a comprehensive assessment of both performance metrics.
$$ F\text{-}score = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$
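For reference, the sketch below computes purity and the pair-counting versions of precision, recall, and F-score that the definitions above suggest, and delegates NMI and ARI to scikit-learn; it is an illustrative implementation under those assumptions, not the authors' evaluation code.

```python
# Purity and pair-counting precision/recall/F-score from the class-cluster
# contingency matrix; NMI and ARI come from scikit-learn.
import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity_and_pairwise(y_true, y_pred):
    C = contingency_matrix(y_true, y_pred)      # rows: true classes, cols: clusters
    purity = C.max(axis=0).sum() / C.sum()      # dominant true class per cluster
    tp = comb(C, 2).sum()                       # same-class pairs grouped together
    pred_pairs = comb(C.sum(axis=0), 2).sum()   # pairs placed in the same cluster
    true_pairs = comb(C.sum(axis=1), 2).sum()   # pairs sharing the same true class
    precision, recall = tp / pred_pairs, tp / true_pairs
    f_score = 2 * precision * recall / (precision + recall)
    return purity, precision, recall, f_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(purity_and_pairwise(y_true, y_pred))
print(normalized_mutual_info_score(y_true, y_pred), adjusted_rand_score(y_true, y_pred))
```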

4.3.3. Experimental Setup for Clustering

In this experiment, the four balancing parameters (α, β, θ, and λ) were selected from the range {0.001, 0.01, 0.1, 1, 10, 100, 1000} for all datasets, while the number of selected features d was drawn from {50, 100, 150, 200, 250, 300, 350, 400, 450, 500} for the ORL and COIL20 datasets, {8, 16, 24, 32, 40, 48, 56, 64, 72, 80} for the Libras Movement dataset, and {3, 6, 9, 12, 15, 18, 21, 24, 27, 30} for the Landsat dataset.
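A minimal sketch of the resulting grid search is given below; the evaluate callable (which would run SFS-AGGL with a given configuration and return a clustering score) is a hypothetical placeholder for whichever pipeline is being tuned.

```python
# Exhaustive search over the parameter grid described above. `evaluate` is a
# placeholder scoring function, not part of the paper's code.
from itertools import product

PARAM_GRID = [0.001, 0.01, 0.1, 1, 10, 100, 1000]             # alpha, beta, theta, lambda
DIM_GRID = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]  # e.g., ORL / COIL20

def grid_search(evaluate, param_grid=PARAM_GRID, dims=DIM_GRID):
    best_score, best_cfg = float("-inf"), None
    for alpha, beta, theta, lam in product(param_grid, repeat=4):
        for d in dims:
            score = evaluate(alpha=alpha, beta=beta, theta=theta, lam=lam, d=d)
            if score > best_score:
                best_score, best_cfg = score, (alpha, beta, theta, lam, d)
    return best_cfg, best_score
```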

4.3.4. Analysis of Clustering Results

(1) Parameter sensitivity analysis of clustering
Figure 7 illustrates the clustering results of SFS-AGGL on the four datasets under varying parameter values. When the selected feature dimension is fixed, the clustering performance first increases, then decreases, and finally rises again as the parameter α grows. The performance of SFS-AGGL is sensitive to different parameter values on different datasets, which underscores the importance of tuning these values to achieve optimal clustering performance. Smaller values of the regularization parameters β and λ yield better overall performance across the datasets. This demonstrates that the proposed SFS-AGGL can not only acquire neighboring information in the projected feature space but also capture the global and local sparse structures in the original feature space, ultimately leading to good performance. On the COIL20 and Landsat datasets, the performance of SFS-AGGL first improves and then decreases as the regularization parameter λ increases, indicating that SFS-AGGL is more sensitive to the sparse structure learning on these datasets. In summary, setting all balancing parameters to smaller values enhances the clustering results of SFS-AGGL; furthermore, it is advisable to tailor the parameter values to each dataset to achieve optimal outcomes.
Figure 8 shows the clustering results obtained by sequentially setting each balancing parameter to different values while keeping all other conditions at optimal values. It can be found that the performance of SFS-AGGL is insensitive to all parameters in most cases. Notably, the clustering accuracy of SFS-AGGL on the ORL dataset is relatively sensitive to an increase in the parameter β. Therefore, it is recommended to set β to a larger value for optimizing clustering performance.
(2) Comparative analysis of clustering performance
In this experiment, the k-means method is adopted to cluster the low-dimensional features selected by each FS method. To minimize the impact of initialization on the k-means method, we performed 10 clustering experiments with varied random initializations. Table 10, Table 11, Table 12 and Table 13 display the average values and standard deviations of ACC, NMI, purity, ARI, F-score, precision, and recall for the RLSR, FDEFS, GS3FS, S2LFS, AGLRM, ASLCGLFS, and SFS-AGGL methods on the ORL, COIL20, Libras Movement, and Landsat datasets. These results further illustrate the superiority of the proposed SFS-AGGL compared to other comparative methods.
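The sketch below reproduces this protocol under our stated assumptions: X_selected holds the features retained by a selection method, clustering_accuracy is the ACC sketch given in Section 4.3.2, and k-means is re-run 10 times with different random initializations.

```python
# Run k-means 10 times with different random seeds on the selected features and
# report the mean and standard deviation of ACC (other metrics follow the same pattern).
import numpy as np
from sklearn.cluster import KMeans

def evaluate_clustering(X_selected, y_true, n_clusters, n_runs=10):
    scores = []
    for seed in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit_predict(X_selected)
        scores.append(clustering_accuracy(y_true, labels))  # ACC sketch from Section 4.3.2
    return float(np.mean(scores)), float(np.std(scores))
```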

4.4. Convergence and Runtime Analysis

In this section, experiments were performed on seven datasets to assess the convergence and runtime of the proposed SFS-AGGL method. Figure 9 shows the convergence curves of SFS-AGGL. From Figure 9, we can see that the objective function value of SFS-AGGL converges in fewer than 100 iterations, which validates the efficiency of the proposed iterative optimization method. Table 14 reports the runtime of SFS-AGGL when the number of iterations is set to 100 and the feature dimension is set to 500. The results in Table 14 indicate that the runtime of our proposed method is slightly higher than that of AGLRM but lower than that of the other methods. It is noteworthy that, after GPU optimization, the runtime of SFS-AGGL is lower than that of all comparative methods.
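The convergence behaviour reported here can be monitored with a simple stopping rule, sketched below; update_step and objective stand in for the multiplicative updates and objective value of SFS-AGGL and are placeholders rather than the actual implementation.

```python
# Iterate until the relative change of the objective falls below a tolerance or
# a maximum of 100 iterations is reached (matching the iteration budget above).
def run_until_converged(update_step, objective, state, max_iter=100, tol=1e-5):
    history = [objective(state)]
    for _ in range(max_iter):
        state = update_step(state)
        history.append(objective(state))
        if abs(history[-2] - history[-1]) <= tol * max(abs(history[-2]), 1e-12):
            break
    return state, history
```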

5. Conclusions and Discussion

This paper proposes a semi-supervised feature selection algorithm based on an adaptive graph with global and local constraints (SFS-AGGL). The algorithm considers the sample neighborhood structure within the projected feature space, dynamically learns the optimal nearest-neighbor graph among samples, and maintains the global and local sparse structures within the selected feature subset, thereby preserving the geometric structural information of the original data. Moreover, it can effectively leverage the structural distribution information of labeled data to infer label information for unlabeled samples. The incorporation of the L2,1 norm in the SFS model enhances its resilience to noisy features. An iterative optimization approach is employed to solve for the optimal variables, and its convergence is validated, confirming that the SFS-AGGL algorithm converges. Extensive experiments on real datasets validate the classification and clustering performance of the proposed SFS-AGGL method. Although our method achieves good performance, several issues still need to be pointed out, as follows:
1. Because the proposed method exploits the correlation and geometric structure of the data, it is best suited to data whose features are significantly correlated and whose distribution exhibits a clear local structure.
2. Because the method only considers the local and global structural information of the data, its applicability is limited on datasets in which such structure is weak.
3. As a linear feature selection method, it cannot effectively extract features from data with complex nonlinear structures.
To overcome the above-mentioned shortcomings, we plan to carry out the following work in the future:
1. We will introduce additional constraints to capture and represent the structural information of the data more comprehensively.
2. We will integrate deep learning into the feature selection process to extract effective features from highly unstructured data.

Author Contributions

Data curation, Y.Y., H.Z., N.Z., G.X. and X.H.; Formal analysis, X.H., H.Z., N.Z., X.H. and G.X.; Methodology, Y.Y., H.Z., W.Z. and C.Z.; Resources, Y.Y., H.Z., N.Z. and W.Z.; Supervision, W.Z. and C.Z.; Writing—original draft, Y.Y. and H.Z.; Writing—review and editing, Y.Y., H.Z., W.Z. and C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by grants from the National Natural Science Foundation of China (Nos. 62062040 and 62006174), the Outstanding Youth Project of Jiangxi Natural Science Foundation (No. 20212ACB212003), the Jiangxi Province Key Subject Academic and Technical Leader Funding Project (No. 20212BCJ23017), the Science and Technology Research Project of Jiangxi Provincial Department of Education (No. GJJ210330), and the Fund of the Jilin Provincial Science and Technology Department (No. 20220201157GX).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data were derived from public domain resources.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wen, J.; Yang, S.; Wang, C.D.; Jiang, Y.; Li, R. Feature-splitting Algorithms for Ultrahigh Dimensional Quantile Regression. J. Econom. 2023, 2023, 105426. [Google Scholar] [CrossRef]
  2. Lue, X.; Long, L.; Deng, R.; Meng, R. Image feature extraction based on fuzzy restricted Boltzmann machine. Measurement 2022, 204, 112063. [Google Scholar] [CrossRef]
  3. Sheikhpour, R.; Sarram, M.A.; Gharaghani, S.; Chahooki, M.A.Z. A survey on semi-supervised feature selection methods. Pattern Recognit. 2017, 64, 141–158. [Google Scholar] [CrossRef]
  4. Mafarja, M.; Qasem, A.; Heidari, A.A.; Aljarah, I.; Faris, H.; Mirjalili, S. Efficient hybrid nature-inspired binary optimizers for feature selection. Cogn. Comput. 2020, 12, 150–175. [Google Scholar] [CrossRef]
  5. Huang, G.Y.; Hung, C.Y.; Chen, B.W. Image feature selection based on orthogonal ℓ2,0 norms. Measurement 2022, 199, 111310. [Google Scholar] [CrossRef]
  6. Cai, J.; Luo, J.; Wang, S.; Yang, S. Feature selection in machine learning: A new perspective. Neurocomputing 2018, 300, 70–79. [Google Scholar] [CrossRef]
  7. Solorio-Fernández, S.; Carrasco-Ochoa, J.A.; Martínez-Trinidad, J.F. A systematic evaluation of filter Unsupervised Feature Selection methods. Expert Syst. Appl. 2020, 162, 113745. [Google Scholar] [CrossRef]
  8. Bhadra, T.; Bandyopadhyay, S. Supervised feature selection using integration of densest subgraph finding with floating forward–backward search. Inf. Sci. 2021, 566, 1–18. [Google Scholar] [CrossRef]
  9. Mann, G.S.; McCallum, A. Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data. J. Mach. Learn. Res. 2010, 11, 955–984. [Google Scholar]
  10. Hou, C.; Nie, F.; Li, X.; Yi, D.; Wu, Y. Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Trans. Cybern. 2013, 44, 793–804. [Google Scholar]
  11. Wang, L.; Jiang, S.; Jiang, S. A feature selection method via analysis of relevance, redundancy, and interaction. Expert Syst. Appl. 2021, 183, 115365. [Google Scholar] [CrossRef]
  12. Dokeroglu, T.; Deniz, A.; Kiziloz, H.E. A comprehensive survey on recent metaheuristics for feature selection. Neurocomputing 2022, 494, 2966. [Google Scholar] [CrossRef]
  13. Nie, F.; Zhu, W.; Li, X. Structured graph optimization for unsupervised feature selection. IEEE Trans. Knowl. Data Eng. 2019, 33, 1210–1222. [Google Scholar] [CrossRef]
  14. Zhao, Z.; Liu, H. Semi-supervised feature selection via spectral analysis. In Proceedings of the 2007 SIAM International Conference on Data Mining; Society for Industrial and Applied Mathematics, Minneapolis, MN, USA, 26–28 April 2007; pp. 641–646. [Google Scholar]
  15. Toğaçar, M.; Ergen, B.; Cömert, Z. Classification of flower species by using features extracted from the intersection of feature selection methods in convolutional neural network models. Measurement 2020, 158, 107703. [Google Scholar] [CrossRef]
  16. Chen, X.; Song, L.; Hou, Y.; Shao, G. Efficient semi-supervised feature selection for VHR remote sensing images. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 1500–1503. [Google Scholar]
  17. Peng, S.; Lu, J.; Cao, J.; Peng, Q.; Yang, Z. Adaptive graph regularization method based on least square regression for clustering. Signal Process. Image Commun. 2023, 114, 116938. [Google Scholar] [CrossRef]
  18. Chang, X.; Nie, F.; Yang, Y.; Huang, H. A convex formulation for semi-supervised multi-label feature selection. In Proceedings of the AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014; Volume 28. [Google Scholar]
  19. Chen, X.; Yuan, G.; Nie, F.; Huang, J.Z. Semi-supervised feature selection via rescaled linear regression. In Proceedings of the Twenty Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 1525–1531. [Google Scholar]
  20. Chen, X.; Chen, R.; Wu, Q.; Nie, F.; Yang, M.; Mao, R. Semi supervised feature selection via structured manifold learning. IEEE Trans. Cybern. 2021, 52, 5756–5766. [Google Scholar] [CrossRef]
  21. Liu, Z.; Lai, Z.; Ou, W.; Zhang, K.; Zheng, R. Structured optimal graph based sparse feature extraction for semi-supervised learning. Signal Process. 2020, 170, 107456. [Google Scholar] [CrossRef]
  22. Akbar, S.; Hayat, M.; Tahir, M.; Chong, K.T. cACP-2LFS: Classification of anticancer peptides using sequential discriminative model of KSAAP and two-level feature selection approach. IEEE Access 2020, 8, 131939–131948. [Google Scholar] [CrossRef]
  23. Bakir-Gungor, B.; Hacilar, H.; Jabeer, A.; Nalbantoglu, O.U.; Aran, O.; Yousef, M. Inflammatory bowel disease biomarkers of human gut microbiota selected via ensemble feature selection methods. PeerJ 2022, 10, e13205. [Google Scholar] [CrossRef]
  24. Ahmed, N.; Rafiq, J.I.; Islam, M.R. Enhanced human activity recognition based on smartphone sensor data using hybrid feature selection model. Sensors 2020, 20, 317. [Google Scholar] [CrossRef]
  25. López, D.; Ramírez-Gallego, S.; García, S.; Xiong, N.; Herrera, F. BELIEF: A distance-based redundancy-proof feature selection method for Big Data. Inf. Sci. 2021, 558, 124–139. [Google Scholar] [CrossRef]
  26. Chen, X.; Yuan, G.; Wang, W.; Nie, F.; Chang, X.; Huang, J.Z. Local adaptive projection framework for feature selection of labeled and unlabeled data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 6362–6373. [Google Scholar] [CrossRef] [PubMed]
  27. Cheng, B.; Yang, J.; Yan, S.; Fu, Y.; Huang, T.S. Learning with l1-graph for image analysis. IEEE Trans. Image Process. 2009, 19, 858–866. [Google Scholar] [CrossRef]
  28. Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 171–184. [Google Scholar] [CrossRef]
  29. Singh, R.P.; Ojha, D.; Jadon, K.S. A Survey on Various Representation Learning of Hypergraph for Unsupervised Feature Selection. In Data, Engineering and Applications: Select Proceedings of IDEA 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 71–82. [Google Scholar]
  30. Elhamifar, E.; Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2765–2781. [Google Scholar] [CrossRef]
  31. Zhong, G.; Pun, C.M. Subspace clustering by simultaneously feature selection and similarity learning. Knowl. Based Syst. 2020, 193, 105512. [Google Scholar] [CrossRef]
  32. Wan, Y.; Sun, S.; Zeng, C. Adaptive similarity embedding for unsupervised multi-view feature selection. IEEE Trans. Knowl. Data Eng. 2020, 33, 3338–3350. [Google Scholar] [CrossRef]
  33. Shang, R.; Song, J.; Jiao, L.; Li, Y. Double feature selection algorithm based on low-rank sparse non-negative matrix factorization. Int. J. Mach. Learn. Cybern. 2020, 11, 1891–1908. [Google Scholar] [CrossRef]
  34. Zhu, J.; Jang-Jaccard, J.; Liu, T.; Zhou, J. Joint spectral clustering based on optimal graph and feature selection. Neural Process. Lett. 2021, 53, 257–273. [Google Scholar] [CrossRef]
  35. Sha, Y.; Faber, J.; Gou, S.; Liu, B.; Li, W.; Schramm, S.; Stoecker, H.; Steckenreiter, T.; Vnucec, D.; Wetzstein, N.; et al. An acoustic signal cavitation detection framework based on XGBoost with adaptive selection feature engineering. Measurement 2022, 192, 110897. [Google Scholar] [CrossRef]
  36. Zhu, P.; Hou, X.; Tang, K.; Liu, Y.; Zhao, Y.P.; Wang, Z. Unsupervised feature selection through combining graph learning and ℓ2, 0-norm constraint. Inf. Sci. 2023, 622, 68–82. [Google Scholar] [CrossRef]
  37. Mei, S.; Zhao, W.; Gao, Q.; Yang, M.; Gao, X. Joint feature selection and optimal bipartite graph learning for subspace clustering. Neural Netw. 2023, 164, 408–418. [Google Scholar] [CrossRef] [PubMed]
  38. Zhou, P.; Du, L.; Li, X.; Shen, Y.D.; Qian, Y. Unsupervised feature selection with adaptive multiple graph learning. Pattern Recognit. 2020, 105, 107375. [Google Scholar] [CrossRef]
  39. Bai, X.; Zhu, L.; Liang, C.; Li, J.; Nie, X.; Chang, X. Multi-view feature selection via nonnegative structured graph learning. Neurocomputing 2020, 387, 110–122. [Google Scholar] [CrossRef]
  40. Zhou, P.; Chen, J.; Du, L.; Li, X. Balanced spectral feature selection. IEEE Trans. Cybern. 2022, 53, 4232–4244. [Google Scholar] [CrossRef]
  41. Miao, J.; Yang, T.; Sun, L.; Fei, X.; Niu, L.; Shi, Y. Graph regularized locally linear embedding for unsupervised feature selection. Pattern Recognit. 2022, 122, 108299. [Google Scholar] [CrossRef]
  42. Xie, G.B.; Chen, R.B.; Lin, Z.Y.; Gu, G.S.; Yu, J.R.; Liu, Z.; Cui, J.; Lin, L.; Chen, L. Predicting lncRNA–disease associations based on combining selective similarity matrix fusion and bidirectional linear neighborhood label propagation. Brief. Bioinform. 2023, 24, bbac595. [Google Scholar] [CrossRef]
  43. Sheikhpour, R.; Sarram, M.A.; Gharaghani, S.; Chahooki, M.A.Z. A robust graph-based semi-supervised sparse feature selection method. Inf. Sci. 2020, 531, 13–30. [Google Scholar] [CrossRef]
  44. Li, Z.; Tang, J. Semi-supervised local feature selection for data classification. Sci. China Inf. Sci. 2021, 64, 192108. [Google Scholar] [CrossRef]
  45. Jiang, B.; Wu, X.; Zhou, X.; Liu, Y.; Cohn, A.G.; Sheng, W.; Chen, H. Semi-supervised multiview feature selection with adaptive graph learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–15. [Google Scholar] [CrossRef]
  46. Shang, R.; Zhang, X.; Feng, J.; Li, Y.; Jiao, L. Sparse and low-dimensional representation with maximum entropy adaptive graph for feature selection. Neurocomputing 2022, 485, 57–73. [Google Scholar] [CrossRef]
  47. Lai, J.; Chen, H.; Li, T.; Yang, X. Adaptive graph learning for semi-supervised feature selection with redundancy minimization. Inf. Sci. 2022, 609, 465–488. [Google Scholar] [CrossRef]
  48. Lai, J.; Chen, H.; Li, W.; Li, T.; Wan, J. Semi-supervised feature selection via adaptive structure learning and constrained graph learning. Knowl.-Based Syst. 2022, 251, 109243. [Google Scholar] [CrossRef]
  49. Luo, T.; Hou, C.; Nie, F.; Tao, H.; Yi, D. Semi-supervised feature selection via insensitive sparse regression with application to video semantic recognition. IEEE Trans. Knowl. Data Eng. 2018, 30, 1943–1956. [Google Scholar] [CrossRef]
  50. Moosaei, H.; Hladík, M. Sparse solution of least-squares twin multi-class support vector machine using ℓ0 and ℓp-norm for classification and feature selection. Neural Netw. 2023, 166, 471–486. [Google Scholar] [CrossRef]
  51. Favati, P.; Lotti, G.; Menchi, O.; Romani, F. Construction of the similarity matrix for the spectral clustering method: Numerical experiments. J. Comput. Appl. Math. 2020, 375, 112795. [Google Scholar] [CrossRef]
  52. Qu, J.; Zhao, X.; Xiao, Y.; Chang, X.; Li, Z.; Wang, X. Adaptive Manifold Graph representation for Two-Dimensional Discriminant Projection. Knowl.-Based Syst. 2023, 266, 110411. [Google Scholar] [CrossRef]
  53. Ma, Z.; Wang, J.; Li, H.; Huang, Y. Adaptive graph regularized non-negative matrix factorization with self-weighted learning for data clustering. Appl. Intell. 2023, 53, 28054–28073. [Google Scholar] [CrossRef]
  54. Yang, S.; Wen, J.; Zhan, X.; Kifer, D. ET-lasso: A new efficient tuning of lasso-type regularization for high-dimensional data. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 607–616. [Google Scholar]
  55. Huang, S.; Xu, Z.; Wang, F. Nonnegative matrix factorization with adaptive neighbors. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 486–493. [Google Scholar]
  56. Zhou, W.; Wu, C.; Yi, Y.; Luo, G. Structure preserving non-negative feature self-representation for unsupervised feature selection. IEEE Access 2017, 5, 8792–8803. [Google Scholar] [CrossRef]
  57. Shang, R.; Zhang, W.; Lu, M.; Jiao, L.; Li, Y. Feature selection based on non-negative spectral feature learning and adaptive rank constraint. Knowl.-Based Syst. 2022, 236, 107749. [Google Scholar] [CrossRef]
  58. Martinez, A.; Benavente, R. The AR Face Database: CVC Technical Report; Computer Vision Center: Barcelona, Spain, 1998; Volume 24. [Google Scholar]
  59. Sim, T.; Baker, S.; Bsat, M. The CMU pose, illumination, and expression (PIE) database. In Proceedings of the Fifth IEEE International Conference on Automatic Face Gesture Recognition, Washington, DC, USA, 20–21 May 2002; pp. 53–58. [Google Scholar]
  60. Zhang, L.; Zhang, L.; Zhang, D.; Zhu, H. Online finger-knuckle-print verification for personal authentication. Pattern Recognit. 2010, 43, 2560–2571. [Google Scholar] [CrossRef]
  61. Samaria, F.S.; Harter, A.C. Parameterisation of a stochastic model for human face identification. In Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, Seattle, WA, USA, 21–23 June 1994; pp. 138–142. [Google Scholar]
  62. Nene, S.A.; Nayar, S.K.; Murase, H. Columbia Object Image Library (COIL-20); Columbia University: New York, NY, USA, 1996. [Google Scholar]
  63. Yi, Y.; Lai, S.; Li, S.; Dai, J.; Wang, W.; Wang, J. RRNMF-MAGL: Robust regularization non-negative matrix factorization with multi-constraint adaptive graph learning for dimensionality reduction. Inf. Sci. 2023, 640, 119029. [Google Scholar] [CrossRef]
  64. Blake, C.L.; Merz, C.J. UCI Repository of Machine Learning Databases; Department of Information and Computer Science, University of California: Irvine, CA, USA, 1998; p. 55. [Google Scholar]
  65. Li, Z.; Tang, C.; Zheng, X.; Liu, X.; Zhang, W.; Zhu, E. High-order correlation preserved incomplete multi-view subspace clustering. IEEE Trans. Image Process. 2022, 31, 2067–2080. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The illustration of the SFS-AGGL framework.
Figure 2. Flow chart of SFS-AGGL algorithm.
Figure 3. Sample images of five datasets.
Figure 4. Classification accuracy of SFS-AGGL under different iterations.
Figure 5. Classification accuracy of different methods under different feature dimensions.
Figure 6. Classification results of SFS-AGGL under different parameter values.
Figure 7. Clustering results of SFS-AGGL under different parameter values and different feature dimensions, where different colors represent different feature dimensions.
Figure 8. Clustering results of SFS-AGGL under different parameter values.
Figure 9. Convergence curves of SFS-AGGL on different datasets.
Table 1. Definition of the main symbols in this paper.
Notation | Description | Notation | Description
$X \in R^{d \times n}$ | Sample matrix | $0 \in R^{u \times c}$ | Zero matrix
$X_l \in R^{d \times l}$ | Labeled sample matrix | $d$ | Sample dimension
$X_u \in R^{d \times (n-l)}$ | Unlabeled sample matrix | $n$ | Sample size
$Y \in R^{n \times c}$ | Label matrix | $k$ | Number of selected features
$F \in R^{n \times c}$ | Predictive labeling matrix | $c$ | Number of categories
$S \in R^{n \times n}$ | Weighting matrix | $l$ | Number of label samples
$E \in R^{n \times n}$ | Local adaptation matrix | $+\infty$ | Infinitely large numbers
$W \in R^{d \times c}$ | Weighting matrix | $\odot$ | Matrix dot product
$I \in R^{u \times u}$ | Unit matrix | $tr(\cdot)$ | Traces of matrix
Table 2. The time complexity of each matrix in our proposed algorithm.
Matrix | Formula | Time Complexity
$U$ | $U = [u_{ii}] \in R^{n \times n}$ | $O(n^2)$
$E$ | $E = [e_{ij}] \in R^{n \times n}$ | $O(kn^2)$
$W$ | $W_{ij} = W_{ij} \frac{[XF + 2\beta XS^{T}X^{T}W]_{ij}}{[XX^{T}W + \beta XX^{T}W + \beta XSS^{T}X^{T}W + \theta HW]_{ij}}$ | $O(cmn)$
$F$ | $F_{ij} = F_{ij} \frac{[X^{T}W + YU^{T}]_{ij}}{[F + \alpha LF + FU]_{ij}}$ | $O(cmn)$
$S$ | $S_{ij} = S_{ij} \frac{[\alpha FF^{T} + 2\beta X^{T}WW^{T}X]_{ij}}{[2\beta X^{T}WW^{T}XS + \lambda E]_{ij}}$ | $O(cn^2)$
Table 3. Computational complexity of each iteration for FS methods.
Method | Number of Variables | Algorithm Complexity
RLSR [19] | 2 | $O(iter \times \max(ndc, n^3))$
FDEFS [49] | 3 | $O(\max(cmn, cn^2))$
GS3FS [43] | 4 | $O(iter \times \max(d^3, n^3))$
S2LFS [44] | 3 | $O(cd^2n + cd^3 + cn^2)$
AGLRM [47] | 4 | $O(iter \times \max(d^3, n^3))$
ASLCGLFS [48] | 4 | $O(iter \times \max(n^3, d^3))$
SFS-AGGL | 3 | $O(\max(kn^2, cn^2) + iter \times \max(cmn, n^2))$
Table 4. First- and second-order derivatives of each formula.
Variable | Derivative | Expression
$W$ | $\psi_{ij}(W_{ij})$ | $[X^{T}WW^{T}X - 2FW^{T}X + \beta W^{T}XX^{T}W - 2\beta W^{T}XS^{T}X^{T}W + \beta W^{T}XSS^{T}X^{T}W + \theta W^{T}HW]_{ij}$
 | $\psi'_{ij}(W_{ij})$ | $2[XX^{T}W - XF + \beta XX^{T}W - 2\beta XS^{T}X^{T}W + \beta XSS^{T}X^{T}W + \theta HW]_{ij}$
 | $\psi''_{ij}(W_{ij})$ | $2[XX^{T} + \beta XX^{T} - 2\beta XSX^{T} + \beta XSS^{T}X^{T} + \theta H^{T}]_{ii}$
$F$ | $\psi_{ij}(F_{ij})$ | $[-2FW^{T}X + FF^{T} + \alpha F^{T}LF + FUF^{T} - 2FUY^{T}]_{ij}$
 | $\psi'_{ij}(F_{ij})$ | $2[-X^{T}W + F + \alpha LF + FU - YU^{T}]_{ij}$
 | $\psi''_{ij}(F_{ij})$ | $2[E + \alpha L^{T} + U^{T}]_{ii}$
$S$ | $\psi_{ij}(S_{ij})$ | $[\alpha F^{T}SF - 2\beta W^{T}XS^{T}X^{T}W + \beta W^{T}XSS^{T}X^{T}W + \lambda SE]_{ij}$
 | $\psi'_{ij}(S_{ij})$ | $[-\alpha FF^{T} - 2\beta X^{T}WW^{T}X + 2\beta X^{T}WW^{T}XS + \lambda E]_{ij}$
 | $\psi''_{ij}(S_{ij})$ | $2[2\beta X^{T}WW^{T}X]_{ii}$
Table 5. Details of the five image datasets.
Dataset | Size of Image | Size of Classes | Size per Class | P1 | P2 | Type
AR | 32 × 32 | 100 | 14 | 7 | 7 | Face
CMU PIE | 32 × 32 | 68 | 24 | 12 | 12 | Face
Extended YaleB | 32 × 32 | 38 | 64 | 20 | 44 | Face
ORL | 32 × 32 | 40 | 10 | 7 | 3 | Face
COIL20 | 32 × 32 | 20 | 72 | 20 | 52 | Object
Table 6. Optimal parameter combination for SFS-AGGL on the five datasets.
Dataset | {d, t, α, β, θ, λ}
AR | {400, 200, 1, 0.01, 0.01, 0.001}
CMU PIE | {200, 200, 10, 0.1, 0.001, 0.001}
Extended YaleB | {300, 100, 10, 0.1, 10, 0.001}
ORL | {500, 100, 1000, 0.1, 0.001, 0.001}
COIL20 | {150, 100, 0.1, 0.1, 10, 1}
Table 7. Best results of each method on five image datasets (ACC).
Method | AR | CMU PIE | Extended YaleB | ORL | COIL20
NNSAFS | 63.90 ± 2.12 (400) | 85.29 ± 0.64 (500) | 62.58 ± 1.39 (300) | 92.17 ± 1.81 (500) | 93.56 ± 1.32 (100)
SPNFSR | 64.50 ± 0.91 (200) | 86.22 ± 1.02 (300) | 64.02 ± 1.95 (300) | 92.83 ± 1.81 (500) | 94.21 ± 1.42 (400)
RLSR | 64.37 ± 1.58 (500) | 84.66 ± 1.25 (500) | 64.57 ± 0.87 (300) | 95.67 ± 1.75 (500) | 93.73 ± 1.25 (500)
FDEFS | 63.51 ± 1.29 (500) | 85.85 ± 1.06 (500) | 65.01 ± 1.00 (500) | 96.25 ± 1.37 (500) | 94.35 ± 1.42 (450)
GS3FS | 63.90 ± 1.37 (450) | 85.83 ± 0.68 (500) | 61.85 ± 1.18 (500) | 96.25 ± 1.48 (450) | 93.38 ± 1.35 (500)
S2LFS | 64.20 ± 1.48 (500) | 87.50 ± 0.85 (500) | 64.67 ± 0.82 (500) | 96.42 ± 1.62 (400) | 94.95 ± 1.18 (500)
AGLRM | 64.39 ± 1.58 (450) | 86.90 ± 0.72 (450) | 61.89 ± 1.13 (500) | 96.08 ± 1.42 (500) | 95.09 ± 1.48 (200)
ASLCGLFS | 67.07 ± 1.62 (250) | 87.71 ± 1.12 (150) | 64.36 ± 1.19 (400) | 96.25 ± 1.37 (500) | 95.31 ± 1.13 (100)
SFS-AGGL | 68.03 ± 1.58 (400) | 88.97 ± 1.11 (200) | 66.35 ± 1.22 (300) | 96.42 ± 1.31 (500) | 95.80 ± 1.16 (500)
Numbers in parentheses denote the feature dimensions yielding the optimal results.
Table 8. p values of the pairwise one-tailed t-tests on five image datasets.
Method | AR | CMU PIE | Extended YaleB | ORL | COIL20
RLSR vs. SFS-AGGL | 3.14 × 10−5 | 4.40 × 10−8 | 6.97 × 10−4 | 7.03 × 10−1 | 6.24 × 10−4
FDEFS vs. SFS-AGGL | 7.82 × 10−7 | 1.03 × 10−6 | 0.74 × 10−2 | 9.44 × 10−1 | 1.12 × 10−2
GS3FS vs. SFS-AGGL | 3.36 × 10−6 | 6.90 × 10−8 | 5.95 × 10−8 | 9.36 × 10−1 | 2.14 × 10−4
S2LFS vs. SFS-AGGL | 1.29 × 10−5 | 1.10 × 10−3 | 9.62 × 10−4 | 9.53 × 10−1 | 6.23 × 10−2
AGLRM vs. SFS-AGGL | 1.55 × 10−5 | 1.96 × 10−5 | 5.10 × 10−8 | 8.84 × 10−1 | 1.23 × 10−1
ASLCGLFS vs. SFS-AGGL | 9.87 × 10−2 | 7.50 × 10−3 | 8.17 × 10−4 | 9.44 × 10−1 | 1.76 × 10−1
Table 9. Details of four clustering datasets.
Dataset | Number of Samples | Dimension | Category
ORL | 400 | 1024 | 40
COIL20 | 1440 | 1024 | 20
Libras Movement | 360 | 89 | 15
Landsat | 296 | 36 | 6
Table 10. The best clustering results of different methods on ORL dataset.
Method | ACC | NMI | Purity | ARI | F-Score | Precision | Recall
RLSR | 62.79 ± 2.89 (500) | 81.04 ± 1.83 (500) | 66.93 ± 2.19 (500) | 49.88 ± 3.78 (100) | 51.13 ± 3.64 (100) | 44.28 ± 4.33 (100) | 60.75 ± 2.65 (500)
FDEFS | 62.82 ± 3.69 (200) | 81.27 ± 1.59 (100) | 67.25 ± 3.08 (100) | 50.13 ± 3.71 (100) | 51.37 ± 3.60 (100) | 44.55 ± 3.85 (100) | 60.88 ± 3.64 (50)
GS3FS | 62.21 ± 1.55 (50) | 80.99 ± 0.74 (50) | 66.18 ± 1.33 (50) | 49.86 ± 1.58 (50) | 51.11 ± 1.53 (50) | 44.17 ± 1.79 (50) | 60.79 ± 2.32 (150)
S2LFS | 61.93 ± 3.35 (350) | 80.62 ± 1.45 (350) | 66.82 ± 2.37 (350) | 48.55 ± 3.61 (350) | 49.82 ± 3.51 (350) | 43.55 ± 3.60 (350) | 58.74 ± 4.44 (400)
AGLRM | 64.21 ± 3.70 (50) | 81.84 ± 1.89 (50) | 68.00 ± 3.14 (50) | 51.16 ± 4.40 (50) | 52.36 ± 4.26 (50) | 45.80 ± 4.72 (50) | 61.29 ± 4.00 (50)
ASLCGLFS | 58.32 ± 3.68 (250) | 78.56 ± 2.33 (250) | 63.32 ± 3.10 (250) | 44.22 ± 5.03 (250) | 45.62 ± 4.85 (250) | 39.37 ± 5.58 (300) | 54.62 ± 3.95 (300)
SFS-AGGL | 67.96 ± 2.30 (250) | 84.17 ± 1.50 (400) | 71.89 ± 1.94 (500) | 56.89 ± 3.34 (400) | 57.95 ± 3.25 (400) | 50.69 ± 3.47 (400) | 67.71 ± 3.11 (400)
Table 11. The best clustering results of different methods on COIL20 dataset.
Method | ACC | NMI | Purity | ARI | F-Score | Precision | Recall
RLSR | 60.45 ± 3.98 (150) | 72.19 ± 2.00 (250) | 63.27 ± 3.19 (50) | 50.84 ± 3.42 (300) | 53.42 ± 3.19 (300) | 48.67 ± 3.97 (300) | 59.38 ± 2.56 (50)
FDEFS | 58.52 ± 3.44 (50) | 70.67 ± 2.62 (400) | 61.55 ± 3.23 (50) | 48.14 ± 4.12 (50) | 50.93 ± 3.85 (50) | 45.32 ± 4.36 (50) | 58.42 ± 3.27 (150)
GS3FS | 59.98 ± 2.85 (150) | 72.31 ± 1.41 (250) | 63.38 ± 2.52 (150) | 50.23 ± 1.84 (250) | 52.86 ± 1.70 (250) | 47.65 ± 2.60 (250) | 59.47 ± 1.61 (250)
S2LFS | 58.42 ± 3.90 (250) | 70.19 ± 3.15 (250) | 61.58 ± 3.61 (250) | 46.40 ± 5.23 (450) | 49.41 ± 4.74 (450) | 42.64 ± 6.35 (250) | 59.72 ± 2.47 (450)
AGLRM | 59.85 ± 4.34 (150) | 72.17 ± 2.59 (150) | 63.17 ± 4.12 (150) | 50.03 ± 4.46 (150) | 52.71 ± 4.15 (150) | 47.16 ± 5.17 (150) | 60.05 ± 3.18 (300)
ASLCGLFS | 60.02 ± 3.59 (50) | 71.52 ± 1.67 (50) | 62.95 ± 3.35 (50) | 50.05 ± 2.56 (50) | 52.67 ± 2.34 (50) | 48.11 ± 3.81 (100) | 58.55 ± 1.90 (50)
SFS-AGGL | 61.88 ± 3.70 (350) | 73.30 ± 1.78 (500) | 64.67 ± 3.62 (350) | 52.37 ± 1.80 (500) | 54.85 ± 1.69 (500) | 50.36 ± 3.16 (200) | 62.16 ± 2.28 (500)
Table 12. The best clustering results of different methods on Libras Movement dataset.
Method | ACC | NMI | Purity | ARI | F-Score | Precision | Recall
RLSR | 47.50 ± 2.21 (40) | 60.07 ± 2.13 (56) | 50.00 ± 1.85 (56) | 30.04 ± 2.88 (56) | 34.82 ± 2.69 (56) | 31.38 ± 2.56 (56) | 39.22 ± 3.70 (56)
FDEFS | 46.33 ± 3.22 (32) | 60.36 ± 2.88 (32) | 50.22 ± 2.60 (24) | 30.73 ± 3.77 (72) | 35.58 ± 3.41 (72) | 31.60 ± 3.65 (32) | 41.28 ± 3.95 (72)
GS3FS | 46.72 ± 3.24 (80) | 60.57 ± 1.97 (56) | 50.94 ± 2.24 (80) | 31.20 ± 2.68 (56) | 36.01 ± 2.53 (56) | 31.79 ± 2.32 (80) | 41.93 ± 3.79 (56)
S2LFS | 46.56 ± 1.92 (80) | 59.95 ± 1.12 (64) | 50.72 ± 1.34 (64) | 30.28 ± 1.80 (80) | 35.13 ± 1.82 (80) | 30.97 ± 1.17 (80) | 40.82 ± 4.28 (80)
AGLRM | 46.00 ± 2.89 (56) | 60.35 ± 1.32 (56) | 50.72 ± 1.87 (72) | 30.80 ± 1.92 (56) | 35.62 ± 1.88 (56) | 31.42 ± 1.45 (56) | 41.33 ± 4.05 (56)
ASLCGLFS | 46.28 ± 3.41 (40) | 59.72 ± 2.30 (40) | 50.17 ± 2.33 (40) | 29.93 ± 2.97 (40) | 34.84 ± 2.77 (40) | 30.60 ± 2.68 (40) | 41.04 ± 4.67 (80)
SFS-AGGL | 49.22 ± 2.88 (72) | 62.33 ± 2.34 (72) | 53.11 ± 2.70 (72) | 33.04 ± 2.65 (56) | 37.75 ± 2.46 (56) | 33.57 ± 3.13 (72) | 44.42 ± 4.83 (80)
Table 13. The best clustering results of different methods on Landsat dataset.
Method | ACC | NMI | Purity | ARI | F-Score | Precision | Recall
RLSR | 48.30 ± 2.10 (6) | 45.97 ± 1.02 (18) | 50.59 ± 1.88 (18) | 33.94 ± 1.46 (27) | 47.53 ± 1.91 (27) | 38.53 ± 1.49 (3) | 63.72 ± 8.11 (27)
FDEFS | 47.89 ± 2.65 (30) | 45.68 ± 1.59 (30) | 50.60 ± 2.30 (30) | 33.49 ± 1.83 (30) | 47.10 ± 1.99 (24) | 38.12 ± 1.25 (30) | 63.03 ± 6.57 (18)
GS3FS | 49.10 ± 1.88 (15) | 46.34 ± 1.05 (30) | 51.37 ± 1.82 (21) | 34.02 ± 1.33 (21) | 47.69 ± 1.56 (21) | 38.23 ± 1.33 (30) | 64.44 ± 6.57 (21)
S2LFS | 47.81 ± 2.63 (15) | 45.99 ± 1.27 (30) | 49.86 ± 2.53 (15) | 34.03 ± 0.82 (15) | 47.46 ± 1.24 (15) | 38.83 ± 1.60 (30) | 62.37 ± 7.02 (15)
AGLRM | 49.06 ± 3.14 (9) | 45.83 ± 1.40 (15) | 50.97 ± 3.01 (9) | 34.03 ± 1.62 (27) | 47.23 ± 2.13 (27) | 39.39 ± 1.48 (15) | 60.78 ± 8.67 (27)
ASLCGLFS | 48.79 ± 1.78 (30) | 46.51 ± 1.41 (18) | 50.59 ± 2.03 (24) | 34.77 ± 0.83 (21) | 48.33 ± 0.99 (21) | 38.67 ± 1.17 (30) | 65.35 ± 4.89 (21)
SFS-AGGL | 51.02 ± 1.99 (12) | 47.21 ± 1.18 (18) | 52.81 ± 1.76 (12) | 35.49 ± 1.23 (27) | 49.04 ± 1.23 (18) | 40.17 ± 1.32 (30) | 69.26 ± 4.51 (15)
The numbers in parentheses denote the feature dimensions that yield the optimal results.
Table 14. Runtime (s) of different methods on different datasets.
Method | AR | Extended YaleB | CMU PIE | ORL | COIL20
RLSR | 28.3706 | 28.8286 | 27.0407 | 27.0545 | 23.2237
FDEFS | 154.3282 | 190.8570 | 210.9357 | 55.6699 | 71.4754
GS3FS | 41.5550 | 44.3966 | 49.7174 | 28.2427 | 27.0868
S2LFS | 52.7492 | 52.1703 | 54.4472 | 50.3727 | 47.9846
AGLRM | 11.7974 | 13.1744 | 15.9342 | 3.4292 | 4.5080
ASLCGLFS | 2690.2222 | 2477.4907 | 3735.1528 | 126.7054 | 340.7762
SFS-AGGL | 15.5277 | 13.7843 | 17.4384 | 5.5204 | 6.7449
SFS-AGGL (GPU) | 10.2069 | 9.3128 | 11.0843 | 3.3713 | 3.9999