Article

SFS-AGGL: Semi-Supervised Feature Selection Integrating Adaptive Graph with Global and Local Information

1 School of Software, Jiangxi Normal University, Nanchang 330022, China
2 College of Computer Science, Shenyang Aerospace University, Shenyang 110136, China
3 College of Information Science and Technology, Northeast Normal University, Changchun 130117, China
* Authors to whom correspondence should be addressed.
Information 2024, 15(1), 57; https://doi.org/10.3390/info15010057
Submission received: 1 November 2023 / Revised: 30 December 2023 / Accepted: 14 January 2024 / Published: 17 January 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

As the feature dimension of data continues to grow, the task of selecting an optimal subset of features from a pool of limited labeled data and extensive unlabeled data becomes increasingly challenging. In recent years, some semi-supervised feature selection (SSFS) methods have been proposed to select a subset of features, but they still have drawbacks that limit their performance; for example, many SSFS methods underutilize the structural distribution information available within labeled and unlabeled data. To address this issue, we propose a semi-supervised feature selection method based on an adaptive graph with global and local constraints (SFS-AGGL) in this paper. Specifically, we first designed an adaptive graph learning mechanism that considers both the global and local information of samples to effectively learn and retain the geometric structural information of the original dataset. Secondly, we constructed a label propagation technique integrated with the adaptive graph learning in SFS-AGGL to fully utilize the structural distribution information of both labeled and unlabeled data. The proposed SFS-AGGL method is validated through classification and clustering tasks across various datasets. The experimental results demonstrate its superiority over existing benchmark methods, particularly in terms of clustering performance.

1. Introduction

High-dimensional data can describe real-world objects more realistically and effectively. However, such data might include a vast amount of redundant and irrelevant information. Processing these data directly not only consumes a large amount of storage space and computational resources but also degrades the performance of existing models [1]. Therefore, it is necessary to mine the potential relationships within the data to select and learn useful feature information.
Feature representation learning (FRL) is one of the most effective methods of learning useful feature information. Among the existing FRL methods, feature extraction (FE) [2] and feature selection (FS) [3] are two representative methods. FE aims to map the original high-dimensional feature space to a low-dimensional subspace according to some pre-defined criteria [4]. FS selects an optimal feature subset from the original feature set based on evaluation metrics [5]. In comparison, FS is more interpretable than FE since it can remove irrelevant and redundant features from the original features and retain a small number of relevant features. Therefore, FS is widely used in image classification, bioinformatics, face recognition, medical image analysis, natural language processing, and other fields [6].
FS methods can be divided into unsupervised feature selection (UFS), supervised feature selection (SFS), and semi-supervised feature selection (SSFS). UFS methods achieve feature selection by using only unlabeled data; they have received widespread attention since they do not require any labeled data. However, the lack of label-guided learning in UFS methods leads to poor performance in practical application tasks [7]. Thus, SFS methods have been devised to leverage the label information of the samples to guide the process of FS, enhancing the distinctiveness of the selected features and consequently improving the performance of classification and clustering [8]. However, obtaining ample labeled data is very challenging and time consuming in practical situations. For this reason, many SSFS methods have been proposed in the past decades. SSFS methods employ semi-supervised learning (SSL) to leverage the information of limited labeled data and a substantial volume of unlabeled data, enhancing the feature selection ability of the model [9]. The existing SSFS methods can be classified into filter, wrapper, and embedded methods [10]. Filter methods first evaluate each feature based on the principles of statistics or information theory and then perform FS according to the calculated weights. A major benefit of filter methods is that they are applicable to large-scale datasets thanks to their high speed and computational efficiency. However, filter methods may ignore the redundant information generated by combinations of multiple features [11]. Thus, some wrapper approaches have been proposed to exploit the interrelationships of features and mine the best combination of features. However, these approaches have high computational complexity, which makes them unsuitable for processing large-scale data [12]. In contrast to the above-mentioned methods, embedded methods combine FS and model training: FS is automatically executed during the process of model training, which allows embedded methods to improve the efficiency of FS by reducing runtime [13]. Therefore, they have become mainstream and are widely used in various scenarios.
In recent years, several semi-supervised embedded feature selection (SSEFS) methods have emerged. For example, Zhao et al. [14] introduced an SSFS method using both labeled and unlabeled data. Sparse regularization has been recognized as an effective strategy for selecting useful features and reducing the dimensionality of feature representations [15]. Chen et al. [16] introduced an efficient semi-supervised feature selection (ESFS) method. ESFS first combines SSL and sparse regularization to obtain feature subsets. Then, it uses probability matrices of unlabeled data to measure the relevance of features to the class, aiming to identify the globally optimal feature subset. Least squares regression (LSR), which is supported by a complete statistical theory, can handle noisy data effectively and thus improve computational efficiency [17]. Therefore, Chang et al. [18] proposed a convex sparse feature selection (CSFS) method based on LSR, which employs convex optimization theory to fit samples and predict labels, selecting the most critical features using constraint terms. Chen et al. [19] contended that LSR-based feature selection lacks interpretability and struggles to identify a global sparse solution. Hence, they proposed an embedded SSFS method based on rescaled linear regression, which exploits the L21 norm to obtain both global and sparse solutions. Moreover, they also introduced a sparse regularization with an implicit L2p norm to obtain sparser and more interpretable solutions [20]. This approach effectively constrains the regression coefficients, achieving feature ranking. In addition, Liu et al. [21] combined sparse features and considered the correlation of samples in the original high- and low-dimensional spaces to improve the performance of feature learning. Despite the good results achieved by the sparse model-based methods, there are still some problems.
The first problem is that most of the methods do not construct graphs to better preserve the geometric structural information of the data during the FS process. Initially, KNN was adopted by some FS methods to construct graphs based on Euclidean distances [22,23,24,25]. To minimize the influence of the redundant features and noise in the original high-dimensional data on the graph construction process, Chen et al. [26] employed local discriminant analysis (LDA) to map the data from the high-dimensional space to a low-dimensional space. Subsequently, numerous graph construction methods based on data correlation have been presented, including the L1 graph [27], low-rank representation (LRR) [28], local structure learning [29], and sparse subspace clustering (SSC) [30], to construct high-quality graphs. These graph construction methods have been integrated into FS models, yielding a large number of improved feature selection algorithms [31,32,33,34,35,36,37]. However, the processes of graph construction and FS in the above-mentioned methods are independent of each other, so the influence of graph construction on the FS process is limited. To this end, some methods have been constructed to unify adaptive graph learning (AGL) and FS into a single framework [38,39,40,41].
The second problem is that the spatial distribution of the sample label information is not sufficiently considered, resulting in the weak discriminative ability of the selected features, which further leads to poor classification or clustering performance. To alleviate this issue, label propagation (LP) has been incorporated into the FS methods [42,43,44]. However, since LP is also a graph-learning-based algorithm, the quality of the learned graph affects the performance to some extent. Therefore, numerous methods have emerged to merge AGL and LP [45,46,47,48]. However, these methods still have the following limitations: (1) the process of AGL is based on the original data; (2) the process of adaptive graph construction only considers the local structure or the global structure. Therefore, these methods are inevitably affected by high-dimensional features or noisy data.
To address the above-mentioned issues, this study develops a novel SSFS framework, SFS-AGGL, which integrates FS, AGL, and LP to capture both global and local structural information of the data for selecting an optimal feature subset with maximum discrimination and minimum redundancy. In AGL, global and local constraints are imposed on the reconstruction coefficients obtained from the self-representation of the selected low-dimensional features. Meanwhile, the similarity matrix obtained by AGL is integrated into LP, enhancing label prediction performance. To improve the discriminative ability of the selected features, the predicted label matrix is introduced into the sparse feature selection (SFS) process. SFS is performed through the mutual promotion of the three models. The framework of the proposed SFS-AGGL is shown in Figure 1.
The primary contributions of this paper are as follows:
(1) An efficient SSFS framework is proposed by combining the advantages of FS, adaptive learning, and LP.
(2) An adaptive learning strategy based on low-dimensional features is designed to counteract the influence of high-dimensional features or noisy data. Moreover, global and local constraints are introduced.
(3) An LP based on an adaptive similarity matrix is introduced to enhance label prediction accuracy.
(4) Comprehensive experiments conducted on multiple real datasets demonstrate that the proposed SFS-AGGL method surpasses existing representative methods in classification and clustering tasks.
The rest of this paper is organized as follows: Section 2 describes some related work; Section 3 outlines the details of the proposed method and the iterative minimization strategy employed to optimize the objective function; Section 4 introduces the experimental setup and provides a comprehensive analysis of the obtained results, including comparisons with eight state-of-the-art methods on five real datasets; and Section 5 provides a summary of our work in this paper.

2. Related Work

In this section, we first provide some commonly used notations. Then, sparse representation and graph construction methods are introduced. Finally, the label propagation algorithm and graph-based semi-supervised sparse feature selection are briefly reviewed.

2.1. Notations

Let $X = [X_l, X_u] = [x_1, \ldots, x_l, x_{l+1}, \ldots, x_{l+u}] \in \mathbb{R}^{m \times n}$ denote the training samples, where $x_i \in \mathbb{R}^m$ denotes the $i$-th sample. $Y = [Y_l \; Y_u]^T \in \mathbb{R}^{c \times n}$ is the label matrix, and $Y_l$ denotes the true labels of the labeled samples. If sample $x_i$ belongs to class $j$, then its corresponding class label is $Y_{ij} = 1$; otherwise, $Y_{ij} = 0$. $Y_u$ denotes the true labels of the unlabeled samples. Since $Y_u$ is unknown during the training process, it is set to a zero matrix during training [49]. The main symbols used in this paper are presented in Table 1.
Common matrix norms include L1, L2, F, and L21 norms. Their detailed definitions are as follows:
$\|B\|_1 = \sum_{i=1}^{m}\sum_{j=1}^{n}|b_{ij}|$
$\|B\|_2 = \sqrt{\sum_{i=1}^{m} b_i^2}$
$\|B\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} b_{ij}^2}$
$\|B\|_{2,1} = \sum_{i=1}^{m}\|B_i\|_2 = \sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n} b_{ij}^2}$
where $B_i$ is the $i$-th row vector of the matrix $B$. According to matrix computation theory, $\|B\|_{2,1} = \operatorname{tr}(B^T U B)$, where $U \in \mathbb{R}^{m \times m}$ is a diagonal matrix with diagonal elements $u_{ii} = 1/\|B_i\|_2$.
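For readers who prefer a computational reference, the short NumPy sketch below (our illustration, not part of the original paper; the matrix B is an arbitrary example) evaluates the four norms exactly as defined above.

```python
import numpy as np

def matrix_norms(B):
    """L1, L2 (treating B as one long vector), Frobenius, and L2,1 norms as defined above."""
    l1 = np.sum(np.abs(B))                          # sum of absolute entries
    l2 = np.sqrt(np.sum(B ** 2))                    # vectorized L2 norm
    fro = np.linalg.norm(B, 'fro')                  # Frobenius norm
    l21 = np.sum(np.sqrt(np.sum(B ** 2, axis=1)))   # sum of row-wise L2 norms
    return l1, l2, fro, l21

B = np.array([[1.0, -2.0], [0.0, 3.0]])
print(matrix_norms(B))  # (6.0, 3.74..., 3.74..., 5.23...)
```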

2.2. Sparse Representation

Sparse representation is a method that was first developed in signal processing. The core idea of sparse representation is to find a target dictionary with which to describe the signal. To be specific, the original signal can be decomposed into a linear combination of elements in the dictionary. Only a few non-zero elements are used to represent the signal information, while the rest can be ignored. Given a sample $X \in \mathbb{R}^n$ and a target dictionary $D$, it is desired to find a coefficient vector $\alpha$ such that the signal $X$ can be represented as a linear combination of the basic elements of the target dictionary $D$:
$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t. } X = D\alpha.$
where $\alpha \in \mathbb{R}^d$ is a one-dimensional vector [50] and $\|\alpha\|_0$ is the L0 norm of $\alpha$. Due to the non-convexity and discontinuity of the L0 norm, the L1 norm is usually used to replace the L0 norm to obtain an approximate solution, as shown in the following formula:
$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t. } X = D\alpha.$
Compared with the L1 norm, the continuous derivability property of the L2 norm can make the optimization algorithm more intuitive. Hence, the L2 norm is commonly used to control overfitting, which can make the weight parameters of the model smoother and avoid overly complex models, as shown in the following formula:
$\min_{\alpha} \|\alpha\|_2 \quad \text{s.t. } X = D\alpha.$
However, the disadvantage of the L2 norm is that the model parameters will be close to 0, but most of them cannot be 0. Therefore, the L21 norm, which is between the L1 and L2 norms, is proposed as an effective scheme, as shown in Equation (8):
$\min_{\alpha} \|\alpha\|_{2,1} \quad \text{s.t. } X = D\alpha.$
The advantage of the L21 norm is that it can drive the elements of entire rows to zero, thus achieving a sparsity effect similar to that of the L1 norm while being more robust.
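As a small illustration (ours, not from the paper), the penalized form of the L1 problem above can be approximated with an off-the-shelf Lasso solver; the dictionary D and signal x below are random placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 200))                    # dictionary with 200 atoms
x = D[:, [3, 17, 90]] @ np.array([1.5, -2.0, 0.7])    # signal built from 3 atoms

# Penalized (Lagrangian) form of the L1-constrained problem:
# min_a  1/(2n) * ||x - D a||^2 + alpha * ||a||_1
lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(D, x)
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))
```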

2.3. Constructing Graph Methods

The KNN graph is a widely used method for constructing a similarity matrix. $S_{ij}$ denotes the similarity between samples $x_i$ and $x_j$, which is defined as:
$S_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|_2^2}{2\delta^2}\right) & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i), \\ 0 & \text{otherwise}, \end{cases}$
where N k ( x j ) is a set that contains k nearest neighbor samples of the sample x j and δ is a parameter.
From Equation (9), it can be seen that as the samples get closer, their similarity also increases. In addition, there are some similar methods, such as the ϵ-neighborhood method [51] and the fully connected method [52], which can also be utilized to construct graphs.
Unlike the KNN graph, the L1 graph is an adaptive graph construction method that aims to reconstruct each sample by finding the best sparse linear combination of the other samples. The objective function of the L1 graph can be described as follows:
$\min_{\alpha_i} \|\alpha_i\|_1 \quad \text{s.t. } x_i = X\alpha_i, \; \alpha_{ii} = 0.$
Then, the weight matrix formed by the L1 graph is expressed as S = [ α 1 , α 2 , , α N ] . Compared with KNN graphs, L1 graphs can adaptively select the nearest samples for each sample.
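The following sketch (our illustration, not code from the paper) builds the heat-kernel KNN similarity matrix of Equation (9); here samples are stored as rows for convenience, and k and delta are free parameters.

```python
import numpy as np

def knn_graph(X, k=5, delta=1.0):
    """X: n x m data matrix (rows are samples). Returns an n x n symmetric
    heat-kernel similarity matrix restricted to the k nearest neighbours."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]                 # skip the sample itself
        S[i, nn] = np.exp(-d2[i, nn] / (2 * delta ** 2))
    return np.maximum(S, S.T)                           # the "or" condition in Eq. (9)

print(knn_graph(np.random.rand(8, 3), k=3).shape)       # (8, 8)
```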

2.4. Label Propagation Algorithm

The label propagation (LP) algorithm is a graph-based semi-supervised classification method that can effectively classify unknown samples using a small number of labeled samples. In the LP algorithm, similar samples should have similar labels. Therefore, the objective function of LP can be expressed as:
$\min_{F \geq 0} \sum_{i=1}^{N}\sum_{j=1}^{N}\|f_i - f_j\|_2^2 s_{ij} + \sum_{i=1}^{N}\|f_i - y_i\|_2^2 u_{ii} = \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big)$
where $s_{ij}$ can be computed by Equation (9) or Equation (10). $U \in \mathbb{R}^{n \times n}$ is a diagonal matrix that effectively utilizes the category information of all samples in SSL. The diagonal elements of this matrix are defined as follows:
$u_{ii} = \begin{cases} \infty & \text{if } x_i \text{ is labeled}, \\ 0 & \text{otherwise}, \end{cases}$
where the symbol $\infty$ represents a relatively large constant. The first term in Equation (11) is based on the similarity of the data, which assigns similar labels to neighboring samples to keep the graph as smooth as possible. The second term aims to minimize the difference between the matrix F and the label Y; i.e., the sample labels predicted by the trained model should be as consistent as possible with the true labels.
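For concreteness, the sketch below (ours, under the stated assumptions) solves Equation (11) in closed form, replacing the large constant by a finite value and adding a tiny ridge for numerical stability; samples are indexed along the rows of F.

```python
import numpy as np

def label_propagation(S, Y, labeled_mask, big=1e6):
    """S: n x n similarity matrix, Y: n x c one-hot labels (zero rows for the
    unlabeled samples), labeled_mask: boolean array of length n.
    Minimizes tr(F^T L F) + tr((F-Y)^T U (F-Y))."""
    n = S.shape[0]
    L = np.diag(S.sum(axis=1)) - S                   # graph Laplacian L = D - S
    U = np.diag(np.where(labeled_mask, big, 0.0))    # large weight on labeled samples
    # gradient: 2*L*F + 2*U*(F - Y) = 0  ->  F = (L + U)^{-1} U Y
    F = np.linalg.solve(L + U + 1e-9 * np.eye(n), U @ Y)
    return F.argmax(axis=1)                          # predicted class per sample
```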

2.5. The Graph-Based Semi-Supervised Sparse Feature Selection

Sparse learning is widely used in machine learning due to its superior feature extraction capabilities. In this context, sparse regularization terms are used to penalize the projection matrix with the aim of selecting features with high sparsity and high discriminative properties. The following equation is commonly used for sparse feature selection:
$\min_{W} \operatorname{Loss}(X, W, Y) + \theta R(W)$
where $\operatorname{Loss}(X, W, Y)$ is a regression (fitting) term, $R(W)$ is a sparse regularization term, and $\theta \geq 0$ is a regularization weight that balances the two terms.
As we know, selecting features only using the information of the labeled sample is inaccurate and unreliable since the labeled samples are insufficient in SSL. Therefore, it is also necessary to make full use of the information of unlabeled samples to improve the performance. The following model can achieve feature selection by introducing an LP algorithm into the semi-supervised sparse model.
$\min_{W, F} \operatorname{Loss}_1(X, W, F) + \theta R(W) + \alpha \operatorname{Loss}_2(F, Y)$
where L o s s 2 ( F , Y ) is the objective function of the LP algorithm shown in Equation (11).
It can be seen that when constructing the above model, the quality of the constructed similarity matrix directly determines the performance. To alleviate this issue, the following graph-based semi-supervised sparse feature selection model has been developed.
$\min_{W, F \geq 0, S \geq 0} \operatorname{Loss}_1(X, W, Y) + \theta R(W) + \alpha \operatorname{Loss}_2(F, Y, S) + \beta \operatorname{Loss}_3(X, S)$
where $\alpha, \beta, \theta \geq 0$ are model parameters and $\operatorname{Loss}_3(X, S)$ is the objective function of graph learning. Several adaptively constructed graph methods have been designed [38,39,40,41,48,53].

3. The Proposed Method

In this part, a detailed introduction of the SFS model is first presented. Second, a new AGL mechanism is introduced to make full use of the global and local information between the samples, which can acquire the geometric structural information of the original data well. Next, the similarity matrix learned by the AGL mechanism is integrated into the LP algorithm, which enhances label prediction performance and allows the model to classify and cluster unlabeled samples more accurately. Finally, the SFS, AGL, and LP models are fused in a unified framework to propose a novel SFS-AGGL method. Moreover, a new iterative updated algorithm is introduced to optimize the proposed model, and its convergence is confirmed through both theoretical and experimental testing.

3.1. Methodology Model

3.1.1. SFS Model

The L21 sparsity constraint is applied to achieve the process of FS. In combination with LSR, a basic SFS model can be obtained as follows:
$\min_{W} \|X^T W - Y\|_2^2 + \theta \|W\|_{2,1} \quad \text{s.t. } W \geq 0,$
where $W \in \mathbb{R}^{d \times c}$ denotes the feature projection matrix and $\theta$ is a regularization parameter.
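To give a feel for Equation (17), here is a small iteratively reweighted least-squares sketch (our illustration; it drops the non-negativity constraint on W for simplicity and uses the standard subgradient of the L2,1 norm).

```python
import numpy as np

def sparse_fs_irls(X, Y, theta=1.0, n_iter=50, eps=1e-8):
    """Approximate solver for min_W ||X^T W - Y||_F^2 + theta * ||W||_{2,1}.
    X: m x n (columns are samples), Y: n x c label matrix."""
    m = X.shape[0]
    W = np.linalg.solve(X @ X.T + theta * np.eye(m), X @ Y)   # ridge initialisation
    for _ in range(n_iter):
        d = 1.0 / (np.linalg.norm(W, axis=1) + eps)           # row-wise reweighting
        W = np.linalg.solve(X @ X.T + 0.5 * theta * np.diag(d), X @ Y)
    return W

# Features can then be ranked by np.linalg.norm(W, axis=1): rows of W with
# larger norms correspond to more important features.
```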

3.1.2. Global and Local Adaptive Graph Learning (AGGL) Model

Although the sparse model-based approach has achieved good results in FS, there are still some problems; e.g., the above-mentioned sparse model focuses only on the sample–label relationship and ignores the geometric structural information among the samples. To better preserve the original data’s geometric structural information, the method of adaptively constructing the nearest neighbor graph is usually adopted. However, the nearest neighbor information in the original feature space may be disturbed by redundant and noisy features. Previous research has shown that feature projection can effectively mitigate the negative impact of redundant and noisy features [54]. Therefore, when learning the nearest neighbor graph, the similarity matrix should be constructed through adaptive updates of sample similarities and their neighboring samples in the projected feature space. Hence, in this paper, the similarity of samples in both the original high-dimensional space and the low-dimensional space is utilized to describe the local distribution structure more accurately, thus enhancing the effectiveness of the graph learning task. Specifically, we use a coefficient reconstruction method to construct the graph, leading to the following model:
$\min_{W, S} \|W^T X - W^T X S\|_2^2 \quad \text{s.t. } W \geq 0, S \geq 0,$
where $S \in \mathbb{R}^{n \times n}$ and $s_i = [s_{i1}, s_{i2}, \ldots, s_{in}]^T \in \mathbb{R}^{n \times 1}$ denotes the reconstruction coefficient vector of sample $x_i$.
The maintenance of global and local sample information is crucial for sample reconstruction. That is, the similarity between the sample that needs to be reconstructed and its surrounding samples should be maintained in the process of sample reconstruction. To achieve this goal, we have incorporated global and local constraints into the sample reconstruction process. This ensures that the sample points are better reconstructed by the most adjacent sample points, thereby improving the quality of the construction graph. Specifically, we have combined global and local constraints with sparse learning to reconstruct samples, as shown in the following formula:
$\|S \odot E\|_1$
where E = [ e i j ] R n × n and each element e i j in E is defined as:
$e_{ij} = \exp\left(\frac{\|x_i - x_j\|^2}{\sigma^2}\right)$
By combining Equations (17) and (18), the following adaptive graph construction model with global and local constraints is obtained:
$\min_{W, S} \|W^T X - W^T X S\|_2^2 + \lambda \|S \odot E\|_1 \quad \text{s.t. } W \geq 0, S \geq 0,$
where λ > 0 is the balance coefficient, which aims to balance the effects of the coefficient reconstruction term and the global and local constraint terms. By constructing the above model, we can effectively maintain the global and local information of the sample, thereby enhancing the similarity matrix of the graph.

3.1.3. Objective Function

As can be seen in Equation (17), the SFS model only utilizes the labeling information of the data. It ignores the spatial distribution of the labels, making it difficult to select the ideal subset of features. It has been shown that the structural distribution information embedded in unlabeled data is very important for FS when there is less labeling information [55]. For this reason, we have introduced the LP algorithm. Meanwhile, to make the LP process more efficient, we have introduced the adaptive graph coefficient matrix obtained by Equation (20) into LP. Therefore, a new SFS-AGGL algorithm is proposed by integrating SFS, AGGL, and LP into a unified learning framework. SFS-AGGL can account for both global and local sample information, and it is robust for FS. The objective function of SFS-AGGL is:
$\min_{W, F, S} \varepsilon(W, F, S) = \beta \|W^T X - W^T X S\|_2^2 + \lambda \|S \odot E\|_1 + \alpha \sum_{i,j=1}^{n}\|f_i - f_j\|_2^2 S_{ij} + \sum_{i=1}^{n}\|f_i - y_i\|_2^2 u_{ii} + \|X^T W - F\|_2^2 + \theta \|W\|_{2,1} \quad \text{s.t. } W \geq 0, F \geq 0, S \geq 0,$
where $\alpha, \beta, \theta, \lambda > 0$ are balance parameters to be tuned in the experiments, and $\odot$ denotes the element-wise (Hadamard) product of matrices.
As shown in Equation (21), we first efficiently obtained the constructive coefficients by imposing global and local constraints while self-representing the low-dimensional features. Therefore, it can avoid possible redundant information to affect the learning performance due to predefined matrices not being introduced. Second, we introduced the similarity matrix obtained by AGL into the LP process to improve the accuracy of label prediction. In addition, to enhance the discriminative performance of the selected features, we introduced a predictive labeling matrix into the SFS process and completed the FS by mutual reinforcement of the three models: SFS, AGGL, and LP.
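To make the roles of the individual terms of Equation (21) concrete, the sketch below (our reading, with α weighting only the graph-smoothness term, as in the later derivation, and with samples indexed along the rows of F and Y so that the diagonal weight U sits between the transposed factors) evaluates the objective for given W, F, and S.

```python
import numpy as np

def locality_matrix(X, sigma=1.0):
    """E from Eq. (19); X is m x n with samples stored as columns, as in the paper."""
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    return np.exp(d2 / sigma ** 2)                  # larger penalty for distant pairs

def sfs_aggl_objective(X, Y, W, F, S, U, E, alpha, beta, theta, lam):
    """Value of Eq. (21): AGGL term + locality penalty + LP terms
    + regression term + L2,1 sparsity."""
    L = np.diag(S.sum(axis=1)) - S                                  # Laplacian of S
    aggl = beta * np.linalg.norm(W.T @ X - W.T @ X @ S) ** 2
    local = lam * np.sum(np.abs(S * E))                             # ||S (Hadamard) E||_1
    lp = alpha * np.trace(F.T @ L @ F) + np.trace((F - Y).T @ U @ (F - Y))
    reg = np.linalg.norm(X.T @ W - F) ** 2
    sparse = theta * np.sum(np.sqrt(np.sum(W ** 2, axis=1)))        # ||W||_{2,1}
    return aggl + local + lp + reg + sparse
```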

3.2. Model Optimization

The objective function of the SFS-AGGL method involves three variables, i.e., the feature projection matrix W, the prediction label matrix F, and the similarity matrix S. Since the objective function is not jointly convex in the three variables, it cannot be optimized directly. However, it is convex with respect to each variable when the other two are fixed. Therefore, we can solve it step-by-step by performing convex optimization on each variable separately. The specific process of solving the objective function is as follows:
(1) Fix the variables F and S, and update the variable W
Simplifying Equation (21) by removing the terms unrelated to the variable W, the following optimization function is obtained:
$\min \varepsilon(W) = \|X^T W - F\|_2^2 + \beta \|W^T X - W^T X S\|_2^2 + \theta \|W\|_{2,1}$
From the definition of matrix trace, Equation (23) can be derived from Equation (22) by using a simple algebraic transformation as follows:
$\min \varepsilon(W) = \operatorname{tr}\big((X^T W - F)(X^T W - F)^T\big) + \beta \operatorname{tr}\big((W^T X - W^T X S)(W^T X - W^T X S)^T\big) + \theta \operatorname{tr}(W^T H W) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T\big) + \beta \operatorname{tr}\big(W^T X X^T W - 2 W^T X S^T X^T W + W^T X S S^T X^T W\big) + \theta \operatorname{tr}(W^T H W)$
To solve Equation (23), a Lagrange multiplier and the corresponding Lagrange functions are introduced, which can be constructed as follows:
$\varepsilon(W, \vartheta) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T + \beta W^T X X^T W - 2\beta W^T X S^T X^T W + \beta W^T X S S^T X^T W + \theta W^T H W\big) + \operatorname{tr}(\vartheta W)$
Next, the partial derivative regarding the variable W is computed and then set to 0 as follows:
$\frac{\partial \varepsilon(W, \vartheta)}{\partial W} = 2 X X^T W - 2 X F + 2\beta X X^T W - 4\beta X S^T X^T W + 2\beta X S S^T X^T W + 2\theta H W + \vartheta = 0$
Meanwhile, by combining the Karush–Kuhn–Tucker (KKT) condition ( ϑ i j W i j = 0 ) , we can obtain Equation (26) as follows:
$\big(2 X X^T W - 2 X F + 2\beta X X^T W - 4\beta X S^T X^T W + 2\beta X S S^T X^T W + 2\theta H W\big)_{ij} W_{ij} = 0$
Therefore, an updated rule for the variable W can be obtained:
$W_{ij} = W_{ij} \dfrac{[X F + 2\beta X S^T X^T W]_{ij}}{[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}}$
(2) Fix the variables W and S, and update the variable F
We first remove the terms unrelated to the variable F from Equation (21), and the optimization function on the variable F is acquired as:
$\min \varepsilon(F) = \|X^T W - F\|_2^2 + \alpha \sum_{i,j=1}^{n}\|f_i - f_j\|_2^2 S_{ij} + \sum_{i=1}^{n}\|f_i - y_i\|_2^2 u_{ii}$
According to the definition of matrix trace, we can use a simple algebraic transformation to obtain Equation (29) as follows:
$\min \varepsilon(F) = \operatorname{tr}\big((X^T W - F)(X^T W - F)^T\big) + \alpha \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T\big) + \alpha \operatorname{tr}(F^T L F) + \operatorname{tr}\big(F U F^T - 2 F U Y^T + Y U Y^T\big) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T + \alpha F^T L F + F U F^T - 2 F U Y^T + Y U Y^T\big)$
Next, we have introduced a Lagrange multiplier to optimize Equation (29), and the corresponding Lagrange function can be defined as follows:
$\varepsilon(F, \mu) = \operatorname{tr}\big(X^T W W^T X - 2 F W^T X + F F^T + \alpha F^T L F + F U F^T - 2 F U Y^T + Y U Y^T\big) + \operatorname{tr}(\mu F)$
Then, we have calculated the partial derivative with respect to the variable F and set it to 0 as follows:
$\frac{\partial \varepsilon(F, \mu)}{\partial F} = -2 X^T W + 2 F + 2\alpha L F + 2 F U - 2 Y U^T + \mu = 0$
Following the KKT condition ( μ i j F i j = 0 ) , we can derive Equation (32) as shown:
$\big(-2 X^T W + 2 F + 2\alpha L F + 2 F U - 2 Y U^T\big)_{ij} F_{ij} = 0$
Finally, we have provided an iterative updated rule for the variable F as follows:
$F_{ij} = F_{ij} \dfrac{[X^T W + Y U^T]_{ij}}{[F + \alpha L F + F U]_{ij}}$
(3) Fix the variables W and F, and update the variable S
Likewise, by removing the terms unrelated to the variable S, the optimization function becomes the following form:
$\min \varepsilon(S) = \alpha \sum_{i,j=1}^{n}\|f_i - f_j\|_2^2 S_{ij} + \beta \|W^T X - W^T X S\|_2^2 + \lambda \|S \odot E\|_1$
Equation (34) can be reduced to Equation (35) as follows:
$\min \varepsilon(S) = \alpha \operatorname{tr}(F^T L F) + \beta \operatorname{tr}\big((W^T X - W^T X S)(W^T X - W^T X S)^T\big) + \lambda \|S \odot E\|_1 = \alpha \operatorname{tr}(F^T L F) + \beta \operatorname{tr}\big(W^T X X^T W - 2 W^T X S^T X^T W + W^T X S S^T X^T W\big) + \lambda \|S \odot E\|_1 = \operatorname{tr}\big(\alpha F^T D F - \alpha F^T S F + \beta W^T X X^T W - 2\beta W^T X S^T X^T W + \beta W^T X S S^T X^T W\big) + \lambda \|S \odot E\|_1$
Here, a Lagrange multiplier is utilized to determine the optimal solution of Equation (35), and the related Lagrange function is formulated as:
$\varepsilon(S, \xi) = \operatorname{tr}\big(\alpha F^T D F - \alpha F^T S F + \beta W^T X X^T W - 2\beta W^T X S^T X^T W + \beta W^T X S S^T X^T W\big) + \lambda \|S \odot E\|_1 + \operatorname{tr}(\xi S)$
The partial derivative regarding the variable S is then set to 0 as follows:
$\frac{\partial \varepsilon(S, \xi)}{\partial S} = -\alpha F F^T - 2\beta X^T W W^T X + 2\beta X^T W W^T X S + \lambda E + \xi = 0$
Since the KKT condition ( ξ i j S i j = 0 ) exists, we can obtain Equation (38) as follows:
$\big(-\alpha F F^T - 2\beta X^T W W^T X + 2\beta X^T W W^T X S + \lambda E\big)_{ij} S_{ij} = 0$
Therefore, an expression of the following form for the variable S can be obtained:
$S_{ij} = S_{ij} \dfrac{[\alpha F F^T + 2\beta X^T W W^T X]_{ij}}{[2\beta X^T W W^T X S + \lambda E]_{ij}}$

3.3. Algorithm Description

Algorithm 1 describes the SFS-AGGL method in detail, while Figure 2 depicts its flowchart. Moreover, the SFS-AGGL algorithm stops iterating when the alteration of the objective function value between consecutive iterations is below a threshold or the maximum number of iterations is reached.
Algorithm 1: SFS-AGGL
Input: Sample matrix $X = [X_L, X_U] \in \mathbb{R}^{d \times n}$
    Label matrix $Y = [Y_l; Y_u]^T \in \mathbb{R}^{n \times c}$
    Parameters $\alpha \geq 0$, $\beta \geq 0$, $\theta \geq 0$, $\lambda \geq 0$
Output: Feature Projection Matrix W
      Predictive Labeling Matrix F
      Similarity Matrix S
1: Initialization: initialize the non-negative matrices $W^0$, $F^0$, $S^0$; set $iter = 0$;
2: Compute the matrix $U$ according to Equation (12);
3: Repeat
4:   Update $W^{iter}$ according to Equation (27): $W \leftarrow W \odot \dfrac{X F + 2\beta X S^T X^T W}{X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W}$ (element-wise division);
5:   Update $F^{iter}$ according to Equation (33): $F \leftarrow F \odot \dfrac{X^T W + Y U^T}{F + \alpha L F + F U}$;
6:   Update $S^{iter}$ according to Equation (39): $S \leftarrow S \odot \dfrac{\alpha F F^T + 2\beta X^T W W^T X}{2\beta X^T W W^T X S + \lambda E}$;
7:   Update $E$ according to Equation (19);
8:   Update $iter = iter + 1$;
9: Until convergence
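For readers who want to trace Algorithm 1 step by step, the following NumPy sketch (ours; a minimal re-implementation of the multiplicative updates (27), (33), and (39) under the row-sample convention for F and Y, so the diagonal weight U multiplies from the left where the paper writes the transposed form) shows one possible realisation. It assumes non-negative input data, as in the image datasets used later, and adds a small epsilon to avoid division by zero.

```python
import numpy as np

def sfs_aggl(X, Y, U, E, alpha, beta, theta, lam, n_iter=200, eps=1e-12):
    """X: m x n (columns are samples), Y: n x c soft labels, U, E: n x n.
    Returns W (m x c), F (n x c), S (n x n)."""
    m, n = X.shape
    c = Y.shape[1]
    rng = np.random.default_rng(0)
    W, F, S = rng.random((m, c)), rng.random((n, c)), rng.random((n, n))
    for _ in range(n_iter):
        H = np.diag(1.0 / (np.sqrt(np.sum(W ** 2, axis=1)) + eps))  # subgradient of ||W||_{2,1}
        L = np.diag(S.sum(axis=1)) - S                              # Laplacian of the current S
        XXt = X @ X.T
        # Update W (Eq. (27))
        numW = X @ F + 2 * beta * X @ S.T @ X.T @ W
        denW = XXt @ W + beta * XXt @ W + beta * X @ S @ S.T @ X.T @ W + theta * H @ W
        W *= numW / (denW + eps)
        # Update F (Eq. (33)); U acts from the left under the row-sample convention
        numF = X.T @ W + U @ Y
        denF = F + alpha * L @ F + U @ F
        F *= numF / (denF + eps)
        # Update S (Eq. (39))
        G = X.T @ W @ W.T @ X
        numS = alpha * F @ F.T + 2 * beta * G
        denS = 2 * beta * G @ S + lam * E
        S *= numS / (denS + eps)
        # E depends only on X (Eq. (19)), so it stays fixed in this sketch
    return W, F, S
```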

3.4. Computational Complexity and Convergence Analysis

3.4.1. Computational Complexity Analysis

Based on Algorithm 1, the computational complexity of the SFS-AGGL algorithm comprises two parts. The first part is the computation of the diagonal auxiliary matrix U in step 2, and the second part is the updating of the three matrices (W, F, and S) during each iteration and the computation of the local matrix E in step 7. The computational cost of each component is given in Table 2. Therefore, the total complexity of the SFS-AGGL algorithm is $O\big(\max(kn^2, cn^2) + iter \times \max(cmn, cn^2)\big)$, where iter is the iteration count. Furthermore, the computational complexities of other related FS methods are also presented in Table 3.

3.4.2. Proof of Convergence

Definition 1. 
If the functions $\varphi(q, q')$ and $\psi(q)$ satisfy the two conditions shown in Equation (40), then $\varphi(q, q')$ is an auxiliary function of $\psi(q)$:
$\varphi(q, q') \geq \psi(q), \quad \varphi(q, q) = \psi(q).$
Lemma 1. 
If Definition 1 holds, then $\psi(q)$ is non-increasing under the update rule in Equation (41):
$q^{(iter+1)} = \arg\min_{q} \varphi(q, q^{(iter)})$
Proof. 
$\psi(q^{(iter+1)}) \leq \varphi(q^{(iter+1)}, q^{(iter)}) \leq \varphi(q^{(iter)}, q^{(iter)}) = \psi(q^{(iter)})$
It is therefore only necessary to show that the objective function is non-increasing in the variables W, F, and S under the update rules, as shown in Equation (42). For this purpose, we have computed and presented the first- and second-order derivatives of each formula in Table 4. □
Lemma 2. 
$\varphi(W_{ij}, W_{ij}^{(iter)}) = \psi_{ij}(W_{ij}^{(iter)}) + \psi'_{ij}(W_{ij}^{(iter)})(W_{ij} - W_{ij}^{(iter)}) + \frac{[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}}{W_{ij}^{(iter)}}(W_{ij} - W_{ij}^{(iter)})^2$
$\varphi(F_{ij}, F_{ij}^{(iter)}) = \psi_{ij}(F_{ij}^{(iter)}) + \psi'_{ij}(F_{ij}^{(iter)})(F_{ij} - F_{ij}^{(iter)}) + \frac{[F + \alpha L F + F U]_{ij}}{F_{ij}^{(iter)}}(F_{ij} - F_{ij}^{(iter)})^2$
$\varphi(S_{ij}, S_{ij}^{(iter)}) = \psi_{ij}(S_{ij}^{(iter)}) + \psi'_{ij}(S_{ij}^{(iter)})(S_{ij} - S_{ij}^{(iter)}) + \frac{[2\beta X^T W W^T X S + \lambda E]_{ij}}{S_{ij}^{(iter)}}(S_{ij} - S_{ij}^{(iter)})^2$
Equations (43)–(45) are auxiliary functions of the corresponding $\psi_{ij}$.
Proof. 
The second-order Taylor series expansion of $\psi_{ij}(W_{ij})$ around $W_{ij}^{(iter)}$ is:
$\psi_{ij}(W_{ij}) = \psi_{ij}(W_{ij}^{(iter)}) + \psi'_{ij}(W_{ij}^{(iter)})(W_{ij} - W_{ij}^{(iter)}) + \frac{1}{2}\psi''_{ij}(W_{ij}^{(iter)})(W_{ij} - W_{ij}^{(iter)})^2 = \psi_{ij}(W_{ij}^{(iter)}) + \psi'_{ij}(W_{ij}^{(iter)})(W_{ij} - W_{ij}^{(iter)}) + \big(X X^T + \beta X X^T - 2\beta X S X^T + \beta X S S^T X^T + \theta H^T\big)_{ii}(W_{ij} - W_{ij}^{(iter)})^2$
$\varphi(W_{ij}, W_{ij}^{(iter)}) \geq \psi_{ij}(W_{ij})$ is equivalent to:
$\frac{[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}}{W_{ij}^{(iter)}} \geq \big(X X^T + \beta X X^T - 2\beta X S X^T + \beta X S S^T X^T + \theta H^T\big)_{ii}$
By comparing Equations (43) and (46), this inequality holds because of Equations (48)–(50) below; therefore $\varphi(W_{ij}, W_{ij}^{(iter)}) \geq \psi_{ij}(W_{ij})$ holds, and $\varphi(W_{ij}, W_{ij}) = \psi_{ij}(W_{ij})$ also holds.
$[X X^T W]_{ij} = \sum_{k=1}^{r}(X X^T)_{ik} W_{kj}^{(iter)} \geq (X X^T)_{ii} W_{ij}^{(iter)}$
$[X S S^T X^T W]_{ij} = \sum_{k=1}^{r}(X S S^T X^T)_{ik} W_{kj}^{(iter)} \geq (X S S^T X^T)_{ii} W_{ij}^{(iter)}$
$[H W]_{ij} = \sum_{k=1}^{r} H_{ik} W_{kj}^{(iter)} \geq H_{ii} W_{ij}^{(iter)}$
Similarly, it is possible to prove Equations (44) and (45). Finally, based on Lemma 1, the update schemes for the variables W, F, S are derived in this paper as shown in Equations (51)–(53).
Theorem 1. 
For $W \geq 0$, $F \geq 0$, and $S \geq 0$, the objective function is non-increasing under the iterative update formulas (27), (33), and (39).
Proof. 
 
Bringing Equation (43) into Equation (41):
$W_{ij}^{(iter+1)} = \arg\min_{W_{ij}} \varphi(W_{ij}, W_{ij}^{(iter)}) = W_{ij}^{(iter)} - W_{ij}^{(iter)}\frac{\psi'_{ij}(W_{ij}^{(iter)})}{2[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}} = W_{ij}^{(iter)}\frac{[X F + 2\beta X S^T X^T W]_{ij}}{[X X^T W + \beta X X^T W + \beta X S S^T X^T W + \theta H W]_{ij}}$
Bringing Equation (44) into Equation (41):
$F_{ij}^{(iter+1)} = \arg\min_{F_{ij}} \varphi(F_{ij}, F_{ij}^{(iter)}) = F_{ij}^{(iter)} - F_{ij}^{(iter)}\frac{\psi'_{ij}(F_{ij}^{(iter)})}{2[F + \alpha L F + F U]_{ij}} = F_{ij}^{(iter)}\frac{[X^T W + Y U^T]_{ij}}{[F + \alpha L F + F U]_{ij}}$
Bringing Equation (45) into Equation (41):
$S_{ij}^{(iter+1)} = \arg\min_{S_{ij}} \varphi(S_{ij}, S_{ij}^{(iter)}) = S_{ij}^{(iter)} - S_{ij}^{(iter)}\frac{\psi'_{ij}(S_{ij}^{(iter)})}{2[2\beta X^T W W^T X S + \lambda E]_{ij}} = S_{ij}^{(iter)}\frac{[\alpha F F^T + 2\beta X^T W W^T X]_{ij}}{[2\beta X^T W W^T X S + \lambda E]_{ij}}$
Since Equations (43)–(45) are auxiliary functions of $\psi_{ij}$, $\psi_{ij}$ is non-increasing under the respective update rules in Equations (51)–(53).
Next, the upcoming focus will be on demonstrating the convergence of iteration-based Algorithm 1.
For any non-zero vectors u R m and v R m , the following inequalities exist:
$\|u\|_2 - \frac{\|u\|_2^2}{2\|v\|_2} \leq \|v\|_2 - \frac{\|v\|_2^2}{2\|v\|_2}$
The proof of Equation (54) can be found in the literature [55]. □
Theorem 2. 
Referring to Algorithm 1, Equation (21) decreases in each iteration until it converges.
Proof. 
Let $H^{(iter)}$ denote the matrix H computed at the iter-th iteration. Then, updating $W^{(iter+1)}$, $F^{(iter+1)}$, and $S^{(iter+1)}$ yields the following inequality:
$\phi(W^{(iter+1)}, F^{(iter+1)}, S^{(iter+1)}, H^{(iter)}) \leq \phi(W^{(iter)}, F^{(iter)}, S^{(iter)}, H^{(iter)})$
According to Equation (55), we obtain:
For brevity, let $\Psi(W, F, S) = \operatorname{tr}\big((X^T W - F)(X^T W - F)^T\big) + \alpha \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) + \beta \operatorname{tr}\big((W^T X - W^T X S)(W^T X - W^T X S)^T\big) + \lambda \|S \odot E\|_1$ denote all terms of Equation (21) except the $\theta\|W\|_{2,1}$ term, and write $\Psi^{(t)} = \Psi(W^{(t)}, F^{(t)}, S^{(t)})$. Then Equation (55) can be written as:
$\Psi^{(iter+1)} + \theta \operatorname{tr}\big((W^{(iter+1)})^T H^{(iter)} W^{(iter+1)}\big) \leq \Psi^{(iter)} + \theta \operatorname{tr}\big((W^{(iter)})^T H^{(iter)} W^{(iter)}\big)$
Again, based on the definition of matrix H i t e r , Equation (56) can be rewritten as:
$\Psi^{(iter+1)} + \theta \sum_{i=1}^{m}\frac{\|(W^{(iter+1)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2} \leq \Psi^{(iter)} + \theta \sum_{i=1}^{m}\frac{\|(W^{(iter)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2}$
Thus, there is the following inequality:
$\Psi^{(iter+1)} + \theta \sum_{i=1}^{m}\|(W^{(iter+1)})^i\|_2 - \theta\left(\sum_{i=1}^{m}\|(W^{(iter+1)})^i\|_2 - \sum_{i=1}^{m}\frac{\|(W^{(iter+1)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2}\right) \leq \Psi^{(iter)} + \theta \sum_{i=1}^{m}\|(W^{(iter)})^i\|_2 - \theta\left(\sum_{i=1}^{m}\|(W^{(iter)})^i\|_2 - \sum_{i=1}^{m}\frac{\|(W^{(iter)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2}\right)$
According to Equation (54), we have:
$\sum_{i=1}^{m}\|(W^{(iter+1)})^i\|_2 - \sum_{i=1}^{m}\frac{\|(W^{(iter+1)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2} \leq \sum_{i=1}^{m}\|(W^{(iter)})^i\|_2 - \sum_{i=1}^{m}\frac{\|(W^{(iter)})^i\|_2^2}{2\|(W^{(iter)})^i\|_2}$
Considering Equations (55)–(59) together, the following results can be obtained:
$\Psi^{(iter+1)} + \theta \sum_{i=1}^{m}\|(W^{(iter+1)})^i\|_2 \leq \Psi^{(iter)} + \theta \sum_{i=1}^{m}\|(W^{(iter)})^i\|_2$
The inequality in Equation (60) shows that the value of the objective function does not increase from one iteration to the next, indicating the optimization algorithm’s progress toward a more optimal solution at each step. In addition, since the objective function is bounded from below, our proposed optimization algorithm will converge. We also conducted numerical experiments to further verify the effectiveness of the optimization algorithm, and the experimental results demonstrate that the objective function value consistently decreases as the number of iterations increases.

4. Experiment and Analysis

In this section, the effectiveness of the proposed method is validated on classification and clustering tasks, respectively. We first used five image classification datasets to test the classification performance of the proposed method and then employed two image datasets and two subsets of UCI data to verify the clustering performance of the proposed method. In the experiment, we compared our proposed method with some contemporary UFS and SSFS methods, including two UFS methods (SPNFSR [56] and NNSAFS [57]) and six SSFS methods (RLSR [19], FDEFS [50], GS3FS [43], S2LFS [44], AGLRM [47], and ASLCGLFS [48]).

4.1. Description of the Comparison Methods

In order to verify the effectiveness of our method and to comprehensively evaluate its strengths and weaknesses, we compared it with several classical and recent benchmark methods for unsupervised and semi-supervised FS that are closely related to our approach. Our method can be regarded as an improvement and extension of these existing methods.
(1) SPNFSR is a UFS algorithm that uses a low-rank representation graph for maintaining feature structures, and it achieves FS by using the L21 norm and non-negative constraints on the reconstruction coefficient matrix. The objective function of the SPNFSR method can be defined as follows:
$\min_{W} \|X - X W\|_{2,1} + \alpha \operatorname{tr}(W^T M W) + \beta \|W\|_{2,1} \quad \text{s.t. } W \geq 0,$
where the matrix M is obtained by solving the low-rank representation. In the SPNFSR method, the processes of graph construction and feature selection are performed independently, so the quality of the matrix M will directly affect the performance of feature selection.
(2) NNSAFS is a UFS algorithm that employs adaptive rank constraints and non-negative spectral feature learning. It employs sparse regression and feature mapping to mine the local structural information of the feature space to improve the adaptability of manifold learning. The objective function of NNSAFS can be defined as follows:
$\min \|X^T W - F\|_2^2 + \alpha_1 \|W\|_1 + \alpha_2 \operatorname{Tr}(W^T L_W W) + \lambda \operatorname{Tr}(F^T L_S F) + \beta \sum_{ij}(s_{ij}\log s_{ij}) \quad \text{s.t. } W \geq 0, \; F^T F = I, \; \sum_{j=1}^{n} s_{ij} = 1, \; s_{ij} > 0,$
where $\sum_{ij}(s_{ij}\log s_{ij})$ is an entropy regularization term used to estimate the uniformity of the matrix S. Compared with the SPNFSR method, NNSAFS integrates graph learning and feature selection into one framework to overcome the shortcoming of SPNFSR. Moreover, the local structural information of the learned features is also considered. However, since NNSAFS and SPNFSR are unsupervised methods and do not consider the label information of the data, they cannot select features with good discriminability.
(3) RLSR is an SSFS method, which identifies key features by learning the global and sparse solutions of the feature projection matrix. It also redefines regression coefficients with a deflation factor, as shown in Equation (63):
$\min_{W, b, Y_U} \|X^T W + \mathbf{1} b^T - Y\|_F^2 + \gamma \|W\|_{2,1}^2 \quad \text{s.t. } Y_U \geq 0, \; Y_U \mathbf{1} = \mathbf{1}.$
Different from the SPNFSR and NNSAFS methods, RLSR is a semi-supervised selection method that can use both labeled and unlabeled samples to improve the discriminability of features. Moreover, it also uses the L21 norm instead of the L1 norm to reduce the redundancy of selected features.
(4) FDEFS is a supervised or semi-supervised FS method that combines margin discriminant embedding, manifold embedding, and sparse regression to achieve feature selection.
$\min \mu\big(\|W\|_{2,1} + \gamma \|X^T W + \mathbf{1}_n b^T - F\|_2^2\big) + \operatorname{tr}(Z^T L_1 Z) \quad \text{s.t. } L_1 = L + \lambda \tilde{M}_l,$
where M l ˜ is a square matrix, and the detailed calculation procedure is provided in [50]. FDEFS can be regarded as an extension of RLSR by combining discriminant embedding terms and manifold embedding terms to enhance the discriminability of selected features.
(5) GS3FS is a robust graph-based SSFS method that selects relevant and sparse features through manifold learning and the L2p norm imposed on the regularization and loss functions.
$\min \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) + \|X^T W + \mathbf{1}_n b^T - F\|_{2,p}^p + \lambda \|W\|_{2,p}^p \quad \text{s.t. } F \geq 0, \; p \in (0, 1].$
Compared with the FDEFS method, GS3FS first integrates the LP into FDEFS. Moreover, GS3FS uses the L2p norm instead of the L21 norm to highlight the robustness of the selected features.
(6) S2LFS is a novel SSFS that can select different subsets for different categories rather than selecting one subset for all categories.
$\min \sum_{k=1}^{c}\|g_k - X^T w_k\|_2 + \lambda \sum_{k=1}^{c} w_k^T \operatorname{diag}(z_k)^{-1} w_k + \beta\big(\operatorname{tr}(G^T L G) + \operatorname{tr}((G - Y)^T U (G - Y))\big) \quad \text{s.t. } G \geq 0, \; G^T G = I_c, \; z_k \geq 0, \; z_k^T \mathbf{1}_d = 1,$
where z k is an indicator vector representing whether a feature is chosen or not for the k-th class, and w k is the prediction function for the k-th class based on the selected features.
(7) AGLRM uses AGL techniques to enhance similarity matrix construction and mitigate the adverse impact of redundant features by minimizing redundant terms.
$\min \gamma \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) + \alpha \|S\|_F^2 + \operatorname{tr}(W^T X L X^T W) + \theta \operatorname{tr}(W^T A W) + \|X^T W + \mathbf{1}_n b^T - F\|_F^2 + \lambda \|W\|_{2,1} \quad \text{s.t. } 0 \leq S_{ij} \leq 1, \; S_i \mathbf{1}_n = 1,$
where A is a matrix of correlation coefficients for evaluating feature correlations.
Although the performance of the AGLRM method is superior to other methods, it still has shortcomings. First, the weight matrix of the graph is constrained by the L2 norm, which results in the graph lacking a sparse structure. Second, global constraints are not considered in the graph learning process, which leads to neglect of the distribution of the data and failure to explore more effective feature similarity metrics, thus affecting the performance of the method.
(8) ASLCGLFS improves similarity matrix quality by integrating label information into AGL. Additionally, it considers both local and global structures of the samples, thereby reducing redundancy in the selected features.
$\min \|X^T W - F\|_F^2 + \sum_{i,j}^{n}\|W^T(X_i - X_j)\|_2^2 S_{ij} + \alpha \|S - A\|_F^2 + \operatorname{tr}(F^T L F) + \operatorname{tr}\big((F - Y) U (F - Y)^T\big) + \|W^T X - W^T X Z\|_F^2 + \beta \|Z\|_{2,1} + \lambda \|W\|_{2,1} \quad \text{s.t. } 0 \leq S_{ij} \leq 1, \; S_i^T \mathbf{1}_n = 1, \; \alpha, \beta, \lambda \geq 0.$
As an improvement of AGLRM, ASLCGLFS considers global information. However, the introduction of a predefined similarity matrix may bring in redundant information, which affects the learning performance. Therefore, instead of introducing predefined matrices, we consider imposing new global and local constraints to learn global and local information and to reduce redundancy, in order to improve the performance of feature selection.

4.2. Classification Experiments

4.2.1. Classification Datasets

Five publicly available image datasets were used in the classification experiment, which includes four face classification datasets (AR [58], CMU PIE [59], Extended YaleB [60], ORL [61]) and one object classification dataset (COIL20 [62]). Table 5 presents the detailed information of these datasets, in which P1 and P2 indicate training and test samples per category, respectively.
The AR dataset is a widely used standard database consisting of more than 4000 color facial images. These images are from 126 faces, including 56 females and 70 males. The images in this dataset have variable expressions, lighting changes, and external occlusions. Figure 3a shows some images from this database.
The CMU PIE dataset consists of 41,368 grayscale facial images of 68 individuals. These images cover subjects of different ages, genders, and skin tones with different postural conditions, lighting environments, and expressions. Figure 3b shows some examples in this dataset.
The Extended YaleB dataset was collected from 38 subjects; 64 photos were selected for each subject under different poses, different lighting environments, and 5 different shooting angles. This dataset has a total of 2414 face images. Figure 3c shows some images from the Extended YaleB dataset.
The ORL dataset contains 400 images of faces from 40 volunteers. Each volunteer provided 10 images with different facial postures, facial expressions, and facial ornaments obscured, such as serious or smiling, eyes up or squinting, and wearing or not wearing accessories. Some of the examples from this dataset can be observed in Figure 3d.
The COIL20 dataset comprises 1440 images featuring 20 different subjects. A total of 72 images were taken for each subject at 5-degree intervals. Some of the images from COIL20 are shown in Figure 3e.
It should be mentioned that in most existing work, these face databases (AR, CMU PIE, Extended YaleB, and ORL) are commonly used to evaluate the performance of the method because of the following aspects: (1) each database has different numbers of original data and categories; (2) each database contains different types of face variations; (3) each database has different conditions and environments for image acquisition. By using these classical facial datasets to evaluate our proposed method, we can ensure that our experimental results are adequately comparable to previous findings, thus better assessing the novelty and effectiveness of our proposed method in the field of face recognition.

4.2.2. Evaluation Metric

The accuracy rate [63] is employed to measure the performance of SFS-AGGL on the classification task, which is represented as:
$ACC = \frac{TP + TN}{TP + FP + FN + TN} \times 100\%$
where TP and TN represent the numbers of correctly identified positive and negative samples. Additionally, false positive (FP) and false negative (FN) signify the misclassification of negative samples as positive and positive samples as negative, respectively. A higher accuracy rate value indicates improved classification performance.

4.2.3. Experimental Setup for Classification Task

In this experiment, P1 samples are randomly selected from each class for training, and the remaining P2 samples are used for testing. Then, an FS model is used to select a limited number of relevant features from the training data, and the model’s effectiveness is assessed by applying KNN to the testing samples using only the selected subset of features. For the sake of experimental fairness and reliability, each experiment is conducted 10 times using different training data, and the final experimental results are reported as the average classification accuracy and standard deviation. In addition, to select the optimal parameters, we used the grid search method to find the optimal values of the parameters α, β, θ, and λ in the range {0.001, 0.01, 0.1, 1, 10, 100, 1000} and the optimal number of iterations in {100, 200, 300, 400, 500, 600}. The dimensions of the selected features vary from 50 to 500 in increments of 50.
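A compact sketch of the evaluation step described above (our illustration; in the real experiments W would come from SFS-AGGL trained on each random split, with the four parameters chosen by grid search, whereas here random data and a random W are used just to show the mechanics of ranking features by the row norms of W and classifying with KNN):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy_with_selected_features(X_tr, y_tr, X_te, y_te, W, n_feat=200, k=1):
    """Rank features by the L2 norm of the corresponding rows of W (a common
    convention for L2,1-based selectors), keep the top n_feat, classify with KNN."""
    idx = np.argsort(-np.linalg.norm(W, axis=1))[:n_feat]
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_tr[:, idx], y_tr)
    return clf.score(X_te[:, idx], y_te)

rng = np.random.default_rng(0)
X_tr, X_te = rng.random((60, 500)), rng.random((40, 500))
y_tr, y_te = rng.integers(0, 5, 60), rng.integers(0, 5, 40)
W = rng.random((500, 5))                        # placeholder projection matrix
print(knn_accuracy_with_selected_features(X_tr, y_tr, X_te, y_te, W, n_feat=100))
```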

4.2.4. Analysis of Classification Results

(1) Parameter sensitivity analysis of classification
The effects of feature dimension (d), number of iterations (iter), and four balance parameters (α, β, θ, λ) on the performance of SFS-AGGL in the classification task are investigated. To assess SFS-AGGL’s performance across varied experimental scenarios, the number of feature dimensions, iteration times, and the values of four balancing parameters were adjusted.
First, we have demonstrated the influence of different iteration times on the performance of SFS-AGGL, with the remaining parameters set to their optimal values. As shown in Figure 4, the classification accuracy varies with the iterations, showing an increasing trend. However, the classification accuracy will decrease or remain stable with an increasing iteration after reaching its peak. This demonstrates that SFS-AGGL can reduce the impact of noise and redundant features and effectively overcome overfitting problems.
Second, the performance of different methods in different feature dimensions is shown in Figure 5. From Figure 5, we can find that the accuracy obtained by all methods is relatively lower when the feature dimensions are smaller. On the contrary, the performance of all methods gradually improves as the number of selected features increases. In most cases, the proposed SFS-AGGL outperforms the comparison methods, which indicates the stronger discriminative ability of the features selected by SFS-AGGL. However, the performance of some methods decreases as the number of selected features increases. This may be due to the presence of redundant or noisy information features in higher dimensions. Nevertheless, SFS-AGGL still surpasses the comparison methods in classification accuracy. The experimental results further validate the enhanced robustness of the features chosen by the SFS-AGGL method.
Third, the performance of the proposed SFS-AGGL with different values of the four balancing parameters α, β, θ, and λ on different datasets is tested. The classification results for each balance parameter are depicted in Figure 6. From Figure 6, the following conclusions can be drawn:
(1) The parameter α is used to control LP. The performance of SFS-AGGL is very sensitive to parameter α on different datasets.
(2) The parameter β affects the performance of AGL. SFS-AGGL achieves the best performance when β is set to 0.01 for the AR dataset and β is set to 0.1 for other datasets. In addition, the classification accuracy of SFS-AGGL on the ORL dataset is insensitive to different values of β. In contrast, the classification performance is very sensitive to the parameter β on other datasets. Therefore, β should be set to a smaller value to obtain better classification results.
(3) The parameter θ determines the significance of the sparse feature projection terms. The performance of SFS-AGGL is insensitive to parameter θ on the ORL, COIL20, and AR datasets, but it is very sensitive on the Extended YaleB and CMU PIE datasets.
(4) The parameter λ determines the importance of global and local constraint terms. SFS-AGGL achieves high accuracy on each dataset when the value of λ is small. However, the performance of SFS-AGGL decreases with increasing λ for the CMU PIE, Extended YaleB, and AR datasets. This indicates that there is significant variation among intraclass samples in these datasets. Therefore, λ should be set to a smaller value in the case of large differences between intraclass samples.
In summary, different values of the balancing parameters will have different effects on different datasets. The optimal parameter combinations for each dataset are listed in Table 6.
(2) Comparative analysis of classification performance
First, this section validates the classification performance of SFS-AGGL compared to other methods on the five image datasets. Table 7 presents the optimal average classification accuracy and the corresponding standard deviations for the different methods. The results in Table 7 show that: (1) SSFS methods outperform the UFS methods, which indicates that the guidance of a small number of labels is crucial to improving the performance; (2) the joint FS algorithms achieve better performance than the RLSR method, which indicates that the correlation information among features is important for improving the FS performance; (3) the semi-supervised methods RLSR and FDEFS are inferior to the other semi-supervised methods, which demonstrates that introducing the LP algorithm into semi-supervised methods is favorable for selecting discriminative features; (4) the proposed SFS-AGGL method outperforms the ASLCGLFS method, notably since it integrates global and local constraints into AGL. Therefore, it is beneficial to fully consider LP and AGL in the SSFS approach to improve performance.
Then, to demonstrate the superiority of SFS-AGGL, we employed one-tailed t-tests to determine whether SFS-AGGL significantly outperformed the comparison methods. The null hypothesis assumed that the results achieved by SFS-AGGL were equal to those of the comparison method, while the alternative hypothesis assumed that they were greater. For instance, in comparing SFS-AGGL with RLSR (SFS-AGGL vs. RLSR), the hypotheses are defined as H0: SFS-AGGL = RLSR and H1: SFS-AGGL > RLSR, where SFS-AGGL and RLSR represent the average classification results obtained by SFS-AGGL and RLSR on different datasets, respectively. The experiment sets a statistical significance level of 0.05, and Table 8 presents the p values of pairwise one-tailed t-tests on different datasets.
From Table 8, it can be seen that the performance of all methods is comparable on ORL and COIL datasets since these two datasets are relatively simple compared with other datasets, but the accuracy of our method is still slightly higher than that of other methods. Moreover, for AR, CMU PIE, and Extended YaleB databases, our method was able to significantly outperform the other comparative methods, indicating that our method is more advantageous in dealing with complex datasets.

4.3. Clustering Experiments

This section validates the effectiveness of the SFS-AGGL method for clustering tasks. For this purpose, we used the face dataset ORL and the object dataset COIL20, as well as two UCI datasets (Libras Movement and Landsat [64]) in the experiment.

4.3.1. Clustering Datasets

The Libras Movement dataset contains 15 gestures with a total of 360 samples and 89 attributes, while the Landsat dataset contains multispectral images of six different geographic regions with a total of 296 samples and 36 attributes. The details of all clustering datasets used are shown in Table 9.

4.3.2. Evaluation Metrics

Multiple metrics, such as ACC, NMI, purity, ARI, F-score, precision, and recall [65], are applied to evaluate the clustering performance.
ACC represents clustering accuracy, which is defined as:
$$ \mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\big(y_i, \mathrm{map}(\bar{y}_i)\big)}{n} $$
where $\delta(x, y) = 1$ if $x = y$ and $0$ otherwise, $n$ is the total number of samples, $y_i$ and $\bar{y}_i$ denote the ground-truth label and clustering label of the $i$-th sample, respectively, and $\mathrm{map}(\cdot)$ is a function that maps the learned clustering labels to align with the ground-truth labels.
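The definition above leaves the choice of map(·) open; a common implementation, assumed in the sketch below, matches clusters to ground-truth classes with the Hungarian (linear sum assignment) algorithm on the cluster-class contingency matrix. The function name clustering_accuracy is ours, not from the paper.

```python
# Sketch of ACC with map(.) realized via the Hungarian algorithm (an assumption;
# the paper does not specify the mapping procedure).
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency matrix: rows are predicted clusters, columns are true classes.
    count = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, k in enumerate(clusters):
        for j, c in enumerate(classes):
            count[i, j] = np.sum((y_pred == k) & (y_true == c))
    # Choose the cluster-to-class assignment that maximizes matched samples.
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / len(y_true)

print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0
```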
NMI is the normalized mutual information for clustering, which is defined as:
$$ \mathrm{NMI} = \frac{\mathrm{MI}(U, V)}{\sqrt{H(U)\,H(V)}} $$
where $\mathrm{MI}(U, V)$ denotes the mutual information between the two label assignments $U$ and $V$, and $H(U)$ and $H(V)$ denote their entropies. The normalization ensures fair comparisons between sets of different sizes.
ARI is the adjusted Rand index, which is defined as:
$$ \mathrm{ARI} = \frac{\mathrm{RI} - \mathrm{Expected\_RI}}{\mathrm{RI\_max} - \mathrm{Expected\_RI}} $$
where RI (Rand index) denotes the proportion of sample pairs that are clustered consistently with the ground truth; Expected_RI denotes the Rand index expected under random clustering; and RI_max denotes the maximum possible Rand index. This adjustment accounts for randomness, so the ARI ranges between −1 and 1, where a value closer to 1 indicates better clustering performance.
Purity measures the proportion of samples that fall into the dominant true category of their assigned cluster:
$$ \mathrm{Purity} = \frac{1}{N} \sum_{k} \max_{j} \left| C_k \cap G_j \right| $$
where $C_k$ denotes the $k$-th cluster, $G_j$ denotes the $j$-th true category, and $N$ denotes the total number of samples.
Precision reflects the ratio of correctly clustered positive samples to all samples identified as positive.
$$ \mathrm{Precision} = \frac{TP}{TP + FP} $$
Recall indicates the proportion of correctly clustered positive samples relative to all actual positive samples.
$$ \mathrm{Recall} = \frac{TP}{TP + FN} $$
F-score is the harmonic mean of precision and recall, providing a comprehensive assessment of both performance metrics.
$$ F\text{-}score = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$
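For reference, the sketch below computes purity and the pair-counting versions of precision, recall, and F-score that the definitions above suggest, and delegates NMI and ARI to scikit-learn; it is an illustrative implementation under those assumptions, not the authors' evaluation code.

```python
# Purity and pair-counting precision/recall/F-score from the class-cluster
# contingency matrix; NMI and ARI come from scikit-learn.
import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity_and_pairwise(y_true, y_pred):
    C = contingency_matrix(y_true, y_pred)      # rows: true classes, cols: clusters
    purity = C.max(axis=0).sum() / C.sum()      # dominant true class per cluster
    tp = comb(C, 2).sum()                       # same-class pairs grouped together
    pred_pairs = comb(C.sum(axis=0), 2).sum()   # pairs placed in the same cluster
    true_pairs = comb(C.sum(axis=1), 2).sum()   # pairs sharing the same true class
    precision, recall = tp / pred_pairs, tp / true_pairs
    f_score = 2 * precision * recall / (precision + recall)
    return purity, precision, recall, f_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(purity_and_pairwise(y_true, y_pred))
print(normalized_mutual_info_score(y_true, y_pred), adjusted_rand_score(y_true, y_pred))
```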

4.3.3. Experimental Setup for Clustering

In this experiment, the four balancing parameters (α, β, θ, and λ) were selected from the range {0.001, 0.01, 0.1, 1, 10, 100, 1000} for all datasets, while the number of selected features d was drawn from {50, 100, 150, 200, 250, 300, 350, 400, 450, 500} for the ORL and COIL20 datasets, {8, 16, 24, 32, 40, 48, 56, 64, 72, 80} for the Libras Movement dataset, and {3, 6, 9, 12, 15, 18, 21, 24, 27, 30} for the Landsat dataset.
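A minimal sketch of the resulting grid search is given below; the evaluate callable (which would run SFS-AGGL with a given configuration and return a clustering score) is a hypothetical placeholder for whichever pipeline is being tuned.

```python
# Exhaustive search over the parameter grid described above. `evaluate` is a
# placeholder scoring function, not part of the paper's code.
from itertools import product

PARAM_GRID = [0.001, 0.01, 0.1, 1, 10, 100, 1000]             # alpha, beta, theta, lambda
DIM_GRID = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]  # e.g., ORL / COIL20

def grid_search(evaluate, param_grid=PARAM_GRID, dims=DIM_GRID):
    best_score, best_cfg = float("-inf"), None
    for alpha, beta, theta, lam in product(param_grid, repeat=4):
        for d in dims:
            score = evaluate(alpha=alpha, beta=beta, theta=theta, lam=lam, d=d)
            if score > best_score:
                best_score, best_cfg = score, (alpha, beta, theta, lam, d)
    return best_cfg, best_score
```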

4.3.4. Analysis of Clustering Results

(1) Parameter sensitivity analysis of clustering
Figure 7 illustrates the clustering results of SFS-AGGL on the four datasets under varying parameter values. When the selected feature dimension is fixed, the clustering performance first increases, then decreases, and finally rises again as the parameter α grows. The performance of SFS-AGGL is sensitive to different parameter values on different datasets, which underscores the importance of tuning these values to achieve optimal clustering performance. Smaller values of the regularization parameters β and λ yield better overall performance across the datasets. This demonstrates that the proposed SFS-AGGL can not only acquire neighboring information in the projected feature space but also capture the global and local sparse structures in the original feature space, ultimately leading to good performance. On the COIL20 and Landsat datasets, the performance of SFS-AGGL first improves and then decreases as the regularization parameter λ increases, indicating that SFS-AGGL is more sensitive to the sparse structure learning on these datasets. In summary, setting all balancing parameters to smaller values enhances the clustering results of SFS-AGGL; furthermore, it is advisable to tailor the parameter values to each dataset to achieve optimal outcomes.
Figure 8 shows the clustering results obtained by sequentially setting each balancing parameter to different values while keeping all other conditions at optimal values. It can be found that the performance of SFS-AGGL is insensitive to all parameters in most cases. Notably, the clustering accuracy of SFS-AGGL on the ORL dataset is relatively sensitive to an increase in the parameter β. Therefore, it is recommended to set β to a larger value for optimizing clustering performance.
(2) Comparative analysis of clustering performance
In this experiment, the k-means method is adopted to cluster the low-dimensional features selected by each FS method. To minimize the impact of initialization on the k-means method, we performed 10 clustering experiments with varied random initializations. Table 10, Table 11, Table 12 and Table 13 display the average values and standard deviations of ACC, NMI, purity, ARI, F-score, precision, and recall for the RLSR, FDEFS, GS3FS, S2LFS, AGLRM, ASLCGLFS, and SFS-AGGL methods on the ORL, COIL20, Libras Movement, and Landsat datasets. These results further illustrate the superiority of the proposed SFS-AGGL compared to other comparative methods.
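The sketch below reproduces this protocol under our stated assumptions: X_selected holds the features retained by a selection method, clustering_accuracy is the ACC sketch given in Section 4.3.2, and k-means is re-run 10 times with different random initializations.

```python
# Run k-means 10 times with different random seeds on the selected features and
# report the mean and standard deviation of ACC (other metrics follow the same pattern).
import numpy as np
from sklearn.cluster import KMeans

def evaluate_clustering(X_selected, y_true, n_clusters, n_runs=10):
    scores = []
    for seed in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit_predict(X_selected)
        scores.append(clustering_accuracy(y_true, labels))  # ACC sketch from Section 4.3.2
    return float(np.mean(scores)), float(np.std(scores))
```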

4.4. Convergence and Runtime Analysis

In this section, experiments were performed on seven datasets to assess the convergence and runtime of the proposed SFS-AGGL method. Figure 9 shows the convergence curves of SFS-AGGL. From Figure 9, we can see that the objective function value of SFS-AGGL converges in fewer than 100 iterations, which validates the efficiency of the proposed iterative optimization method. Table 14 reports the runtime of SFS-AGGL when the number of iterations is set to 100 and the feature dimension is set to 500. The results in Table 14 indicate that the runtime of our proposed method is slightly higher than that of AGLRM but lower than that of the other methods. It is noteworthy that, after GPU optimization, the runtime of SFS-AGGL is lower than that of all comparative methods.
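The convergence behaviour reported here can be monitored with a simple stopping rule, sketched below; update_step and objective stand in for the multiplicative updates and objective value of SFS-AGGL and are placeholders rather than the actual implementation.

```python
# Iterate until the relative change of the objective falls below a tolerance or
# a maximum of 100 iterations is reached (matching the iteration budget above).
def run_until_converged(update_step, objective, state, max_iter=100, tol=1e-5):
    history = [objective(state)]
    for _ in range(max_iter):
        state = update_step(state)
        history.append(objective(state))
        if abs(history[-2] - history[-1]) <= tol * max(abs(history[-2]), 1e-12):
            break
    return state, history
```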

5. Conclusions and Discussion

This paper proposes a semi-supervised feature selection algorithm based on an adaptive graph with global and local constraints (SFS-AGGL). The algorithm considers the sample neighborhood structure within the projected feature space, dynamically learns the optimal nearest-neighbor graph among samples, and maintains the global and local sparse structures within the selected feature subset, thereby preserving the geometric structural information of the original data. Moreover, it can effectively leverage the structural distribution information of labeled data to infer label information for unlabeled samples. The incorporation of the L2,1 norm in the SFS model enhances its resilience to noisy features. An iterative optimization approach is employed to solve for the optimal variables, and its convergence is validated, confirming that the SFS-AGGL algorithm converges. Extensive experiments on real datasets validate the classification and clustering performance of the proposed SFS-AGGL method. Although our method achieves good performance, several issues still need to be pointed out, as follows:
1. Because the proposed method exploits the correlation and geometric structure of the data, it is best suited to data whose features are significantly correlated and whose distribution exhibits a clear local structure.
2. Because the method only considers the local and global structural information of the data, its applicability is limited on datasets in which such structure is weak.
3. As a linear feature selection method, it cannot effectively extract features from data with complex nonlinear structures.
To overcome the above-mentioned shortcomings, we plan to carry out the following work in the future:
1. We will introduce additional constraints to capture and represent the structural information of the data more comprehensively.
2. We will integrate deep learning into the feature selection process to extract effective features from highly unstructured data.

Author Contributions

Data curation, Y.Y., H.Z., N.Z., G.X. and X.H.; Formal analysis, X.H., H.Z., N.Z., X.H. and G.X.; Methodology, Y.Y., H.Z., W.Z. and C.Z.; Resources, Y.Y., H.Z., N.Z. and W.Z.; Supervision, W.Z. and C.Z.; Writing—original draft, Y.Y. and H.Z.; Writing—review and editing, Y.Y., H.Z., W.Z. and C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by grants from the National Natural Science Foundation of China (Nos. 62062040 and 62006174), the Outstanding Youth Project of Jiangxi Natural Science Foundation (No. 20212ACB212003), the Jiangxi Province Key Subject Academic and Technical Leader Funding Project (No. 20212BCJ23017), the Science and Technology Research Project of Jiangxi Provincial Department of Education (No. GJJ210330), and the Fund of the Jilin Provincial Science and Technology Department (No. 20220201157GX).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data were derived from public domain resources.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wen, J.; Yang, S.; Wang, C.D.; Jiang, Y.; Li, R. Feature-splitting Algorithms for Ultrahigh Dimensional Quantile Regression. J. Econom. 2023, 2023, 105426. [Google Scholar] [CrossRef]
  2. Lue, X.; Long, L.; Deng, R.; Meng, R. Image feature extraction based on fuzzy restricted Boltzmann machine. Measurement 2022, 204, 112063. [Google Scholar] [CrossRef]
  3. Sheikhpour, R.; Sarram, M.A.; Gharaghani, S.; Chahooki, M.A.Z. A survey on semi-supervised feature selection methods. Pattern Recognit. 2017, 64, 141–158. [Google Scholar] [CrossRef]
  4. Mafarja, M.; Qasem, A.; Heidari, A.A.; Aljarah, I.; Faris, H.; Mirjalili, S. Efficient hybrid nature-inspired binary optimizers for feature selection. Cogn. Comput. 2020, 12, 150–175. [Google Scholar] [CrossRef]
  5. Huang, G.Y.; Hung, C.Y.; Chen, B.W. Image feature selection based on orthogonal ℓ2,0 norms. Measurement 2022, 199, 111310. [Google Scholar] [CrossRef]
  6. Cai, J.; Luo, J.; Wang, S.; Yang, S. Feature selection in machine learning: A new perspective. Neurocomputing 2018, 300, 70–79. [Google Scholar] [CrossRef]
  7. Solorio-Fernández, S.; Carrasco-Ochoa, J.A.; Martínez-Trinidad, J.F. A systematic evaluation of filter Unsupervised Feature Selection methods. Expert Syst. Appl. 2020, 162, 113745. [Google Scholar] [CrossRef]
  8. Bhadra, T.; Bandyopadhyay, S. Supervised feature selection using integration of densest subgraph finding with floating forward–backward search. Inf. Sci. 2021, 566, 1–18. [Google Scholar] [CrossRef]
  9. Mann, G.S.; McCallum, A. Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data. J. Mach. Learn. Res. 2010, 11, 955–984. [Google Scholar]
  10. Hou, C.; Nie, F.; Li, X.; Yi, D.; Wu, Y. Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Trans. Cybern. 2013, 44, 793–804. [Google Scholar]
  11. Wang, L.; Jiang, S.; Jiang, S. A feature selection method via analysis of relevance, redundancy, and interaction. Expert Syst. Appl. 2021, 183, 115365. [Google Scholar] [CrossRef]
  12. Dokeroglu, T.; Deniz, A.; Kiziloz, H.E. A comprehensive survey on recent metaheuristics for feature selection. Neurocomputing 2022, 494, 2966. [Google Scholar] [CrossRef]
  13. Nie, F.; Zhu, W.; Li, X. Structured graph optimization for unsupervised feature selection. IEEE Trans. Knowl. Data Eng. 2019, 33, 1210–1222. [Google Scholar] [CrossRef]
  14. Zhao, Z.; Liu, H. Semi-supervised feature selection via spectral analysis. In Proceedings of the 2007 SIAM International Conference on Data Mining; Society for Industrial and Applied Mathematics, Minneapolis, MN, USA, 26–28 April 2007; pp. 641–646. [Google Scholar]
  15. Toğaçar, M.; Ergen, B.; Cömert, Z. Classification of flower species by using features extracted from the intersection of feature selection methods in convolutional neural network models. Measurement 2020, 158, 107703. [Google Scholar] [CrossRef]
  16. Chen, X.; Song, L.; Hou, Y.; Shao, G. Efficient semi-supervised feature selection for VHR remote sensing images. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 1500–1503. [Google Scholar]
  17. Peng, S.; Lu, J.; Cao, J.; Peng, Q.; Yang, Z. Adaptive graph regularization method based on least square regression for clustering. Signal Process. Image Commun. 2023, 114, 116938. [Google Scholar] [CrossRef]
  18. Chang, X.; Nie, F.; Yang, Y.; Huang, H. A convex formulation for semi-supervised multi-label feature selection. In Proceedings of the AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014; Volume 28. [Google Scholar]
  19. Chen, X.; Yuan, G.; Nie, F.; Huang, J.Z. Semi-supervised feature selection via rescaled linear regression. In Proceedings of the Twenty Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 1525–1531. [Google Scholar]
  20. Chen, X.; Chen, R.; Wu, Q.; Nie, F.; Yang, M.; Mao, R. Semi supervised feature selection via structured manifold learning. IEEE Trans. Cybern. 2021, 52, 5756–5766. [Google Scholar] [CrossRef]
  21. Liu, Z.; Lai, Z.; Ou, W.; Zhang, K.; Zheng, R. Structured optimal graph based sparse feature extraction for semi-supervised learning. Signal Process. 2020, 170, 107456. [Google Scholar] [CrossRef]
  22. Akbar, S.; Hayat, M.; Tahir, M.; Chong, K.T. cACP-2LFS: Classification of anticancer peptides using sequential discriminative model of KSAAP and two-level feature selection approach. IEEE Access 2020, 8, 131939–131948. [Google Scholar] [CrossRef]
  23. Bakir-Gungor, B.; Hacilar, H.; Jabeer, A.; Nalbantoglu, O.U.; Aran, O.; Yousef, M. Inflammatory bowel disease biomarkers of human gut microbiota selected via ensemble feature selection methods. PeerJ 2022, 10, e13205. [Google Scholar] [CrossRef]
  24. Ahmed, N.; Rafiq, J.I.; Islam, M.R. Enhanced human activity recognition based on smartphone sensor data using hybrid feature selection model. Sensors 2020, 20, 317. [Google Scholar] [CrossRef]
  25. López, D.; Ramírez-Gallego, S.; García, S.; Xiong, N.; Herrera, F. BELIEF: A distance-based redundancy-proof feature selection method for Big Data. Inf. Sci. 2021, 558, 124–139. [Google Scholar] [CrossRef]
  26. Chen, X.; Yuan, G.; Wang, W.; Nie, F.; Chang, X.; Huang, J.Z. Local adaptive projection framework for feature selection of labeled and unlabeled data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 6362–6373. [Google Scholar] [CrossRef] [PubMed]
  27. Cheng, B.; Yang, J.; Yan, S.; Fu, Y.; Huang, T.S. Learning with l1-graph for image analysis. IEEE Trans. Image Process. 2009, 19, 858–866. [Google Scholar] [CrossRef]
  28. Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 171–184. [Google Scholar] [CrossRef]
  29. Singh, R.P.; Ojha, D.; Jadon, K.S. A Survey on Various Representation Learning of Hypergraph for Unsupervised Feature Selection. In Data, Engineering and Applications: Select Proceedings of IDEA 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 71–82. [Google Scholar]
  30. Elhamifar, E.; Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2765–2781. [Google Scholar] [CrossRef]
  31. Zhong, G.; Pun, C.M. Subspace clustering by simultaneously feature selection and similarity learning. Knowl. Based Syst. 2020, 193, 105512. [Google Scholar] [CrossRef]
  32. Wan, Y.; Sun, S.; Zeng, C. Adaptive similarity embedding for unsupervised multi-view feature selection. IEEE Trans. Knowl. Data Eng. 2020, 33, 3338–3350. [Google Scholar] [CrossRef]
  33. Shang, R.; Song, J.; Jiao, L.; Li, Y. Double feature selection algorithm based on low-rank sparse non-negative matrix factorization. Int. J. Mach. Learn. Cybern. 2020, 11, 1891–1908. [Google Scholar] [CrossRef]
  34. Zhu, J.; Jang-Jaccard, J.; Liu, T.; Zhou, J. Joint spectral clustering based on optimal graph and feature selection. Neural Process. Lett. 2021, 53, 257–273. [Google Scholar] [CrossRef]
  35. Sha, Y.; Faber, J.; Gou, S.; Liu, B.; Li, W.; Schramm, S.; Stoecker, H.; Steckenreiter, T.; Vnucec, D.; Wetzstein, N.; et al. An acoustic signal cavitation detection framework based on XGBoost with adaptive selection feature engineering. Measurement 2022, 192, 110897. [Google Scholar] [CrossRef]
  36. Zhu, P.; Hou, X.; Tang, K.; Liu, Y.; Zhao, Y.P.; Wang, Z. Unsupervised feature selection through combining graph learning and ℓ2, 0-norm constraint. Inf. Sci. 2023, 622, 68–82. [Google Scholar] [CrossRef]
  37. Mei, S.; Zhao, W.; Gao, Q.; Yang, M.; Gao, X. Joint feature selection and optimal bipartite graph learning for subspace clustering. Neural Netw. 2023, 164, 408–418. [Google Scholar] [CrossRef] [PubMed]
  38. Zhou, P.; Du, L.; Li, X.; Shen, Y.D.; Qian, Y. Unsupervised feature selection with adaptive multiple graph learning. Pattern Recognit. 2020, 105, 107375. [Google Scholar] [CrossRef]
  39. Bai, X.; Zhu, L.; Liang, C.; Li, J.; Nie, X.; Chang, X. Multi-view feature selection via nonnegative structured graph learning. Neurocomputing 2020, 387, 110–122. [Google Scholar] [CrossRef]
  40. Zhou, P.; Chen, J.; Du, L.; Li, X. Balanced spectral feature selection. IEEE Trans. Cybern. 2022, 53, 4232–4244. [Google Scholar] [CrossRef]
  41. Miao, J.; Yang, T.; Sun, L.; Fei, X.; Niu, L.; Shi, Y. Graph regularized locally linear embedding for unsupervised feature selection. Pattern Recognit. 2022, 122, 108299. [Google Scholar] [CrossRef]
  42. Xie, G.B.; Chen, R.B.; Lin, Z.Y.; Gu, G.S.; Yu, J.R.; Liu, Z.; Cui, J.; Lin, L.; Chen, L. Predicting lncRNA–disease associations based on combining selective similarity matrix fusion and bidirectional linear neighborhood label propagation. Brief. Bioinform. 2023, 24, bbac595. [Google Scholar] [CrossRef]
  43. Sheikhpour, R.; Sarram, M.A.; Gharaghani, S.; Chahooki, M.A.Z. A robust graph-based semi-supervised sparse feature selection method. Inf. Sci. 2020, 531, 13–30. [Google Scholar] [CrossRef]
  44. Li, Z.; Tang, J. Semi-supervised local feature selection for data classification. Sci. China Inf. Sci. 2021, 64, 192108. [Google Scholar] [CrossRef]
  45. Jiang, B.; Wu, X.; Zhou, X.; Liu, Y.; Cohn, A.G.; Sheng, W.; Chen, H. Semi-supervised multiview feature selection with adaptive graph learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–15. [Google Scholar] [CrossRef]
  46. Shang, R.; Zhang, X.; Feng, J.; Li, Y.; Jiao, L. Sparse and low-dimensional representation with maximum entropy adaptive graph for feature selection. Neurocomputing 2022, 485, 57–73. [Google Scholar] [CrossRef]
  47. Lai, J.; Chen, H.; Li, T.; Yang, X. Adaptive graph learning for semi-supervised feature selection with redundancy minimization. Inf. Sci. 2022, 609, 465–488. [Google Scholar] [CrossRef]
  48. Lai, J.; Chen, H.; Li, W.; Li, T.; Wan, J. Semi-supervised feature selection via adaptive structure learning and constrained graph learning. Knowl.-Based Syst. 2022, 251, 109243. [Google Scholar] [CrossRef]
  49. Luo, T.; Hou, C.; Nie, F.; Tao, H.; Yi, D. Semi-supervised feature selection via insensitive sparse regression with application to video semantic recognition. IEEE Trans. Knowl. Data Eng. 2018, 30, 1943–1956. [Google Scholar] [CrossRef]
  50. Moosaei, H.; Hladík, M. Sparse solution of least-squares twin multi-class support vector machine using ℓ0 and ℓp-norm for classification and feature selection. Neural Netw. 2023, 166, 471–486. [Google Scholar] [CrossRef]
  51. Favati, P.; Lotti, G.; Menchi, O.; Romani, F. Construction of the similarity matrix for the spectral clustering method: Numerical experiments. J. Comput. Appl. Math. 2020, 375, 112795. [Google Scholar] [CrossRef]
  52. Qu, J.; Zhao, X.; Xiao, Y.; Chang, X.; Li, Z.; Wang, X. Adaptive Manifold Graph representation for Two-Dimensional Discriminant Projection. Knowl.-Based Syst. 2023, 266, 110411. [Google Scholar] [CrossRef]
  53. Ma, Z.; Wang, J.; Li, H.; Huang, Y. Adaptive graph regularized non-negative matrix factorization with self-weighted learning for data clustering. Appl. Intell. 2023, 53, 28054–28073. [Google Scholar] [CrossRef]
  54. Yang, S.; Wen, J.; Zhan, X.; Kifer, D. ET-lasso: A new efficient tuning of lasso-type regularization for high-dimensional data. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 607–616. [Google Scholar]
  55. Huang, S.; Xu, Z.; Wang, F. Nonnegative matrix factorization with adaptive neighbors. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 486–493. [Google Scholar]
  56. Zhou, W.; Wu, C.; Yi, Y.; Luo, G. Structure preserving non-negative feature self-representation for unsupervised feature selection. IEEE Access 2017, 5, 8792–8803. [Google Scholar] [CrossRef]
  57. Shang, R.; Zhang, W.; Lu, M.; Jiao, L.; Li, Y. Feature selection based on non-negative spectral feature learning and adaptive rank constraint. Knowl.-Based Syst. 2022, 236, 107749. [Google Scholar] [CrossRef]
  58. Martinez, A.; Benavente, R. The AR Face Database: CVC Technical Report; Computer Vision Center: Barcelona, Spain, 1998; Volume 24. [Google Scholar]
  59. Sim, T.; Baker, S.; Bsat, M. The CMU pose, illumination, and expression (PIE) database. In Proceedings of the Fifth IEEE International Conference on Automatic Face Gesture Recognition, Washington, DC, USA, 20–21 May 2002; pp. 53–58. [Google Scholar]
  60. Zhang, L.; Zhang, L.; Zhang, D.; Zhu, H. Online finger-knuckle-print verification for personal authentication. Pattern Recognit. 2010, 43, 2560–2571. [Google Scholar] [CrossRef]
  61. Samaria, F.S.; Harter, A.C. Parameterisation of a stochastic model for human face identification. In Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, Seattle, WA, USA, 21–23 June 1994; pp. 138–142. [Google Scholar]
  62. Nene, S.A.; Nayar, S.K.; Murase, H. Columbia Object Image Library (COIL-20); Columbia University: New York, NY, USA, 1996. [Google Scholar]
  63. Yi, Y.; Lai, S.; Li, S.; Dai, J.; Wang, W.; Wang, J. RRNMF-MAGL: Robust regularization non-negative matrix factorization with multi-constraint adaptive graph learning for dimensionality reduction. Inf. Sci. 2023, 640, 119029. [Google Scholar] [CrossRef]
  64. Blake, C.L.; Merz, C.J. UCI Repository of Machine Learning Databases; Department of Information and Computer Science, University of California: Irvine, CA, USA, 1998; p. 55. [Google Scholar]
  65. Li, Z.; Tang, C.; Zheng, X.; Liu, X.; Zhang, W.; Zhu, E. High-order correlation preserved incomplete multi-view subspace clustering. IEEE Trans. Image Process. 2022, 31, 2067–2080. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The illustration of the SFS-AGGL framework.
Figure 2. Flow chart of SFS-AGGL algorithm.
Figure 3. Sample images of five datasets.
Figure 4. Classification accuracy of SFS-AGGL under different iterations.
Figure 5. Classification accuracy of different methods under different feature dimensions.
Figure 6. Classification results of SFS-AGGL under different parameter values.
Figure 7. Clustering results of SFS-AGGL under different parameter values and different feature dimensions, where different colors represent different feature dimensions.
Figure 8. Clustering results of SFS-AGGL under different parameter values.
Figure 9. Convergence curves of SFS-AGGL on different datasets.
Table 1. Definition of the main symbols in this paper.
Notation | Description | Notation | Description
$X \in R^{d \times n}$ | Sample matrix | $0 \in R^{u \times c}$ | Zero matrix
$X_l \in R^{d \times l}$ | Labeled sample matrix | $d$ | Sample dimension
$X_u \in R^{d \times (n-l)}$ | Unlabeled sample matrix | $n$ | Sample size
$Y \in R^{n \times c}$ | Label matrix | $k$ | Number of selected features
$F \in R^{n \times c}$ | Predictive labeling matrix | $c$ | Number of categories
$S \in R^{n \times n}$ | Weighting matrix | $l$ | Number of label samples
$E \in R^{n \times n}$ | Local adaptation matrix | $+\infty$ | Infinitely large numbers
$W \in R^{d \times c}$ | Weighting matrix | $\odot$ | Matrix dot product
$I \in R^{u \times u}$ | Unit matrix | $tr(\cdot)$ | Traces of matrix
Table 2. The time complexity of each matrix in our proposed algorithm.
Matrix | Formula | Time Complexity
$U$ | $U = [u_{ii}] \in R^{n \times n}$ | $O(n^2)$
$E$ | $E = [e_{ij}] \in R^{n \times n}$ | $O(kn^2)$
$W$ | $W_{ij} = W_{ij} \frac{[XF + 2\beta XS^{T}X^{T}W]_{ij}}{[XX^{T}W + \beta XX^{T}W + \beta XSS^{T}X^{T}W + \theta HW]_{ij}}$ | $O(cmn)$
$F$ | $F_{ij} = F_{ij} \frac{[X^{T}W + YU^{T}]_{ij}}{[F + \alpha LF + FU]_{ij}}$ | $O(cmn)$
$S$ | $S_{ij} = S_{ij} \frac{[\alpha FF^{T} + 2\beta X^{T}WW^{T}X]_{ij}}{[2\beta X^{T}WW^{T}XS + \lambda E]_{ij}}$ | $O(cn^2)$
Table 3. Computational complexity of each iteration for FS methods.
Method | Number of Variables | Algorithm Complexity
RLSR [19] | 2 | $O(iter \times \max(ndc, n^3))$
FDEFS [49] | 3 | $O(\max(cmn, cn^2))$
GS3FS [43] | 4 | $O(iter \times \max(d^3, n^3))$
S2LFS [44] | 3 | $O(cd^2n + cd^3 + cn^2)$
AGLRM [47] | 4 | $O(iter \times \max(d^3, n^3))$
ASLCGLFS [48] | 4 | $O(iter \times \max(n^3, d^3))$
SFS-AGGL | 3 | $O(\max(kn^2, cn^2) + iter \times \max(cmn, n^2))$
Table 4. First- and second-order derivatives of each formula.
Variable | Derivative | Expression
$W$ | $\psi_{ij}(W_{ij})$ | $[X^{T}WW^{T}X - 2FW^{T}X + \beta W^{T}XX^{T}W - 2\beta W^{T}XS^{T}X^{T}W + \beta W^{T}XSS^{T}X^{T}W + \theta W^{T}HW]_{ij}$
 | $\psi'_{ij}(W_{ij})$ | $2[XX^{T}W - XF + \beta XX^{T}W - 2\beta XS^{T}X^{T}W + \beta XSS^{T}X^{T}W + \theta HW]_{ij}$
 | $\psi''_{ij}(W_{ij})$ | $2[XX^{T} + \beta XX^{T} - 2\beta XSX^{T} + \beta XSS^{T}X^{T} + \theta H^{T}]_{ii}$
$F$ | $\psi_{ij}(F_{ij})$ | $[-2FW^{T}X + FF^{T} + \alpha F^{T}LF + FUF^{T} - 2FUY^{T}]_{ij}$
 | $\psi'_{ij}(F_{ij})$ | $2[-X^{T}W + F + \alpha LF + FU - YU^{T}]_{ij}$
 | $\psi''_{ij}(F_{ij})$ | $2[E + \alpha L^{T} + U^{T}]_{ii}$
$S$ | $\psi_{ij}(S_{ij})$ | $[\alpha F^{T}SF - 2\beta W^{T}XS^{T}X^{T}W + \beta W^{T}XSS^{T}X^{T}W + \lambda SE]_{ij}$
 | $\psi'_{ij}(S_{ij})$ | $[-\alpha FF^{T} - 2\beta X^{T}WW^{T}X + 2\beta X^{T}WW^{T}XS + \lambda E]_{ij}$
 | $\psi''_{ij}(S_{ij})$ | $2[2\beta X^{T}WW^{T}X]_{ii}$
Table 5. Details of the five image datasets.
Dataset | Size of Image | Size of Classes | Size per Class | P1 | P2 | Type
AR | 32 × 32 | 100 | 14 | 7 | 7 | Face
CMU PIE | 32 × 32 | 68 | 24 | 12 | 12 | Face
Extended YaleB | 32 × 32 | 38 | 64 | 20 | 44 | Face
ORL | 32 × 32 | 40 | 10 | 7 | 3 | Face
COIL20 | 32 × 32 | 20 | 72 | 20 | 52 | Object
Table 6. Optimal parameter combination for SFS-AGGL on the five datasets.
Dataset | {d, t, α, β, θ, λ}
AR | {400, 200, 1, 0.01, 0.01, 0.001}
CMU PIE | {200, 200, 10, 0.1, 0.001, 0.001}
Extended YaleB | {300, 100, 10, 0.1, 10, 0.001}
ORL | {500, 100, 1000, 0.1, 0.001, 0.001}
COIL20 | {150, 100, 0.1, 0.1, 10, 1}
Table 7. Best results of each method on five image datasets (ACC).
Method | AR | CMU PIE | Extended YaleB | ORL | COIL20
NNSAFS | 63.90 ± 2.12 (400) | 85.29 ± 0.64 (500) | 62.58 ± 1.39 (300) | 92.17 ± 1.81 (500) | 93.56 ± 1.32 (100)
SPNFSR | 64.50 ± 0.91 (200) | 86.22 ± 1.02 (300) | 64.02 ± 1.95 (300) | 92.83 ± 1.81 (500) | 94.21 ± 1.42 (400)
RLSR | 64.37 ± 1.58 (500) | 84.66 ± 1.25 (500) | 64.57 ± 0.87 (300) | 95.67 ± 1.75 (500) | 93.73 ± 1.25 (500)
FDEFS | 63.51 ± 1.29 (500) | 85.85 ± 1.06 (500) | 65.01 ± 1.00 (500) | 96.25 ± 1.37 (500) | 94.35 ± 1.42 (450)
GS3FS | 63.90 ± 1.37 (450) | 85.83 ± 0.68 (500) | 61.85 ± 1.18 (500) | 96.25 ± 1.48 (450) | 93.38 ± 1.35 (500)
S2LFS | 64.20 ± 1.48 (500) | 87.50 ± 0.85 (500) | 64.67 ± 0.82 (500) | 96.42 ± 1.62 (400) | 94.95 ± 1.18 (500)
AGLRM | 64.39 ± 1.58 (450) | 86.90 ± 0.72 (450) | 61.89 ± 1.13 (500) | 96.08 ± 1.42 (500) | 95.09 ± 1.48 (200)
ASLCGLFS | 67.07 ± 1.62 (250) | 87.71 ± 1.12 (150) | 64.36 ± 1.19 (400) | 96.25 ± 1.37 (500) | 95.31 ± 1.13 (100)
SFS-AGGL | 68.03 ± 1.58 (400) | 88.97 ± 1.11 (200) | 66.35 ± 1.22 (300) | 96.42 ± 1.31 (500) | 95.80 ± 1.16 (500)
Numbers in parentheses denote the feature dimensions yielding the optimal results.
Table 8. p values of the pairwise one-tailed t-tests on five image datasets.
Method | AR | CMU PIE | Extended YaleB | ORL | COIL20
RLSR vs. SFS-AGGL | 3.14 × 10−5 | 4.40 × 10−8 | 6.97 × 10−4 | 7.03 × 10−1 | 6.24 × 10−4
FDEFS vs. SFS-AGGL | 7.82 × 10−7 | 1.03 × 10−6 | 0.74 × 10−2 | 9.44 × 10−1 | 1.12 × 10−2
GS3FS vs. SFS-AGGL | 3.36 × 10−6 | 6.90 × 10−8 | 5.95 × 10−8 | 9.36 × 10−1 | 2.14 × 10−4
S2LFS vs. SFS-AGGL | 1.29 × 10−5 | 1.10 × 10−3 | 9.62 × 10−4 | 9.53 × 10−1 | 6.23 × 10−2
AGLRM vs. SFS-AGGL | 1.55 × 10−5 | 1.96 × 10−5 | 5.10 × 10−8 | 8.84 × 10−1 | 1.23 × 10−1
ASLCGLFS vs. SFS-AGGL | 9.87 × 10−2 | 7.50 × 10−3 | 8.17 × 10−4 | 9.44 × 10−1 | 1.76 × 10−1
Table 9. Details of four clustering datasets.
Dataset | Number of Samples | Dimension | Category
ORL | 400 | 1024 | 40
COIL20 | 1440 | 1024 | 20
Libras Movement | 360 | 89 | 15
Landsat | 296 | 36 | 6
Table 10. The best clustering results of different methods on ORL dataset.
Method | ACC | NMI | Purity | ARI | F-Score | Precision | Recall
RLSR | 62.79 ± 2.89 (500) | 81.04 ± 1.83 (500) | 66.93 ± 2.19 (500) | 49.88 ± 3.78 (100) | 51.13 ± 3.64 (100) | 44.28 ± 4.33 (100) | 60.75 ± 2.65 (500)
FDEFS | 62.82 ± 3.69 (200) | 81.27 ± 1.59 (100) | 67.25 ± 3.08 (100) | 50.13 ± 3.71 (100) | 51.37 ± 3.60 (100) | 44.55 ± 3.85 (100) | 60.88 ± 3.64 (50)
GS3FS | 62.21 ± 1.55 (50) | 80.99 ± 0.74 (50) | 66.18 ± 1.33 (50) | 49.86 ± 1.58 (50) | 51.11 ± 1.53 (50) | 44.17 ± 1.79 (50) | 60.79 ± 2.32 (150)
S2LFS | 61.93 ± 3.35 (350) | 80.62 ± 1.45 (350) | 66.82 ± 2.37 (350) | 48.55 ± 3.61 (350) | 49.82 ± 3.51 (350) | 43.55 ± 3.60 (350) | 58.74 ± 4.44 (400)
AGLRM | 64.21 ± 3.70 (50) | 81.84 ± 1.89 (50) | 68.00 ± 3.14 (50) | 51.16 ± 4.40 (50) | 52.36 ± 4.26 (50) | 45.80 ± 4.72 (50) | 61.29 ± 4.00 (50)
ASLCGLFS | 58.32 ± 3.68 (250) | 78.56 ± 2.33 (250) | 63.32 ± 3.10 (250) | 44.22 ± 5.03 (250) | 45.62 ± 4.85 (250) | 39.37 ± 5.58 (300) | 54.62 ± 3.95 (300)
SFS-AGGL | 67.96 ± 2.30 (250) | 84.17 ± 1.50 (400) | 71.89 ± 1.94 (500) | 56.89 ± 3.34 (400) | 57.95 ± 3.25 (400) | 50.69 ± 3.47 (400) | 67.71 ± 3.11 (400)
Table 11. The best clustering results of different methods on COIL20 dataset.
Method | ACC | NMI | Purity | ARI | F-Score | Precision | Recall
RLSR | 60.45 ± 3.98 (150) | 72.19 ± 2.00 (250) | 63.27 ± 3.19 (50) | 50.84 ± 3.42 (300) | 53.42 ± 3.19 (300) | 48.67 ± 3.97 (300) | 59.38 ± 2.56 (50)
FDEFS | 58.52 ± 3.44 (50) | 70.67 ± 2.62 (400) | 61.55 ± 3.23 (50) | 48.14 ± 4.12 (50) | 50.93 ± 3.85 (50) | 45.32 ± 4.36 (50) | 58.42 ± 3.27 (150)
GS3FS | 59.98 ± 2.85 (150) | 72.31 ± 1.41 (250) | 63.38 ± 2.52 (150) | 50.23 ± 1.84 (250) | 52.86 ± 1.70 (250) | 47.65 ± 2.60 (250) | 59.47 ± 1.61 (250)
S2LFS | 58.42 ± 3.90 (250) | 70.19 ± 3.15 (250) | 61.58 ± 3.61 (250) | 46.40 ± 5.23 (450) | 49.41 ± 4.74 (450) | 42.64 ± 6.35 (250) | 59.72 ± 2.47 (450)
AGLRM | 59.85 ± 4.34 (150) | 72.17 ± 2.59 (150) | 63.17 ± 4.12 (150) | 50.03 ± 4.46 (150) | 52.71 ± 4.15 (150) | 47.16 ± 5.17 (150) | 60.05 ± 3.18 (300)
ASLCGLFS | 60.02 ± 3.59 (50) | 71.52 ± 1.67 (50) | 62.95 ± 3.35 (50) | 50.05 ± 2.56 (50) | 52.67 ± 2.34 (50) | 48.11 ± 3.81 (100) | 58.55 ± 1.90 (50)
SFS-AGGL | 61.88 ± 3.70 (350) | 73.30 ± 1.78 (500) | 64.67 ± 3.62 (350) | 52.37 ± 1.80 (500) | 54.85 ± 1.69 (500) | 50.36 ± 3.16 (200) | 62.16 ± 2.28 (500)
Table 12. The best clustering results of different methods on Libras Movement dataset.
Method | ACC | NMI | Purity | ARI | F-Score | Precision | Recall
RLSR | 47.50 ± 2.21 (40) | 60.07 ± 2.13 (56) | 50.00 ± 1.85 (56) | 30.04 ± 2.88 (56) | 34.82 ± 2.69 (56) | 31.38 ± 2.56 (56) | 39.22 ± 3.70 (56)
FDEFS | 46.33 ± 3.22 (32) | 60.36 ± 2.88 (32) | 50.22 ± 2.60 (24) | 30.73 ± 3.77 (72) | 35.58 ± 3.41 (72) | 31.60 ± 3.65 (32) | 41.28 ± 3.95 (72)
GS3FS | 46.72 ± 3.24 (80) | 60.57 ± 1.97 (56) | 50.94 ± 2.24 (80) | 31.20 ± 2.68 (56) | 36.01 ± 2.53 (56) | 31.79 ± 2.32 (80) | 41.93 ± 3.79 (56)
S2LFS | 46.56 ± 1.92 (80) | 59.95 ± 1.12 (64) | 50.72 ± 1.34 (64) | 30.28 ± 1.80 (80) | 35.13 ± 1.82 (80) | 30.97 ± 1.17 (80) | 40.82 ± 4.28 (80)
AGLRM | 46.00 ± 2.89 (56) | 60.35 ± 1.32 (56) | 50.72 ± 1.87 (72) | 30.80 ± 1.92 (56) | 35.62 ± 1.88 (56) | 31.42 ± 1.45 (56) | 41.33 ± 4.05 (56)
ASLCGLFS | 46.28 ± 3.41 (40) | 59.72 ± 2.30 (40) | 50.17 ± 2.33 (40) | 29.93 ± 2.97 (40) | 34.84 ± 2.77 (40) | 30.60 ± 2.68 (40) | 41.04 ± 4.67 (80)
SFS-AGGL | 49.22 ± 2.88 (72) | 62.33 ± 2.34 (72) | 53.11 ± 2.70 (72) | 33.04 ± 2.65 (56) | 37.75 ± 2.46 (56) | 33.57 ± 3.13 (72) | 44.42 ± 4.83 (80)
Table 13. The best clustering results of different methods on Landsat dataset.
Method | ACC | NMI | Purity | ARI | F-Score | Precision | Recall
RLSR | 48.30 ± 2.10 (6) | 45.97 ± 1.02 (18) | 50.59 ± 1.88 (18) | 33.94 ± 1.46 (27) | 47.53 ± 1.91 (27) | 38.53 ± 1.49 (3) | 63.72 ± 8.11 (27)
FDEFS | 47.89 ± 2.65 (30) | 45.68 ± 1.59 (30) | 50.60 ± 2.30 (30) | 33.49 ± 1.83 (30) | 47.10 ± 1.99 (24) | 38.12 ± 1.25 (30) | 63.03 ± 6.57 (18)
GS3FS | 49.10 ± 1.88 (15) | 46.34 ± 1.05 (30) | 51.37 ± 1.82 (21) | 34.02 ± 1.33 (21) | 47.69 ± 1.56 (21) | 38.23 ± 1.33 (30) | 64.44 ± 6.57 (21)
S2LFS | 47.81 ± 2.63 (15) | 45.99 ± 1.27 (30) | 49.86 ± 2.53 (15) | 34.03 ± 0.82 (15) | 47.46 ± 1.24 (15) | 38.83 ± 1.60 (30) | 62.37 ± 7.02 (15)
AGLRM | 49.06 ± 3.14 (9) | 45.83 ± 1.40 (15) | 50.97 ± 3.01 (9) | 34.03 ± 1.62 (27) | 47.23 ± 2.13 (27) | 39.39 ± 1.48 (15) | 60.78 ± 8.67 (27)
ASLCGLFS | 48.79 ± 1.78 (30) | 46.51 ± 1.41 (18) | 50.59 ± 2.03 (24) | 34.77 ± 0.83 (21) | 48.33 ± 0.99 (21) | 38.67 ± 1.17 (30) | 65.35 ± 4.89 (21)
SFS-AGGL | 51.02 ± 1.99 (12) | 47.21 ± 1.18 (18) | 52.81 ± 1.76 (12) | 35.49 ± 1.23 (27) | 49.04 ± 1.23 (18) | 40.17 ± 1.32 (30) | 69.26 ± 4.51 (15)
The numbers in parentheses denote the feature dimensions that yield the optimal results.
Table 14. Runtime (s) of different methods on different datasets.
Method | AR | Extended YaleB | CMU PIE | ORL | COIL20
RLSR | 28.3706 | 28.8286 | 27.0407 | 27.0545 | 23.2237
FDEFS | 154.3282 | 190.8570 | 210.9357 | 55.6699 | 71.4754
GS3FS | 41.5550 | 44.3966 | 49.7174 | 28.2427 | 27.0868
S2LFS | 52.7492 | 52.1703 | 54.4472 | 50.3727 | 47.9846
AGLRM | 11.7974 | 13.1744 | 15.9342 | 3.4292 | 4.5080
ASLCGLFS | 2690.2222 | 2477.4907 | 3735.1528 | 126.7054 | 340.7762
SFS-AGGL | 15.5277 | 13.7843 | 17.4384 | 5.5204 | 6.7449
SFS-AGGL (GPU) | 10.2069 | 9.3128 | 11.0843 | 3.3713 | 3.9999