Article

A Multi-View Integrated Ensemble for the Background Discrimination of Semi-Supervised Semantic Segmentation

1 Department of Applied Statistics, Konkuk University, Seoul 05029, Republic of Korea
2 AI Analytics Team, Mustree, Seoul 05029, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(24), 13255; https://doi.org/10.3390/app132413255
Submission received: 24 October 2023 / Revised: 15 November 2023 / Accepted: 20 November 2023 / Published: 14 December 2023

Abstract

The key to semi-supervised semantic segmentation is to assign the appropriate pseudo-label to the pixels of unlabeled images. Recently, various approaches to consistency-based training and the filtering of reliable pseudo-labels have shown remarkable results. Nonetheless, there are still issues to be addressed. We find that recent approaches have specific problems in common. In pseudo-labels for training unlabeled images, we confirm that false foreground class pseudo-labels are mostly caused by background class confusion, not confusion between different foreground classes. To solve this problem, we propose a foreground and background discrimination model for semi-supervised semantic segmentation. Our proposed model is trained using a novel approach called multi-view integrated ensemble (MVIE) via output perturbation. Experimental results in various partition protocols show that our approach outperforms the existing state of the art (SOTA) in binary prediction on unlabeled data, and the segmentation model trained with the help of our model outperforms existing models.

1. Introduction

Semantic segmentation, which aims to achieve pixel-level classification in images, is a fundamental task in computer vision. It is widely applied in real-world applications, including robotics, AR, VR, autonomous driving, disease diagnosis, and more. With the rise of supervised deep neural networks such as convolutional neural networks (CNNs), semantic segmentation has shown impressive performance [1,2,3,4]. However, it still requires a large-scale labeled dataset for training [5,6]. When building a large dataset for training, tasks such as image classification require one label per image, whereas tasks such as segmentation require pixel-wise labeling, which leads to high costs. To alleviate this problem, there have been studies on semi-supervised semantic segmentation that effectively utilize small amounts of labeled data and large amounts of unlabeled data. Recently, semi-supervised semantic segmentation studies have further demonstrated the possibility of utilizing unlabeled data by showing results close to the performance of fully supervised semantic segmentation [7,8,9,10].
A common solution for semi-supervised semantic segmentation is to make predictions for unlabeled data using a model trained on labeled data, and then use those predictions as pseudo-labels to enhance the model through training [11]. Since these pseudo-labels are model predictions, they contain prediction errors, which can have a negative effect on model training [12,13]. A typical solution is to filter pseudo-labels by their predicted confidence scores, keeping only reliable, high-confidence pseudo-labels and discarding unreliable, low-confidence ones [14]. Current state-of-the-art (SOTA) semi-supervised semantic segmentation models are based on consistency regularization [9,15,16]. One example independently applied weak and strong perturbations to unlabeled images. Then, weakly augmented unlabeled image predictions with high confidence scores were assigned as pseudo-labels to strongly augmented unlabeled images, forcing consistency of output [14]. Perturbations can be applied in a variety of ways. Input perturbations can be applied using methods such as CutOut [17] and CutMix [18], as well as classical image augmentation methods such as color jitter. Feature perturbations can also be applied by injecting noise into the feature space. In particular, network perturbations are applied by encouraging consistency in the predictions of multiple models trained from different initializations. This approach showed better results than input or feature perturbation approaches [15]. Consistency regularization is based on a smoothness assumption, which means that if two input points are close in the input space, the corresponding two labels must be the same [19]. This improves generalization performance by encouraging the model to make stable predictions even with small perturbations [8]. However, this assumes perfect prediction for unlabeled data, which is difficult to achieve in practice, and this assumption includes the implication that perturbations do not push image features to the wrong side of the true decision boundary [19]. To reduce the risk of these assumptions, many previous studies have attempted to obtain high-quality pseudo-labels based on confidence scores. Although significant results have been achieved through various pseudo-label-based approaches for unlabeled data, there has not been much exploration of what specific problems pseudo-labels have. From this perspective, we found a problem that recent SOTA models have in common: despite various approaches, pseudo-labels based on predictions for unlabeled data share common error characteristics. Figure 1a–c shows the normalized confusion matrices of pseudo-labels in semi-supervised semantic segmentation SOTA methods [7,8,9]. In a normalized confusion matrix, which is a square matrix, the main diagonal entries represent correct predictions, and the off-diagonal entries represent the incorrect predictions for each class. As can be seen from all three of these graphs, false pseudo-labels for most foreground classes are confused mainly with the background class (class 0), not with other foreground classes. We believe that this result is due to a lack of information and predictive certainty in semi-supervised semantic segmentation, considering that all areas except foregrounds of interest should be assigned as background.
That is, unlike each foreground class, whose instances may share common image information, the background class may have a complex and diverse image structure, which can cause further confusion. In many real-world applications, this problem is even more important because the goal is to classify every area other than a few foregrounds of interest as background, and good predictions on unseen data are required. To alleviate this problem, we propose a new approach called the Multi-View Integrated Ensemble (MVIE), which can better distinguish between the background and the foreground in semi-supervised semantic segmentation. MVIE is based on a novel ensemble approach built on output perturbation and is described in detail in Section 3.
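To make the error analysis behind Figure 1 concrete, the following is a minimal sketch (our illustration, not code from the cited works) of how such a normalized confusion matrix and the share of foreground errors absorbed by the background class can be computed from pseudo-labels; the class count of 21 and the ignore index 255 follow the PASCAL VOC 2012 convention.

```python
import numpy as np

def normalized_confusion(pseudo, gt, num_classes=21, ignore_index=255):
    """Row-normalized confusion matrix between ground-truth labels and pseudo-labels.

    pseudo, gt: integer arrays of the same shape containing class ids.
    Rows index ground-truth classes, columns index pseudo-label classes.
    """
    valid = gt != ignore_index
    idx = gt[valid].astype(np.int64) * num_classes + pseudo[valid].astype(np.int64)
    cm = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)

def background_error_share(cm):
    """Fraction of all false foreground pseudo-labels that fall into the background class (class 0)."""
    fg = np.arange(1, cm.shape[0])
    errors = cm[fg].copy()
    errors[np.arange(len(fg)), fg] = 0  # remove the correct (diagonal) entries
    return errors[:, 0].sum() / np.maximum(errors.sum(), 1e-12)
```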
We evaluated the proposed MVIE under various training settings in PASCAL VOC 2012 [20], where many different kinds of objects are assigned background areas. Our experimental results on unlabeled data show that our proposed approach has better background or foreground discrimination capabilities than recent SOTA semi-supervised semantic segmentation models. Furthermore, we used our MVIE model to first determine pseudo-labels as background or foreground, and then experimented with semi-supervised semantic segmentation SOTA models using pseudo-labels under that discrimination. In this way, all experimental results combining our MVIE model and recent SOTA models show better performance than those found using single SOTA models.
Specifically, our contributions include the following:
  • We find a common problem with pseudo-labels in semi-supervised semantic segmentation SOTA models: false foreground pseudo-labels arise mainly from confusion with the background class rather than from confusion among the other foreground classes.
  • To alleviate the above problem, we propose a novel ensemble approach based on output perturbation. Our method outperforms existing SOTA models in background and foreground classification performance on unlabeled data.
  • When training an existing SOTA model with the help of our model, although the computational cost of training increases, the inference process for practical use incurs the same cost as each existing model. Therefore, from a practical usage perspective, further improved performance can be achieved without increasing computational costs.

2. Related Work

In this section, we review semi-supervised learning approaches and related work on semi-supervised semantic segmentation.

2.1. Semi-Supervised Learning

The goal of semi-supervised learning is to improve model accuracy by accurately and effectively learning not only from labeled data but also from unlabeled data. Two representative approaches to this problem are entropy minimization and consistency regularization [14].

2.1.1. Entropy Minimization

Entropy minimization aims to minimize the predictive uncertainty of a model on unlabeled data. The entropy of predictions, calculated as the negative sum of the product of predictive probabilities and log predictive probabilities, is often used as a measure of uncertainty in model predictions. Minimizing entropy encourages the model to make low-entropy predictions for unlabeled data. Recently, self-training [21,22,23], a more intuitive and effective framework for entropy minimization, has shown its effectiveness by assigning pseudo-labels to unlabeled data and then retraining the model on them in combination with labeled data. Within this training method, the quality of the pseudo-labels is an important factor in the effectiveness of entropy minimization. For this reason, many recent studies have used predictive probabilities as indicators to select more accurate pseudo-labels for training [7,12,14,24].
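As a concrete illustration of these two ingredients, the sketch below computes the per-pixel prediction entropy and applies confidence-based pseudo-label filtering in PyTorch; the 0.95 threshold and the ignore index 255 are illustrative assumptions rather than values prescribed by the cited works.

```python
import torch
import torch.nn.functional as F

def filter_pseudo_labels(logits, conf_threshold=0.95, ignore_index=255):
    """Confidence-based pseudo-label selection for unlabeled pixels.

    logits: (B, C, H, W) raw model outputs on unlabeled images.
    Returns pseudo-labels with low-confidence pixels set to ignore_index,
    together with the per-pixel prediction entropy H = -sum_c p_c * log p_c.
    """
    probs = F.softmax(logits, dim=1)
    confidence, pseudo = probs.max(dim=1)                        # (B, H, W)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    pseudo[confidence < conf_threshold] = ignore_index
    return pseudo, entropy
```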

2.1.2. Consistency Regularization

Consistency regularization encourages consistency in the outputs from inputs that have been perturbed in various ways, allowing decision boundaries to be placed in low-density regions. There are various ways to perturb inputs; FixMatch [14] encourages prediction consistency by perturbing inputs with data augmentation. CutMix [18] perturbs the input by replacing part of an image with part of another image and is common in many high-performance approaches.
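The following is a simplified sketch of a CutMix-style input perturbation for segmentation, where the image and its label map are mixed with the same box; note that the original CutMix samples the box area from a Beta distribution, so the fixed-size box here is a simplifying assumption of ours.

```python
import torch

def cutmix_pair(img_a, lbl_a, img_b, lbl_b, box_ratio=0.5):
    """Paste a random rectangle from image/label B into image/label A.

    img_*: (C, H, W) tensors; lbl_*: (H, W) integer label maps.
    box_ratio controls the relative side length of the pasted rectangle.
    """
    _, h, w = img_a.shape
    bh, bw = int(h * box_ratio), int(w * box_ratio)
    top = torch.randint(0, h - bh + 1, (1,)).item()
    left = torch.randint(0, w - bw + 1, (1,)).item()
    img, lbl = img_a.clone(), lbl_a.clone()
    img[:, top:top + bh, left:left + bw] = img_b[:, top:top + bh, left:left + bw]
    lbl[top:top + bh, left:left + bw] = lbl_b[top:top + bh, left:left + bw]
    return img, lbl
```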
FixMatch uses an approach to supervise unlabeled data with strong perturbations using predictions for data with weak perturbations. Through this method, which essentially utilizes both entropy minimization and consistency regularization, the possibility of success of semi-supervised learning was experimentally demonstrated.

2.2. Semi-Supervised Salient Object Detection

Salient object detection (SOD) is a crucial computer vision task aimed at precisely identifying and segmenting distinctive regions within an image, using methods that closely mimic the way humans perceive visually unique information. This task is partly related to ensuring good performance in background and foreground segmentation, which is a key aspect of our research. In recent years, SOD has attracted attention because salient image regions can be applied to modern computer vision tasks such as object recognition, visual tracking, image segmentation, etc. In light of this attention, state-of-the-art fully supervised SOD models have achieved remarkable performance, relying on a large amount of pixel-wise labeled data. However, obtaining such a fully labeled dataset is expensive and time-consuming. Therefore, recent developments have focused on semi-supervised SOD models to overcome the lack of labeled data and challenges in SOD such as object size variety, object invisibility, and cluttered backgrounds. LFCS [25] employs semi-supervised learning to distinguish unlabeled regions by leveraging a substantial amount of unlabeled data alongside labeled data to enhance classifier performance. LFCS utilizes linear feedback control theory as a mathematical foundation for formulating semi-supervised classifiers. EBM [26] is a latent variable model for semi-supervised SOD, conceptualized as a problem of learning pseudo-label confidence. Also, EBM incorporates a non-Gaussian prior distribution through an energy-based model for the latent variable. The exploration of an informative latent space enhances confidence estimation accuracy, facilitating the effective utilization of unlabeled training data. ASOD [27] is an active learning framework for semi-supervised SOD, designed to optimize network performance with minimal annotation costs. ASOD introduces adversarial learning and unsupervised feature representation through a Variational Autoencoder (VAE) to identify discriminative and representative samples for addition to the labeled pool.

2.3. Supervised Semantic Segmentation

HRNet [28] connects high-to-low convolution streams in parallel, ensuring the maintenance of high-resolution representations throughout the process. It achieves reliable high-resolution representations with strong position sensitivity by iteratively fusing representations from multi-resolution streams. This enables HRNet to serve as a stronger backbone and achieve superior results on a wide range of visual recognition problems, including semantic segmentation. Wang et al. [29] introduced a supervised, pixel-wise contrastive learning approach for semantic segmentation, transitioning from the current image-wise training strategy to an inter-image, pixel-to-pixel paradigm. This design enables access to more representative data samples, and facilitates the exploration of structural relations between pixels and semantic-level segments, emphasizing proximity in the embedding space for pixels and segments of the same class. Zhou et al. [30] introduced a novel approach to semantic segmentation by abstracting each class through a set of prototypes that effectively capture class-wise characteristics and intra-class variance. The interpretability of the model is enhanced as the prediction for each pixel is intuitively understood to reference its closest class center in the embedding space.

2.4. Semi-Supervised Semantic Segmentation

Recent studies on semi-supervised semantic segmentation have shown excellent results that are close to the performance of fully supervised semantic segmentation models by applying consistency-based methodologies in various ways. CCT [15] enforces agreement between the results of applying various kinds of feature perturbations and the results of unperturbed features. CPS [9] uses a network consistency method by forcing consistency on the outputs of two models starting from different initializations. PS-MT [8] uses a confidence-weighted cross-entropy loss, which multiplies the cross-entropy loss by the segmentation prediction confidence when calculating the unsupervised loss on unlabeled data. PS-MT additionally enforces consistency for predictions using network perturbation by two teacher models, input perturbation using CutMix with weak and strong augmentation, and feature perturbation using virtual adversarial training (VAT). U²PL [7] uses reliable pseudo-labels by filtering based on the probability distribution entropy of all pixels. U²PL additionally makes use of most pixels by pushing unreliable pseudo-labels into a queue of negative samples and using them in a contrastive loss, which serves as an unsupervised loss. UniMatch [31] revisits FixMatch, a semi-supervised image classification study, for semi-supervised semantic segmentation research. Interestingly, when the FixMatch study with its simple pipeline is converted to a semi-supervised semantic segmentation scenario, it shows competitive results compared to SOTA studies. However, since FixMatch relies heavily on manually designed strong augmentations, UniMatch proposes a Unified Dual-Stream Perturbations approach to mitigate this issue. As a result, the method experimentally reports improved performance by expanding the perturbation space. S4MC [32] proposes a novel confidence refinement scheme to improve the quality of pseudo-labels for semantic segmentation. Unlike common solutions that do not use pseudo-labels for low-confidence predictions, S4MC leverages the spatial correlation of labels in segmentation maps by grouping adjacent pixels and considering their pseudo-labels collectively. Through this, S4MC maintains the quality of pseudo-labels while simultaneously increasing the number of pseudo-labels used during training. Several SOTA studies have shown semi-supervised semantic segmentation results that can be used in practice by using various consistency and pseudo-label filtering methods.
However, as discussed in Section 1, technical issues such as overfitting and the reliability of pseudo-labels have been examined so far, but little has been written about the specific error characteristics of pseudo-labels. We found that mispredicted pseudo-labels in recent SOTA studies rarely stem from confusion between different foreground classes, but mostly from confusion between each foreground class and the background class. Therefore, we propose an improved background and foreground binary segmentation model for semi-supervised semantic segmentation. Our approach uses consistency training based on input perturbations and a new output perturbation.

3. Proposed Method

In this section, we mathematically describe our problem, architecture, and training process. Section 3.1 first gives an overview of the proposed method. The proposed method consists of multiple multi-view teacher networks and student networks. Our strategy for reliable pseudo-label filtering, which is achieved by applying a new ensemble technique called MVIE on the predictions of multi-view teachers, is described in Section 3.2, along with the model architecture of multi-view teachers. Finally, in Section 3.3, pseudo-labels of students generated through the ensemble of all teachers are introduced along with the student model architecture.

3.1. Overview

Semi-supervised semantic segmentation aims to efficiently utilize the information of both unlabeled data and labeled data. Therefore, we have a small amount of labeled data $\mathcal{D}_L = \{(x_i^l, y_i^l)\}_{i=1}^{|\mathcal{D}_L|}$ and a large amount of unlabeled data $\mathcal{D}_U = \{x_i^u\}_{i=1}^{|\mathcal{D}_U|}$. Specifically, $x_i^l, x_i^u \in \mathbb{R}^{H \times W \times 3}$ and $y_i^l \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the height and width of the image and $C$ is the number of classes. This dataset is used to train the student and all teachers. Additionally, $y_i^l$ is converted and redefined for the student and the teachers.
Figure 2 shows the transformation of the semantic ground truth used to train all our models when the number of classes is 6. First, we construct $C-1$ teacher networks and 1 student network for our approach. The student and teachers output $\hat{y}^s \in \{0,1\}^{H \times W}$ and $\hat{y}^t \in \{0,1,2\}^{H \times W}$, respectively. Therefore, the labels for the student and teachers have 2 classes and 3 classes, respectively. That is, we re-define the labels for the teachers and the student from the semantic ground truth using Equations (1) and (2) below, respectively:
$$
y_i^{l(n)} =
\begin{cases}
0, & \text{if } y_i^l = 0 \\
1, & \text{if } y_i^l = n \\
2, & \text{otherwise,}
\end{cases}
\tag{1}
$$

$$
y_i^{l(s)} =
\begin{cases}
0, & \text{if } y_i^l = 0 \\
1, & \text{otherwise,}
\end{cases}
\tag{2}
$$
where $y_i^l$ is the $i$-th semantic ground truth, and $y_i^{l(n)}$ and $y_i^{l(s)}$ are the re-defined ground truths for the $n$-th teacher and the student, respectively. The student is trained based on input perturbation, and the teachers are trained based on the new output perturbation.
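A minimal PyTorch sketch of this re-labeling is shown below; the handling of a possible ignore index (255, as used for boundary pixels in PASCAL VOC) is our assumption and is not part of Equations (1) and (2).

```python
import torch

def teacher_label(y, n, ignore_index=255):
    """Re-map a semantic ground-truth mask y (H, W) for the n-th teacher, Eq. (1):
    background stays 0, the teacher's own class n becomes 1, and every other
    foreground class becomes 2."""
    out = torch.full_like(y, 2)
    out[y == 0] = 0
    out[y == n] = 1
    out[y == ignore_index] = ignore_index  # assumption: keep ignored pixels ignored
    return out

def student_label(y, ignore_index=255):
    """Re-map a semantic ground-truth mask for the student, Eq. (2):
    0 for background, 1 for any foreground class."""
    out = (y != 0).long()
    out[y == ignore_index] = ignore_index  # assumption: keep ignored pixels ignored
    return out
```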
Figure 3 shows an overview of MVIE. In the MVIE architecture, each model named teacher consists of a CNN-based encoder $h$ and a decoder $g$ with a segmentation head. The teacher model is decomposed into an encoder $h_{\theta_h^t}: \mathcal{X} \rightarrow \mathcal{Z}$ and a decoder $g_{\theta_g^t}: \mathcal{Z} \rightarrow \mathcal{Y}$, where $\mathcal{Z} \subset \mathbb{R}^{Z}$ represents the feature space of dimension $Z$. There are as many teacher models as the number of classes minus one, and they all have the same structure. Hence, the teachers are denoted $f_{\theta^t} = \{f_{\theta^t}^{(i)}\}_{i=1}^{C-1}$, where $C$ is the number of classes. The student consists of one model with only a decoder $g_{\theta_g^s}: \mathcal{Z} \rightarrow \mathcal{Y}$, where the features in $\mathcal{Z} \subset \mathbb{R}^{Z}$, encoded by a fixed teacher, form a $Z$-dimensional feature space. Therefore, one of the teacher models, named the fixed teacher, is responsible for encoding the input and passing the features to the student. All teacher models and the student model have different initial weights. For all labeled images, the goal of the student and all teachers is to minimize the standard cross-entropy losses in Equations (4) and (7). For unlabeled images, each teacher model receives an image with weak augmentation and outputs $\hat{y}^t \in \{0,1,2\}^{H \times W}$. After that, each teacher obtains pseudo-labels with the new ensemble method using the predictions of all other teachers and computes the teacher's unsupervised loss in Equation (6). This part is introduced in detail in Section 3.2. The student decodes the features encoded by the fixed teacher from the image with strong augmentation and outputs $\hat{y}^s \in \{0,1\}^{H \times W}$, and the result of applying a hard voting ensemble to the predictions of all teachers is used as pseudo-labels to calculate the student's unsupervised loss in Equation (9). This part is introduced in detail in Section 3.3.
The optimization target for our student and teachers is to minimize the overall losses, which can be formulated as follows:

$$
\mathcal{L}^T = \mathcal{L}_{sup}^t + \lambda_t \mathcal{L}_{unsup}^t, \qquad
\mathcal{L}^S = \mathcal{L}_{sup}^s + \lambda_s \mathcal{L}_{unsup}^s,
\tag{3}
$$

where $\mathcal{L}^T$ and $\mathcal{L}^S$ are the teacher and student overall losses, $\mathcal{L}_{sup}^t$ and $\mathcal{L}_{sup}^s$ are the teacher and student supervised segmentation losses calculated from labeled data, and $\mathcal{L}_{unsup}^t$ and $\mathcal{L}_{unsup}^s$ are the teacher and student unsupervised segmentation losses calculated from unlabeled data, respectively. $\lambda_t$ and $\lambda_s$ are the weights of the teachers' and the student's unsupervised segmentation losses, respectively. In summary, the semantic segmentation labels are converted into different teacher labels for each teacher using Equation (1), and then the teacher's supervised segmentation loss $\mathcal{L}_{sup}^t$ is calculated using the standard cross-entropy loss function. The unsupervised segmentation loss for each teacher, $\mathcal{L}_{unsup}^t$, is calculated using the standard cross-entropy loss function based on each teacher's pseudo-labels generated using our new MVIE method. The student's supervised segmentation loss $\mathcal{L}_{sup}^s$ is calculated using the standard cross-entropy loss function based on the student labels, which are converted from the semantic segmentation labels using Equation (2). Finally, the student's unsupervised segmentation loss $\mathcal{L}_{unsup}^s$ is calculated using the standard cross-entropy loss function based on the student's pseudo-labels generated via a hard voting ensemble of all teachers.

3.2. Teacher Model Using a Multi-View Integrated Ensemble to Generate Pseudo-Labels

For the teachers' training on unlabeled data, we propose MVIE, a novel ensemble technique for reliable pseudo-label filtering. MVIE applies a new output perturbation method. This output perturbation redefines the semantic classes into three classes, and the meaning of the three classes differs for each teacher. We call a teacher with this new output perturbation a multi-view teacher.
Figure 4 shows an example in which the multi-view teacher integrated ensemble is applied to the first teacher when the number of classes is 6. As the number of classes is 6, there are 5 teachers ($C-1$), and the table entry for each teacher's class indicates which actual semantic classes it covers; i.e., the first teacher's class 0 corresponds to the actual semantic class 0, class 1 corresponds to semantic class 1 (the same as the teacher number), and class 2 covers semantic classes 2 to 5, i.e., all other semantic classes.
When we consider the pseudo-labels of a specific teacher, class 0 is defined as an overlapping area in all other teachers’ class 0 predictions, class 1 is an overlapping area in all other teachers’ class 2 predictions, and class 2 is a non-overlapping area in all other teachers’ class 1 predictions.
As introduced in Section 3.1, there are as many teacher models with an encoder–decoder structure as the number of classes minus one, i.e., teacher-1, teacher-2, teacher-3, …, teacher-($C-1$). Also, each teacher outputs $\hat{y}^t \in \{0,1,2\}^{H \times W}$. A value of 0 in each teacher's predicted output represents the background class of the semantic ground truth (generally class 0), 1 represents the actual semantic ground truth class corresponding to the teacher's number, and 2 is assigned to all remaining semantic ground truth classes. This means that, in the PASCAL VOC 2012 dataset with ground truth semantic classes ranging from 0 to 20, if we consider the case of teacher-3, the predicted output 0 indicates the background class 0 in the actual PASCAL VOC 2012 ground truth, 1 indicates the ground truth semantic class 3, and 2 indicates all classes except 0 and 3, i.e., 1, 2, 4, 5, 6, ..., 20.
In the teacher's overall loss $\mathcal{L}^T$ introduced in Equation (3), the first loss is the supervised segmentation loss $\mathcal{L}_{sup}^t$, defined based on the cross-entropy (CE) loss as follows:

$$
\mathcal{L}_{sup}^t = \frac{1}{|\mathcal{D}_L|} \sum_{(x_i^l,\, y_i^{l(n)}) \in \mathcal{D}_L} \ell_{ce}\big(g_{\theta_g^t} \circ h_{\theta_h^t}(x_i^l),\; y_i^{l(n)}\big),
\tag{4}
$$

where $\ell_{ce}$ is the cross-entropy loss function, and $x_i^l$ and $y_i^{l(n)}$ represent the $i$-th labeled image and the corresponding $n$-th teacher's label, respectively. $h_{\theta_h^t}$ and $g_{\theta_g^t}$ represent the teacher's encoder and decoder, respectively, and $g \circ h$ is the composition of $h$ and $g$. The second term in Equation (3) is the unsupervised segmentation loss $\mathcal{L}_{unsup}^t$ for the pseudo-labels made using the new ensemble technique MVIE. We use MVIE to filter out only reliable pixel-level pseudo-labels and ignore unreliable ones. Therefore, unreliable pseudo-labels are not subject to supervision, and we define the pseudo-labels of the $n$-th teacher made via MVIE for the $i$-th unlabeled image at pixel $j$ as follows:

$$
\hat{y}_{ij}^{(n)} =
\begin{cases}
0, & \text{if } \hat{y}_{ij}^{t(k)} = 0, \ \forall k \neq n, \ k \in \{1, 2, \ldots, C-1\}, \\
1, & \text{if } \hat{y}_{ij}^{t(k)} = 2, \ \forall k \neq n, \ k \in \{1, 2, \ldots, C-1\}, \\
2, & \text{if } \sum_{k=1,\, k \neq n}^{C-1} \hat{y}_{ij}^{t(k)} = 1, \\
\text{ignore}, & \text{otherwise,}
\end{cases}
\tag{5}
$$

where $C$ represents the number of classes. The unsupervised segmentation loss $\mathcal{L}_{unsup}^t$ is defined as

$$
\mathcal{L}_{unsup}^t = \frac{1}{|\mathcal{D}_U|} \sum_{x_i^u \in \mathcal{D}_U} \ell_{ce}\big(g_{\theta_g^t} \circ h_{\theta_h^t}(\mathcal{A}_w(x_{ij}^u)),\; \hat{y}_{ij}^{(n)}\big),
\tag{6}
$$

where $x_{ij}^u$ and $\hat{y}_{ij}^{(n)}$ represent the $i$-th unlabeled image and the corresponding pseudo-labels at pixel $j$, respectively, and $\mathcal{A}_w(\cdot)$ represents a weak augmentation function, such as image flipping, cropping, or scaling.
Finally, each teacher is trained with a loss function that is the weighted sum of Equation (4), $\mathcal{L}_{sup}^t$, based on ground truth labels, and Equation (6), $\mathcal{L}_{unsup}^t$, based on pseudo-labels made through MVIE in Equation (5).
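The per-pixel rule in Equation (5) can be vectorized as in the sketch below (our illustration of the rule, not the authors' released code); the inputs are assumed to be the argmax predictions of every teacher except teacher n, and 255 is an assumed ignore index.

```python
import torch

IGNORE_INDEX = 255  # assumed ignore value for pixels excluded from supervision

def mvie_pseudo_label(other_preds):
    """Pseudo-labels for teacher n from the predictions of all other teachers, Eq. (5).

    other_preds: (C-2, H, W) tensor holding the argmax predictions (values in
    {0, 1, 2}) of every teacher except teacher n.
    """
    all_background = (other_preds == 0).all(dim=0)   # every other teacher predicts 0
    all_other_class = (other_preds == 2).all(dim=0)  # every other teacher predicts 2
    single_vote = other_preds.sum(dim=0) == 1        # exactly one teacher predicts 1, the rest 0

    pseudo = torch.full(other_preds.shape[1:], IGNORE_INDEX, dtype=torch.long)
    pseudo[single_vote] = 2
    pseudo[all_other_class] = 1
    pseudo[all_background] = 0                       # assigned last so earlier cases in Eq. (5) take priority
    return pseudo
```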

3.3. Student Model That Outputs a Binary Using an Ensemble of Multi-View Teachers

As introduced in Section 3.1, the student model consists of one model and has only a decoder structure. Also, the student outputs the binary output $\hat{y}^s \in \{0,1\}^{H \times W}$.
In the student's training on unlabeled data, strong augmentation is applied as the input perturbation. Weak augmentations, such as image flipping, cropping, and scaling, are applied for the teacher models, while strong augmentations, such as Gaussian blur, randomized grayscale, and color jitter, are applied for the student model. Among the teachers, one randomly selected teacher is designated as the fixed teacher, and this teacher passes the features obtained by encoding the input image to the student. In the overall loss $\mathcal{L}^S$ of the student introduced in Equation (3), the first loss, the supervised segmentation loss $\mathcal{L}_{sup}^s$, is defined as
$$
\mathcal{L}_{sup}^s = \frac{1}{|\mathcal{D}_L|} \sum_{(x_i^l,\, y_i^{l(s)}) \in \mathcal{D}_L} \ell_{ce}\big(g_{\theta_g^s} \circ h_{\theta_h^{t(fixed)}}(x_i^l),\; y_i^{l(s)}\big),
\tag{7}
$$

where $x_i^l$ and $y_i^{l(s)}$ represent the $i$-th labeled image and the corresponding student's label, respectively, and $h_{\theta_h^{t(fixed)}}$ and $g_{\theta_g^s}$ represent the fixed teacher's encoder and the student's decoder, respectively.
The student's pseudo-labels for the $i$-th unlabeled image at pixel $j$, based on the hard voting ensemble over all teachers, are defined as follows:

$$
\hat{y}_{ij}^{(s)} = \mathbb{1}\Big( f_{hv}\big(f_{\theta^t}^{(1)}(x_{ij}^u),\, f_{\theta^t}^{(2)}(x_{ij}^u),\, \ldots,\, f_{\theta^t}^{(C-1)}(x_{ij}^u)\big) > 0 \Big),
\tag{8}
$$

where $\mathbb{1}(\cdot)$ is the indicator function, $f_{hv}$ is the hard voting ensemble function, and $x_{ij}^u$ is the $i$-th unlabeled image at pixel $j$.
The second loss of Equation (3), the unsupervised segmentation loss $\mathcal{L}_{unsup}^s$, is defined as

$$
\mathcal{L}_{unsup}^s = \frac{1}{|\mathcal{D}_U|} \sum_{x_i^u \in \mathcal{D}_U} \ell_{ce}\big(g_{\theta_g^s} \circ h_{\theta_h^{t(fixed)}}(\mathcal{A}_s(x_{ij}^u)),\; \hat{y}_{ij}^{(s)}\big),
\tag{9}
$$

where $\mathcal{A}_s(\cdot)$ represents a strong augmentation function.
Finally, the student is trained with the weighted sum of Equation (7), $\mathcal{L}_{sup}^s$, based on ground truth labels, and Equation (9), $\mathcal{L}_{unsup}^s$, based on the student's pseudo-labels $\hat{y}_{ij}^{(s)}$ made by the hard voting ensemble of all teachers.
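A possible realization of the hard-voting function $f_{hv}$ in Equation (8) is sketched below; how ties between teacher votes are broken is not specified in the text, so torch.mode's default behavior here is an assumption.

```python
import torch

def student_pseudo_label(teacher_preds):
    """Binary pseudo-labels for the student via a hard-voting ensemble of all teachers, Eq. (8).

    teacher_preds: (C-1, H, W) tensor with each teacher's argmax prediction in {0, 1, 2}.
    The per-pixel majority class is the vote; a pixel is labeled foreground (1)
    whenever the vote is non-zero.
    """
    votes = torch.mode(teacher_preds, dim=0).values  # per-pixel majority vote
    return (votes > 0).long()
```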

4. Experiments

This section introduces the evaluation metrics, data, model, and implementation details used in the proposed method. We experimented under different partition protocol settings to investigate the efficiency and effectiveness of the proposed method, and investigated its performance on the unlabeled dataset as well as on the validation dataset.

4.1. Dataset

The standard datasets for semi-supervised segmentation are PASCAL VOC 2012 and Cityscapes. However, while PASCAL VOC 2012 contains multiple object classes and a background class, Cityscapes has no background class and consists of object classes only. Because our goal is to build a model that discriminates well between background and foreground, we adopted the PASCAL VOC 2012 dataset, which has a background class. This standard semantic segmentation benchmark consists of 20 semantic object classes and 1 background class. The training and validation sets consist of 1464 and 1449 images, respectively. Following the general practice of previous studies [7,8,9], we used the 10,582 augmented labeled images from [33] as additional data. For the semi-supervised setting, we subsampled this full training set at ratios of 1/8, 1/4, and 1/2 to serve as labeled data and used the remainder as unlabeled data.

4.2. Implementation Details

For a fair comparison with previous work, we adopted DeepLabV3+ [34] as the segmentation model, using ResNet [35] as the backbone network. The ResNet backbone uses ImageNet [36] pre-trained weights as initial weights, and the weights of the segmentation heads are randomly initialized. The experimental results are listed in Table 1 and Table 2, which contain all the results of our re-implementations using ResNet-50 as the backbone network. During the experiments on our model and all the re-implementations, each mini-batch consisted of eight labeled images and eight unlabeled images due to hardware limitations in our environment, and the number of training epochs was set to 80. To train our model with Sync-BN [37], we used the stochastic gradient descent (SGD) optimizer and set the initial learning rate to 0.0025, the momentum to 0.9, and the weight decay to 0.0001. In addition, we adopted a poly learning rate policy, in which the initial learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{0.9}$ at each iteration. The crop size of the images was 512 × 512, with multi-scale data augmentation applied by randomly selecting scales from {0.5, 0.75, 1.0, 1.25, 1.5, 1.75}. Additionally, considering that most SOTA studies have demonstrated performance improvements using CutMix, we apply the CutMix-after-prediction scheme used in [8] to our proposed method.
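For reference, a minimal sketch of the poly learning-rate policy described above; the max_iter of 10,000 in the example comments is illustrative and not a value reported in the paper.

```python
def poly_lr(initial_lr, cur_iter, max_iter, power=0.9):
    """Poly learning-rate policy: scale the initial LR by (1 - iter / max_iter) ** power."""
    return initial_lr * (1.0 - cur_iter / max_iter) ** power

# Example with the initial learning rate of 0.0025 used above:
# poly_lr(0.0025, cur_iter=0, max_iter=10000)     -> 0.0025
# poly_lr(0.0025, cur_iter=5000, max_iter=10000)  -> roughly 0.00134
```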
We re-implemented previous SOTA methods in our environment. For fairness, the batch size and number of training epochs were set to 8 and 80, respectively, the same settings as in our model training, and all other settings were kept the same as in each original study. All of our experiments were implemented using the PyTorch v1.12.0 framework [38] on servers with two NVIDIA GeForce RTX 3090 GPUs.

4.3. Evaluation Metrics

Following previous research [7,8,9], we report mean Intersection-over-Union (mIoU) as an evaluation metric based on single-scale inference for all evaluations. In addition, we further check the accuracy and F1-score along with Intersection-over-Union (IoU) to assess pixel-level performance on unlabeled data.
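For clarity, the sketch below shows how per-class IoU and mIoU are typically computed; the ignore index 255 follows the PASCAL VOC convention, and the two-class call in the comment corresponds to the background/foreground evaluation used in this paper.

```python
import numpy as np

def iou_per_class(pred, gt, num_classes, ignore_index=255):
    """Per-class Intersection-over-Union; mIoU is the mean over classes."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(intersection / union if union > 0 else float("nan"))
    return np.array(ious)

# Background/foreground evaluation on unlabeled data:
# miou = np.nanmean(iou_per_class(binary_pred, binary_gt, num_classes=2))
```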

4.4. Comparison to SOTA on Different Partition Protocols

To verify the background and foreground discrimination performance of pseudo-labels on unlabeled data, we temporarily used the real labels of the unlabeled data from PASCAL VOC 2012. Table 1 compares our proposed method with the results of converting the predictions of SOTA models into binary predictions on the unlabeled data of PASCAL VOC 2012. Table 1 reports the accuracy (Acc), macro-F1 score (F1), background IoU (BG IoU), foreground IoU (FG IoU), and mean IoU (mIoU), showing that our method is experimentally superior on most indicators. In particular, under the 1/8 protocol, the experiment with the smallest labeled ratio, BG IoU increased somewhat while FG IoU increased significantly at the same time. This can be interpreted as the result of using more accurate foreground pseudo-labels in an environment with little real label information. In conclusion, under the 1/8, 1/4, and 1/2 partition protocols, our method outperforms the best previous studies' mIoU scores for background and foreground classification by +1.84%, +0.36%, and +0.003%, respectively. Table 2 shows the mIoU scores on the PASCAL VOC 2012 validation set, comparing our re-implementation results of existing SOTA models with the cases trained using the object areas proposed by our model when generating the pseudo-labels for each SOTA model. In all the experimental results in Table 2, SOTA models combined with our model show better performance than the existing models. This is because our model, which has better background classification performance, helps improve the quality of the pseudo-labels used in training the SOTA models.

5. Qualitative Results

Figure 5 and Figure 6 visualize the segmentation results of some images from the unlabeled dataset and the validation dataset of PASCAL VOC 2012, respectively. In column (f) of Figure 5, we can see that our model, trained on more accurate pseudo-labels, produces cleaner background and foreground classification results than existing SOTA approaches. Furthermore, we can observe better background discrimination capabilities on the validation dataset as well.

6. Conclusions

We discovered a specific confusion problem of pseudo-labels that most SOTA studies have in common: in pseudo-labels for foreground objects, mispredictions mostly stem from confusion between each foreground class and the background rather than from confusion with other foreground classes. This implies that a high-performance model can be obtained if this confusion problem alone is alleviated in semi-supervised semantic segmentation. To alleviate this problem, we propose a background and foreground discrimination model using MVIE, which is based on a new output perturbation and a new ensemble method. We experimentally demonstrated the effectiveness of the proposed method. The numerical results show that under the 1/8, 1/4, and 1/2 partition protocols, the mIoU scores for background and foreground outperform the existing best model by +1.84%, +0.36%, and +0.003%, respectively. Moreover, the performance of existing SOTA models trained with the help of our model is significantly improved over that of the corresponding single SOTA models. Because our approach consists of multiple networks, training takes longer and incurs some additional computational cost, but the inference time remains similar to that of the other methods. Therefore, in a future study, we will consider an efficient approach that alleviates the background confusion problem without the long training time.

Author Contributions

Conceptualization, H.G. and S.K.; Formal analysis, H.G.; Funding acquisition, S.Y. and S.K.; Investigation, H.G.; Methodology, H.G.; Project administration, H.G. and S.K.; Resources, H.G.; Software, H.G. and C.K.; Supervision, H.G. and S.K.; Validation, H.G.; Visualization, H.G.; Writing—original draft, H.G. and Y.L.; Writing—review & editing, H.G., Y.J. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by Konkuk University Researcher fund in 2023, and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2021R1A4A5032622 and NRF-RS-2023-00240936).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the datasets used in this manuscript are publicly available datasets (PASCAL public dataset accessed on 6 February 2023: http://host.robots.ox.ac.uk/pascal/VOC/, already in the public domain).

Conflicts of Interest

Author Yongho Jeong was employed by the company Mustree. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  2. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  3. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
  4. Yuan, Y.; Chen, X.; Chen, X.; Wang, J. Segmentation transformer: Object-contextual representations for semantic segmentation. arXiv 2019, arXiv:1909.11065.
  5. Cho, J.; Lee, K.; Shin, E.; Choy, G.; Do, S. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv 2015, arXiv:1511.06348.
  6. Chen, X.W.; Lin, X. Big data deep learning: Challenges and perspectives. IEEE Access 2014, 2, 514–525.
  7. Wang, Y.; Wang, H.; Shen, Y.; Fei, J.; Li, W.; Jin, G.; Wu, L.; Zhao, R.; Le, X. Semi-supervised semantic segmentation using unreliable pseudo-labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4248–4257.
  8. Liu, Y.; Tian, Y.; Chen, Y.; Liu, F.; Belagiannis, V.; Carneiro, G. Perturbed and strict mean teachers for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4258–4267.
  9. Chen, X.; Yuan, Y.; Zeng, G.; Wang, J. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2613–2622.
  10. Fan, S.; Zhu, F.; Feng, Z.; Lv, Y.; Song, M.; Wang, F.Y. Conservative-progressive collaborative learning for semi-supervised semantic segmentation. IEEE Trans. Image Process. 2023, 32, 6183–6194.
  11. Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the Workshop on Challenges in Representation Learning, ICML, Atlanta, GA, USA, 16–21 June 2013; Volume 3, p. 896.
  12. Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N.E.; McGuinness, K. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–8.
  13. Pham, H.; Dai, Z.; Xie, Q.; Le, Q.V. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11557–11568.
  14. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.L. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608.
  15. Ouali, Y.; Hudelot, C.; Tami, M. Semi-supervised semantic segmentation with cross-consistency training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12674–12684.
  16. Tu, P.; Huang, Y.; Ji, R.; Zheng, F.; Shao, L. Guidedmix-net: Learning to improve pseudo masks using labeled images as reference. arXiv 2021, arXiv:2106.15064.
  17. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552.
  18. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032.
  19. Van Engelen, J.E.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. 2020, 109, 373–440.
  20. Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
  21. Zoph, B.; Ghiasi, G.; Lin, T.Y.; Cui, Y.; Liu, H.; Cubuk, E.D.; Le, Q. Rethinking pre-training and self-training. Adv. Neural Inf. Process. Syst. 2020, 33, 3833–3845.
  22. Hung, W.C.; Tsai, Y.H.; Liou, Y.T.; Lin, Y.Y.; Yang, M.H. Adversarial learning for semi-supervised semantic segmentation. arXiv 2018, arXiv:1802.07934.
  23. Mittal, S.; Tatarchenko, M.; Brox, T. Semi-supervised semantic segmentation with high-and low-level consistency. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1369–1379.
  24. Kumar, T.; Park, J.; Ali, M.S.; Uddin, A.S.; Ko, J.H.; Bae, S.H. Binary-classifiers-enabled filters for semi-supervised learning. IEEE Access 2021, 9, 167663–167673.
  25. Zhou, Y.; Huo, S.; Xiang, W.; Hou, C.; Kung, S.Y. Semi-supervised salient object detection using a linear feedback control system model. IEEE Trans. Cybern. 2018, 49, 1173–1185.
  26. Liu, J.; Zhang, J.; Barnes, N. Semi-supervised salient object detection with effective confidence estimation. arXiv 2021, arXiv:2112.14019.
  27. Lv, Y.; Liu, B.; Zhang, J.; Dai, Y.; Li, A.; Zhang, T. Semi-supervised active salient object detection. Pattern Recognit. 2022, 123, 108364.
  28. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
  29. Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; Van Gool, L. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7303–7313.
  30. Zhou, T.; Wang, W.; Konukoglu, E.; Van Gool, L. Rethinking semantic segmentation: A prototype view. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2582–2593.
  31. Yang, L.; Qi, L.; Feng, L.; Zhang, W.; Shi, Y. Revisiting weak-to-strong consistency in semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7236–7246.
  32. Kimhi, M.; Kimhi, S.; Zheltonozhskii, E.; Litany, O.; Baskin, C. Semi-supervised semantic segmentation via marginal contextual information. arXiv 2023, arXiv:2308.13900.
  33. Hariharan, B.; Arbeláez, P.; Bourdev, L.; Maji, S.; Malik, J. Semantic contours from inverse detectors. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 991–998.
  34. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062.
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
  36. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
  37. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 448–456.
  38. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037.
Figure 1. (a–c) show normalized confusion matrices for unlabeled data in CPS, PS-MT, and U²PL, respectively [7,8,9]. (d) shows the mean Intersection-over-Union (mIoU) score for CPS, PS-MT, U²PL, and our proposed method, MVIE.
Figure 2. Example of the transformation of the semantic ground truth to train all our models when the number of classes is 6.
Figure 3. Overview of our proposed MVIE method. MVIE consists of teacher networks with encoder–decoder structures and a student network with only a decoder structure. There are C-1 teachers (the number of semantic classes minus one) and a single student. The fixed teacher is responsible for encoding the input image and then passing the features to the student as input. Labeled data are input to all teachers and the student and are used to calculate the supervised loss based on cross-entropy. Unlabeled data with weak augmentation are input to all teachers to create predictions, and each teacher's pseudo-labels for its unsupervised loss are made by applying the new integrated ensemble approach over all other multi-view teachers' predictions. The unlabeled data with strong augmentation are additionally input to the fixed teacher and encoded, and the encoded features are input to the student to generate the student's predictions for the unlabeled data. The result of applying a hard-voting ensemble over all teacher predictions becomes the pseudo-label for the student's unlabeled data, which is used to calculate the student's unsupervised loss.
Figure 4. An example in which a multi-view teacher integrated ensemble is applied to the first teacher when the number of classes is 6.
Figure 5. Example of qualitative results from the PASCAL VOC 2012 unlabeled dataset. (a) Input image, (b) binary ground truth, (c) CPS, (d) PS-MT, (e) U²PL, (f) ours. All approaches use DeepLabv3+ as the segmentation network with ResNet-50, and the red rectangles highlight false prediction results.
Figure 6. Example of qualitative results from the PASCAL VOC 2012 validation dataset. (a) Input image, (b) binary ground truth, (c) CPS, (d) PS-MT, (e) U²PL, (f) ours. All approaches use DeepLabv3+ as the segmentation network with ResNet-50, and the red rectangles highlight false prediction results.
Table 1. Binary performance comparison with SOTA for unlabeled data of PASCAL VOC 2012 under different partition protocols. Predictions from state-of-the-art models are converted to binary predictions, and all methods are based on DeepLabv3+. * represents our re-implementation.

1/8 partition:
Method   | Acc   | F1    | BG IoU | FG IoU | mIoU
CPS *    | 0.917 | 0.902 | 88.822 | 75.810 | 82.316
PS-MT *  | 0.919 | 0.906 | 88.943 | 76.452 | 82.697
U²PL *   | 0.901 | 0.884 | 86.617 | 72.268 | 79.443
Ours     | 0.927 | 0.915 | 89.924 | 79.160 | 84.542

1/4 partition:
Method   | Acc   | F1    | BG IoU | FG IoU | mIoU
CPS *    | 0.916 | 0.900 | 88.593 | 75.422 | 82.007
PS-MT *  | 0.930 | 0.917 | 90.326 | 79.429 | 84.877
U²PL *   | 0.917 | 0.904 | 88.668 | 76.631 | 82.649
Ours     | 0.930 | 0.919 | 90.337 | 80.136 | 85.237

1/2 partition:
Method   | Acc   | F1    | BG IoU | FG IoU | mIoU
CPS *    | 0.917 | 0.900 | 88.839 | 75.196 | 82.018
PS-MT *  | 0.931 | 0.919 | 90.465 | 79.751 | 85.108
U²PL *   | 0.925 | 0.912 | 89.746 | 78.259 | 84.002
Ours     | 0.930 | 0.920 | 90.273 | 79.949 | 85.111
Table 2. Comparison of SOTA and models trained with the help of our model on the PASCAL VOC 2012 val set under different partitioning protocols (mIoU). All methods are based on DeepLabv3+, and * represents our re-implementation.

Method            | 1/8    | 1/4    | 1/2
CPS *             | 74.537 | 75.557 | 75.786
PS-MT *           | 74.072 | 74.391 | 75.803
U²PL *            | 73.004 | 74.808 | 76.004
CPS (w/Ours) *    | 74.849 | 76.112 | 76.153
PS-MT (w/Ours) *  | 74.158 | 75.702 | 75.839
U²PL (w/Ours) *   | 74.945 | 76.742 | 76.908