Article

Dual-Branch Multi-Scale Relation Networks with Tutorial Learning for Few-Shot Learning

1 College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
2 School of Artificial Intelligence, Chongqing University of Technology, Chongqing 401135, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(4), 1599; https://doi.org/10.3390/app14041599
Submission received: 11 January 2024 / Revised: 2 February 2024 / Accepted: 13 February 2024 / Published: 17 February 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Few-shot learning refers to training a model with only a few labeled samples so that it can effectively recognize unseen categories. Recently, numerous approaches have been suggested to improve the extraction of abundant feature information at hierarchical layers or multiple scales for similarity metrics, especially methods based on learnable relation networks, which have demonstrated promising results. However, the roles played by image features in relationship measurement vary at different layers, and effectively integrating features from different layers and multiple scales can improve the measurement capacity of the model. In light of this, we propose a novel method called dual-branch multi-scale relation networks with tutoring learning (DbMRNT) for few-shot learning. Specifically, we first generate deep multi-scale features using a multi-scale feature generator in Branch 1 while extracting features at hierarchical layers in Branch 2. Then, learnable relation networks are employed in both branches to measure the pairwise similarity of features at each scale or layer. Furthermore, to leverage the dominant role of deep features in the final classification, we introduce a tutoring learning module that enables Branch 1 to tutor the learning process of Branch 2. Ultimately, the relation scores of all scales and layers are integrated to obtain the classification results. Extensive experiments on popular few-shot learning datasets demonstrate that our method outperforms other similar methods.

1. Introduction

Due to the availability of large amounts of labeled data, deep learning has yielded impressive results in the field of computer vision [1,2,3,4]. However, training deep learning models often relies on a significant amount of labeled data; when faced with limited data and sparse samples, a model is prone to overfitting, resulting in poor generalization ability. This situation arises in practical problems, such as image data of rare species and medical images. In contrast, humans possess powerful learning and cognitive abilities that allow them to establish knowledge of new concepts from only a few examples, or even a single one [5]. Therefore, inspired by human cognitive abilities, determining how to enable machines to establish effective cognition and generalize to new categories through training with only a few samples is one of the significant research issues in the field of deep learning, known as few-shot learning.
Few-shot learning refers to the process of training a model with limited available examples, enabling the model to generalize to new categories and establish robust cognition, simulating the human process of connecting prior knowledge with novel concepts. In the context of few-shot classification problems, a classification model is trained using limited data and then tasked with classifying new categories that were not encountered during training, thus necessitating strong generalization capabilities. To address such challenges, numerous exemplary approaches have been proposed, which can be roughly categorized into three groups: meta-learning-based, data-augmentation-based, and metric-learning-based methods.
Approaches based on meta-learning train a meta-learner across tasks, focusing on training at the task level rather than the sample level; the objective is to endow the meta-learner with task-agnostic knowledge so that its parameters can adapt to novel tasks. Data-augmentation-based methods, on the other hand, attempt to mitigate the problem of limited sample data by generating new data from the existing samples, utilizing additional information, or enhancing sample representations. Meanwhile, metric-learning-based methods focus on constructing an embedding space in which robust features are extracted and an informative similarity metric is learned: samples of the same class receive a higher similarity score, while samples of different classes receive a lower one. These methods have achieved remarkable success, among which the metric-learning-based methods are the simplest and most effective. The methods in [6,7,8] first map images into a feature space and then use predefined distance functions, such as Euclidean and cosine distance, to compute the pairwise similarity between query and support images for classification. Sung et al. [9] proposed a relation network to measure the similarity between samples, which learns a suitable distance metric instead of relying on a fixed one. However, the relation network is known to be sensitive to spatial location, which may result in the misalignment of the compared objects within a certain region. To alleviate this issue, Wu et al. [10] adopted deformable convolution combined with a dual correlation attention mechanism, and Abdelaziz et al. [11] proposed a Kronecker-product module that maps and fuses positionwise correlations between the query and support feature pairs. However, these methods still fail to fully exploit the limited data and ignore a wealth of valuable information, such as the image features at different layers of deep learning models. These features contain distinct image information that is useful for classification: deeper layers can capture more comprehensive semantic information about the object [12], while shallow layers learn texture features such as lines and contours [13], which also contribute to enhancing recognition capabilities.
Many methods have been proposed to explore feature information at different layers to better utilize limited samples. Zhang et al. [14] learned a similarity metric of feature pairs at hierarchical layers using a relation network with an attention mechanism and propagated the relations to the next layer. Jiang et al. [15] extracted image features at different layers in the network, fused them with the transposed convolution upsampling method, and then measured the similarity to obtain the final classification results. Wang et al. [16] measured features at multiple layers and introduced a voting mechanism to determine the final classification results. Wu et al. [17] applied relation networks to measure the coarse-grained classification of shallow features and then measured the fine-grained classification of deep features. Similarly, some methods alleviate the problem of data scarcity by capturing multi-scale feature information. Zhang et al. [18] extracted features from original images at different scales and transformed them into second-order features to learn a similarity metric. Chen et al. [19] performed multi-scale transformations on image features to obtain multiple scales of features and learn task-relevant feature representations at each scale, and a similarity-to-class module was employed to achieve classification. Wang et al. [20] devised a multi-scale label propagation network and obtained the final label propagation scores and classification results through a weighted calculation method. Abdelaziz et al. [11] resized the original image into different scales and then trained a shared relation network with a Kronecker-product module for the similarity metric.
Utilizing multi-layer and multi-scale features can relieve the pressure of limited data in few-shot learning [14,15,16,17,18,19,20]; deep features play a dominant role in the final classification [12,16], but the contribution of shallow features cannot be ignored [13,17]. Therefore, we propose a method named dual-branch multi-scale relation networks with tutoring learning for few-shot learning, which effectively utilizes feature information at different layers and fully leverages the dominant role of deep feature information in classification decisions. Specifically, we learn a similarity metric between feature pairs at each deep scale in Branch 1 and at each layer in Branch 2. Recognizing the potential of the metric learning capability of the relation network, we adopt relation networks for similarity measurement in both branches [9,11], allowing the network to learn a suitable metric instead of being limited to a predefined distance metric. Furthermore, simply using separate relation networks to learn the similarity of feature pairs at different layers, without considering the differences in feature information between layers, overlooks the dominant role of deep features. Therefore, we propose using the relation network in Branch 1, which has learned more advanced feature pair similarities, to tutor the relation networks in Branch 2 in learning the similarities of shallower feature pairs, resulting in improved classification performance. The contributions of this work can be summarized as follows:
We propose a dual-branch multi-scale metric learning method that combines deep multi-scale features with features at different layers and employs relation networks for similarity metrics.
A tutoring learning module is proposed, which involves utilizing a deep relation network in Branch 1 that has acquired advanced knowledge to tutor each shallow relation network in Branch 2 to learn similarity metrics, thereby enhancing the metric capability of relation networks.
The final classification performance obtained by our dual-branch multi-scale relation metric and tutoring learning outperforms similar methods on few-shot standard datasets.

2. Related Work

2.1. Meta-Learning-Based Methods

This type of method follows the meta-learning paradigm, training a meta-learner at the task level so that it can quickly adapt to novel tasks and acquire optimal parameters, an approach also known as learning to learn [21]. Finn et al. [22] searched for an appropriate initialization of the model parameters that can be effectively fine-tuned with a few gradient descent update steps. Munkhdalai et al. [23] employed a recurrent neural network to iterate over samples and gather the knowledge needed in both the base learner and the meta-learner for a given problem. Ravi et al. [24] trained an LSTM-based optimizer to achieve more efficient fine-tuning. In contrast, Oreshkin et al. [25] incorporated task-specific knowledge into the feature encoder by scaling and translating the image features and employed auxiliary task cooperation training to acquire a task-dependent metric space while alleviating the complexity of network training. Similarly, Cai et al. [26] utilized memory slots to forecast the parameters of their feature encoder for classifying unlabeled images. Ren et al. [27] used a recurrent neural network based on an attention mechanism to achieve dynamic comparison between samples, whereas Mishra et al. [28] employed a temporal convolutional network with a soft attention mechanism to encode acquired knowledge into a memory module, subsequently utilizing the memory module for targeted information retrieval and classification.
Compared to such methods, the proposed DbMRNT method does not involve designing intricate meta-learners or utilizing external memory modules but achieves superior performance through straightforward end-to-end training.

2.2. Data Augmentation-Based Methods

Methods belonging to this category aim to alleviate the problem of limited samples by obtaining more available samples or information. Antoniou et al. [29] employed generative adversarial networks to generate new samples. Zhang et al. [30] utilized generative adversarial networks to provide additional training signals for classifiers, clarifying the decision boundary. Hariharan et al. [31] employed the concept of generalizing intraclass sample variations to different categories and utilized generative models for data augmentation. Wang et al. [32] created new samples by incorporating noise into original images and treated the generative and classification models as a whole for simultaneous updates. Zhang et al. [33] introduced a saliency-guided hallucination method to generate new mixed background-foreground samples for additional training. Chen et al. [34] first employed a meta-learner and a deformation network to achieve image deformation and expand the dataset; they then proposed utilizing semantic information to expand the data, employing an encoder to map the feature space onto the semantic space for data augmentation and subsequently leveraging a decoder to map the expanded semantic information back into the feature space [35]. Xing et al. [36] proposed an adaptive modal mixing mechanism to integrate image feature information and semantic information, thereby enhancing the performance of the model. Meanwhile, Schwartz et al. [37] introduced a weighted fusion approach to combine multiple semantic cues and visual prototypes, generating the ultimate fused prototype used for similarity measurement and classification.
In contrast to data-augmentation-based methods, our method refrains from expanding the dataset or leveraging additional information such as natural label information or image description details. Instead, we aim to fully exploit the limited samples to the best extent possible.

2.3. Metric-Learning-Based Methods

This type of method typically evaluates the pairwise similarity of features between support and query samples in a feature embedding space, wherein high similarity indicates samples belonging to the same category, while low similarity suggests different categories. Common similarity measures encompass fixed metrics, such as Euclidean distance and cosine distance, or a learnable network. Koch et al. [6] introduced deep neural networks into few-shot learning, learning the features of a pair of samples through weight-shared convolutional networks and measuring similarity using Euclidean distance. Vinyals et al. [8] encoded the support set and query set using different LSTM networks, measured feature similarity with an attention-weighted metric function, and proposed a classic training mechanism for few-shot learning scenarios named the episode mechanism. Snell et al. [7] proposed that a class prototype exists in each class's feature space, obtained the prototype representation by calculating the mean of same-class samples, and performed classification by measuring distances between query samples and class prototypes. Sung et al. [9] suggested replacing traditional predefined nonparametric measurement methods with a learnable neural network, allowing the network to autonomously learn an appropriate similarity metric. Differing from previous approaches that rely on image-level features for similarity calculation [6,7,8], Li et al. [38] proposed the concept of local descriptors to replace image-level feature metrics, classifying by measuring the k-nearest neighbors of the local descriptors between the query sample and the categories. Xue et al. [39] proposed assigning higher weights to regions with greater similarity between images by calculating feature similarities individually for each region in comparison to the query image and subsequently combining scores across all regions for classification decisions. Zhang et al. [40] transformed the feature representation into a second-order feature representation to facilitate similarity metrics and then devised a metric strategy for multi-scale relation networks on this foundation [18]. To address the issue of relation networks being sensitive to the spatial positions of target objects, Wu et al. [10] incorporated deformable convolution into the feature extractor and combined it with a dual attention mechanism. Additionally, Xue et al. [41] introduced relative position and map networks based on an attention mechanism to determine the significance of each spatial location in the image features during comparison and designed a nonlinear metric using a relative map network module. Refs. [14,15,16] extract features at different layers to measure similarity, demonstrating that feature maps at hierarchical layers possess distinct classification characteristics and contributions, while refs. [11,18,19,20,42] employ multi-scale feature extraction techniques to capture a broader range of feature information and boost classification accuracy. There also exist numerous methods that enhance model performance by integrating attention mechanisms [43,44,45,46,47,48,49] or by exploring intra-class and inter-class relations [50,51,52,53,54], all falling under the category of metric-learning-based methods.
Our method falls under the category of metric-learning-based methods and shares similarities with the multi-layer methods [14,15,16] and multi-scale methods [18,19,20,42], against which we evaluate it alongside several baseline methodologies. The proposed method adopts a dual-branch structure, maximizing the utilization of feature information from different layers and scales, and employs learnable relation networks instead of a fixed metric function for similarity measurement. The main difference lies in the tutoring learning approach proposed in this paper, which enables the relation networks to better learn a similarity metric and effectively improves classification accuracy.

3. Methodology

3.1. Problem Definition

Due to the scarcity of labeled data, traditional image classification training methods are not suitable for few-shot learning tasks. To address this issue, a training mechanism known as episode [8] is widely adopted and recognized as an effective approach for solving few-shot learning problems. This training mechanism facilitates learning by constructing numerous meta-tasks. Typically, the dataset is divided into training, testing, and validation sets, with the crucial consideration that the categories in the training set are mutually exclusive from those in the testing and validation sets. As shown in Figure 1, training tasks are formulated from the training set, while test tasks are constructed from the testing set. Each task comprises a support set S and a query set Q. During training, C categories are randomly drawn from the training set, and K samples are sampled from each category to create the support set. Subsequently, M samples are sampled from the remaining samples of each of the selected C categories to create the query set, thereby constructing a C-way K-shot classification task. During the inference stage, we carry out the same sampling as in the training stage; the only difference is that the query set samples in the inference stage are unlabeled. One meta-task is learned in each training iteration, referred to as an episode, which comprises the support set $S = \{(x_s, y_s)\}_{s=1}^{n}$ and the query set $Q = \{(x_q, y_q)\}_{q=1}^{m}$, where $n = C \times K$ and $m = C \times M$.
The few-shot classification task aims to classify the samples in the query set into their respective categories by learning these meta-tasks, and during the testing phase, a C-way K-shot testing task is constructed from the testing set, leveraging the knowledge acquired during the training phase for classification.
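For concreteness, the following minimal sketch shows how one such episode could be sampled; the container name images_by_class and the tuple-list return format are illustrative assumptions, not part of the paper.

```python
import random

def sample_episode(images_by_class, c_way=5, k_shot=1, m_query=15):
    """Build one C-way K-shot episode: a labeled support set and a query set.

    images_by_class: hypothetical dict mapping class name -> list of images;
    any per-class index would work the same way.
    """
    classes = random.sample(sorted(images_by_class), c_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = random.sample(images_by_class[cls], k_shot + m_query)
        support += [(img, episode_label) for img in picks[:k_shot]]
        query += [(img, episode_label) for img in picks[k_shot:]]
    return support, query  # |support| = C*K, |query| = C*M
```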

3.2. Model Overview

We propose a novel end-to-end metric learning model named dual-branch multi-scale relation networks with tutoring learning to address the challenge of few-shot classification problems. Our method is based on the concept of multiple layers and multi-scale analysis while incorporating relation network metrics, and the model framework is illustrated in Figure 2.
As shown in Figure 2, our model has a dual-branch structure composed of four main modules: a feature extractor $F_\theta$, a multi-scale feature generator $G_\phi$, a relation network module $R_\varphi$, and a tutoring learning module $T$. The samples are fed into the feature extractor to obtain their respective feature maps. Subsequently, these feature maps serve as input to the multi-scale feature generator, which produces feature maps at five different scales. Next, the features of the query image and the support images of each category are concatenated in the channel dimension at each scale to form support–query feature pairs, which are then input into the relation network module. In this branch, the multi-scale feature pairs share a single relation network for the similarity metric, yielding separate relation scores and losses at each scale. The relation score is a C-dimensional vector (five-dimensional in the 5-way setting) indicating the probability that the query image belongs to each category. In Branch 2, we extract two groups of feature maps from the middle and last layers of the feature extractor, as well as an additional group of deeper feature maps from the fifth subbranch of the multi-scale feature generator. Subsequently, these three groups of feature maps are concatenated in the channel dimension to obtain support–query feature pairs, which are then fed into the relation network module in Branch 2. Relation scores and losses at hierarchical layers are obtained by training three weight-unshared relation networks. Additionally, to better exploit the relation networks in measuring shallow feature pair information and contribute to the overall performance, we propose a tutoring learning module $T$. In this module, a knowledge summary of each branch is obtained by aggregating its relation scores; the richer knowledge acquired by the relation network in Branch 1 is then transferred to Branch 2 by constructing soft labels, facilitating the tutoring process from Branch 1 to Branch 2.

3.3. Generation of Multi-Scale and Multi-Layer Features

The support image $x_s^i$ and query image $x_q^j$ are fed into the feature extractor $F_\theta$ to obtain two groups of feature representations at different layers, $\{\Phi_s^d = F_\theta^d(x_s)\}_{d=1}^{2}$ and $\{\Phi_q^d = F_\theta^d(x_q)\}_{d=1}^{2}$, which serve as the feature representations at different depths in Branch 2, while the deeper features $\Phi^2$ are used as input to the multi-scale feature generator.
Generating features at multiple scales helps to obtain multiple unique representations of the original images, mitigating the problem of limited data in few-shot situations. This paper proposes a multi-scale feature generator inspired by [19]. Additionally, to learn deeper features while reducing the size of the feature maps, a 3 × 3 convolutional layer is added before the 2 × 2 max-pooling layer in the fifth subbranch. As shown in Figure 3, the multi-scale feature generator consists of five subbranches: the first subbranch applies the identity mapping; the second contains a 3 × 3 convolutional layer; the third contains a 5 × 5 convolutional layer; the fourth contains a 1 × 7 convolutional layer followed by a 7 × 1 convolutional layer; and the fifth contains a 3 × 3 convolutional layer followed by a 2 × 2 max-pooling layer.
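A minimal PyTorch sketch of such a generator follows; the 64-channel width and size-preserving padding are assumptions, as neither is specified above.

```python
import torch.nn as nn

class MultiScaleGenerator(nn.Module):
    """Sketch of the five-subbranch multi-scale feature generator."""
    def __init__(self, channels=64):  # 64 channels is an assumption
        super().__init__()
        self.subbranches = nn.ModuleList([
            nn.Identity(),                                          # 1: pass-through
            nn.Conv2d(channels, channels, 3, padding=1),            # 2: 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),            # 3: 5x5 conv
            nn.Sequential(                                          # 4: 1x7 then 7x1 conv
                nn.Conv2d(channels, channels, (1, 7), padding=(0, 3)),
                nn.Conv2d(channels, channels, (7, 1), padding=(3, 0))),
            nn.Sequential(                                          # 5: 3x3 conv + 2x2 max-pool
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.MaxPool2d(2)),
        ])

    def forward(self, x):
        # Returns one feature map per scale z = 1..5.
        return [branch(x) for branch in self.subbranches]
```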
With the help of the multi-scale feature generator, we obtain deep multi-scale features for both support and query images, $\{\Phi_s^z = G_\phi^z(F_\theta^2(x_s))\}_{z=1}^{5}$ and $\{\Phi_q^z = G_\phi^z(F_\theta^2(x_q))\}_{z=1}^{5}$, and then feed them into the relation network module in Branch 1. Simultaneously, the outputs of the fifth subbranch of the multi-scale feature generator, $\Phi_s^5 = G_\phi^5(F_\theta^2(x_s))$ and $\Phi_q^5 = G_\phi^5(F_\theta^2(x_q))$, serve as the deepest features $\Phi^3$ in Branch 2, alongside the features $\{\Phi_s^d = F_\theta^d(x_s)\}_{d=1}^{2}$ and $\{\Phi_q^d = F_\theta^d(x_q)\}_{d=1}^{2}$ extracted by the feature extractor at different layers. This combination enables pairwise similarity measurements at hierarchical layers within Branch 2.

3.4. Relation Networks Module

A relation network [9] is employed to supersede the predefined distance metric and learn a suitable similarity metric between support and query feature pairs. After the output of the previous module is obtained, the support feature of each class is concatenated with the query feature in the channel dimension, and the resulting support–query feature pairs are fed into this module to obtain the ultimate relation scores.
The proposed relation network module is designed differently in each branch of the dual-branch structure. In Branch 1, there are five different scales. At each scale, the query feature is spliced with the support features of each category, and the feature pairs are fed into a weight-shared relation network. However, the common relation network [9] is known to be sensitive to the spatial location of the compared objects [10]. We therefore employ an improved relation network [11] in Branch 1, which incorporates a Kronecker-product ($KP$) module [55,56] to acquire positionwise correlation maps between feature pairs for enhanced metric learning and improved generalization capability. Given two original features forming a support–query feature pair at the same scale, the $KP$ module is applied to generate spatial correlation maps $\hat{\beta}(\Phi_s^z, \Phi_q^z) = KP(\Phi_s^z, \Phi_q^z)$ and $\hat{\beta}(\Phi_q^z, \Phi_s^z) = KP(\Phi_q^z, \Phi_s^z)$. Next, the resulting spatial correlation maps are concatenated with the original feature maps at each scale, and we obtain a relation score as follows:
$$r_{s,q}^{z} = R_\varphi\big(Concate(\Phi_s^z,\ \hat{\beta}(\Phi_s^z, \Phi_q^z),\ \hat{\beta}(\Phi_q^z, \Phi_s^z),\ \Phi_q^z)\big), \quad z \in \{1, \dots, 5\}$$
where $z$ denotes the scale, and $Concate(\cdot)$ refers to splicing the features in the channel dimension. Therefore, for a C-way K-shot task, the total loss of Branch 1 is:
$$L_Z = \sum_{z=1}^{5}\sum_{i=1}^{n}\sum_{j=1}^{m} CE\big(r_{i,j}^{z},\ \mathbf{1}(y_i = y_j)\big)$$
where $CE$ denotes the cross-entropy loss, $n = C \times K$ is the number of support images, and $m = C \times M$ is the number of query images in an episode. $r_{i,j}^{z}$ is the relation score between support image $x_s^i$ and query image $x_q^j$, and $\mathbf{1}(\cdot)$ equals 1 when the condition is satisfied and 0 otherwise.
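As an illustration, the sketch below implements one common positionwise formulation of a $KP$ correlation map (an assumption on our part; see [55,56] for the original module): each output channel holds the correlations of one spatial position of the first map with every position of the second.

```python
import torch

def kp_correlation(f_a, f_b):
    """Sketch of a positionwise Kronecker-product correlation map.

    f_a, f_b: (B, C, H, W) support/query feature maps at the same scale.
    Returns a (B, H*W, H, W) map: channel p holds the dot products between
    position p of f_a and every position of f_b.
    """
    b, c, h, w = f_a.shape
    a_flat = f_a.flatten(2)                            # (B, C, HW)
    b_flat = f_b.flatten(2)                            # (B, C, HW)
    corr = torch.bmm(a_flat.transpose(1, 2), b_flat)   # (B, HW, HW) position-pair correlations
    return corr.view(b, h * w, h, w)                   # \hat{beta}(f_a, f_b)

# The Branch 1 relation input per scale z (equation above) is then the
# channel-wise concatenation:
# relation_in = torch.cat([f_s, kp_correlation(f_s, f_q),
#                          kp_correlation(f_q, f_s), f_q], dim=1)
```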
In the relation network module of Branch 2, the inputs consist of two sets of feature representations, $\{\Phi_s^d = F_\theta^d(x_s)\}_{d=1}^{2}$ and $\{\Phi_q^d = F_\theta^d(x_q)\}_{d=1}^{2}$, extracted by the feature extractor at different depths, as well as the feature representations $\Phi_s^5$ and $\Phi_q^5$ output by the fifth subbranch of the multi-scale feature generator. The specific details are illustrated in Figure 4. At each layer, the query image features are concatenated with the support image features of each class and fed into the relation networks for similarity measurement. Because the feature size at the low layers is relatively large, we omit the $KP$ module here and adopt only the relation networks proposed in [9]. Simultaneously, we transfer the relation information from shallow layers to the deeper relation networks and concatenate it with the deeper feature pairs. Ultimately, we obtain relation scores from three distinct layers:
$$r_{i,j}^{d} = R_\varphi^{d}\big(Concate(\Phi_i^d,\ \Phi_{i,j}^{d-1},\ \Phi_j^d)\big), \quad d \in \{1, 2, 3\},\ i \in \{1, \dots, n\},\ j \in \{1, \dots, m\}$$
where $d$ denotes the hierarchy layer, $Concate(\cdot)$ refers to splicing the features in the channel dimension, and $\Phi_{i,j}^{d-1}$ is the relation information from the previous layer; this term is absent for the first layer. Therefore, the total loss of Branch 2 can be formulated as follows:
$$L_D = \sum_{d=1}^{3}\sum_{i=1}^{n}\sum_{j=1}^{m} CE\big(r_{i,j}^{d},\ \mathbf{1}(y_i = y_j)\big)$$
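Since $L_Z$ and $L_D$ share the same form, both reduce, under the standard arrangement in which each query's relation scores over the C episode classes form one row, to a cross-entropy summed over scales or layers; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def branch_loss(scores_per_level, query_labels):
    """Sketch of the per-branch loss (L_Z or L_D above).

    scores_per_level: list of (m, C) relation-score tensors, one per scale/layer.
    query_labels: (m,) long tensor of episode labels in [0, C).
    """
    return sum(F.cross_entropy(scores, query_labels)
               for scores in scores_per_level)
```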

3.5. Tutoring Learning Module

The relation network replaces the conventional measurement method based on predefined functions, enabling the model to learn an appropriate similarity metric. However, in few-shot learning problems with limited examples, every detail becomes crucial. Although employing separate relation networks to measure the similarity of different hierarchical features can enhance model performance, doing so overlooks the distinctions and impacts of these different features. Since the outcomes of deep features have a significant influence on the final classification, it is essential to let the deep feature metric tutor the similarity metric learning of the shallow features. To improve the capability of the relation networks to measure feature pair similarity, we propose a tutoring learning module in which the Branch 1 relation network instructs the Branch 2 relation networks in learning similarity metrics.
The tutoring learning module is inspired by knowledge distillation [57], which can improve the accuracy of network models; combining it with the relation networks can further improve their measurement capabilities. Specifically, we regard the relation network in Branch 1 as a teacher network, because its weight-sharing mechanism allows it to acquire comprehensive deep feature similarity across multiple scales and thus exhibit strong measurement capabilities. Conversely, the relation networks in Branch 2 are regarded as student networks, since they learn less advanced, more basic similarity knowledge and can therefore benefit from the tutoring provided by Branch 1. We first sum the relation scores at each scale in Branch 1, which represents the knowledge learned about sample pair similarity, and then sum the relation scores in Branch 2, which denotes the knowledge learned by Branch 2:
$$r_Z = \sum_{z=1}^{5} r^z, \qquad r_D = \sum_{d=1}^{3} r^d$$
Obviously, the weight-shared relation network in Branch 1 learns more comprehensive knowledge at different scales in deep layers, exhibiting robust recognition capabilities, while the relation networks in Branch 2 learn independently and primarily rely on shallow and fundamental features, serving as a supplementary component for the final classification. Therefore, this paper proposes a tutoring learning module $T$ to facilitate the transfer of knowledge acquired in Branch 1 to Branch 2, thereby enhancing the overall recognition capability. We leverage the learning summary $r_Z$ from Branch 1 to generate a soft label for tutoring the similarity metric in Branch 2, incorporating a hyperparameter $t$ for appropriate regularization. The specific implementation is as follows:
$$L_T = KL\big(softmax(r_Z / t),\ softmax(r_D / t)\big)$$
where $KL$ denotes the Kullback–Leibler divergence, and $softmax(\cdot)$ indicates normalization. The loss function of the tutoring learning module is:
$$L_{KD} = \alpha L_D + (1 - \alpha) L_T$$
where $\alpha$ denotes the weight hyperparameter, which is set to 0.3 in this paper to control the contribution of the two parts of the distillation loss. Therefore, the total loss of our model is composed of the losses of Branch 1, Branch 2, and the tutoring learning module, and can be formulated as follows:
$$L = \min_{\theta, \phi, \varphi} \big(L_Z + L_D + L_{KD}\big)$$
The ultimate classification result is determined by the sum of the relation scores over all scales and layers, as depicted below:
$$r_{final} = r_Z + r_D / 3$$
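To tie the tutoring module together, a minimal sketch of $L_T$, $L_{KD}$, and the final score is given below; detaching the teacher scores from the gradient is an implementation assumption not stated above.

```python
import torch
import torch.nn.functional as F

def tutoring_losses(r_z, r_d, l_d, t=20.0, alpha=0.3):
    """Sketch of the tutoring (distillation) losses.

    r_z: summed Branch 1 relation scores (teacher), shape (m, C)
    r_d: summed Branch 2 relation scores (student), shape (m, C)
    l_d: Branch 2 cross-entropy loss L_D; t and alpha as in the paper.
    """
    soft_labels = F.softmax(r_z.detach() / t, dim=-1)   # teacher soft labels (assumed detached)
    log_student = F.log_softmax(r_d / t, dim=-1)
    l_t = F.kl_div(log_student, soft_labels, reduction="batchmean")  # L_T
    l_kd = alpha * l_d + (1.0 - alpha) * l_t                          # L_KD
    return l_t, l_kd

# Final classification (equation above): the predicted class is the argmax of
# r_final = r_z + r_d / 3
```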

4. Experiments and Results

To assess the efficacy of our approach, we conducted experiments on commonly employed few-shot learning datasets, namely, miniImageNet and tieredImageNet, as well as a fine-grained image classification dataset called Stanford Cars. Subsequently, we compared our approach with other state-of-the-art approaches.

4.1. Datasets

MiniImageNet [8] is a subset of ImageNet [58]. It contains 100 classes with 600 images each, totaling 60,000 color images. Of these, 64 classes are used for training, while 16 and 20 classes are used for validation and testing, respectively.
TieredImageNet [27] comprises 779,165 images distributed across 608 categories (34 high-level categories) that are hierarchically sampled from the 1000 categories of ImageNet. In the training set, there are 351 categories (20 high-level categories), while the validation and test sets consist of 97 categories (6 high-level categories) and 160 categories (8 high-level categories), respectively. Due to its distinct classification characteristics compared to miniImageNet, this dataset poses greater challenges in few-shot learning.
Stanford Cars [59] is a widely used fine-grained dataset. It includes 16,185 color images of 196 categories of cars. Of these, 130 classes are used for training, while 17 and 49 classes are used for validation and testing, respectively.
To ensure a fair comparison, all of our experiments resize images to 84 × 84 pixels, and all the datasets are divided according to the standard protocol: miniImageNet following [8], tieredImageNet following [27], and Stanford Cars following [38]. The splits of the datasets are shown in Table 1.

4.2. Network Architecture

For a fair comparison, we employ the same feature extractor $F_\theta$ as the majority of few-shot learning methods. $F_\theta$ consists of four convolutional layers, each comprising a convolution block with 64 channels, a 3 × 3 convolution kernel, a BatchNorm layer, and a ReLU layer; only the first two layers end with a 2 × 2 max-pooling layer. The input image size is set to 84 × 84 × 3. The multi-scale feature generator comprises five parallel parts, as illustrated in Figure 2. The relation network module adopts the main structure of RN [9], which consists of two convolution blocks and two fully connected layers; each convolution block has 64 channels and utilizes a 3 × 3 convolution kernel, a BatchNorm layer, a ReLU layer, and a 2 × 2 max-pooling layer. The output size of the last convolution block is 64. The first fully connected layer reduces the dimensionality to 8 and is followed by a ReLU layer; the last fully connected layer outputs a one-dimensional classification score. The relation network module in this paper consists of two branches. The relation network in Branch 1 incorporates an adaptive global average pooling layer to resize the input feature map to 10 × 10, and a Kronecker-product module is used to obtain spatial correlation maps of feature pairs. Meanwhile, the input of Branch 2 does not pass through a global average pooling layer; because of the differing feature map sizes, an adaptive global average pooling layer is employed before the fully connected layer.
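A minimal sketch of this four-layer extractor follows (padding of 1 is an assumption; the text above does not specify it); returning the per-block feature maps also exposes the intermediate layers used by Branch 2.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, pool):
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))  # only the first two blocks pool
    return nn.Sequential(*layers)

class Conv4Extractor(nn.Module):
    """Sketch of the four-layer, 64-channel feature extractor F_theta."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            conv_block(3, 64, pool=True),
            conv_block(64, 64, pool=True),
            conv_block(64, 64, pool=False),
            conv_block(64, 64, pool=False),
        ])

    def forward(self, x):                 # x: (B, 3, 84, 84)
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)               # keep per-layer maps for Branch 2
        return feats                      # final map: (B, 64, 21, 21)
```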

4.3. Setup

Our experiments followed the C-way K-shot episodic training mechanism [8] and were conducted in both 5-way 1-shot and 5-way 5-shot settings. In the 5-way 1-shot setting, each training episode consisted of randomly sampled images from five categories, with 1 support sample and 15 query samples per category, for a total of eighty images (5 × 1 + 5 × 15 = 80). In the 5-way 5-shot setting, each training episode included five support images and fifteen query images per category, totaling one hundred images (5 × 5 + 5 × 15 = 100). The Adam optimizer was employed with an initial learning rate of 0.001, halved every 150,000 episodes; a total of 500,000 randomly sampled episodes were used for training. For data augmentation, we applied random resized crops, random color jittering, random horizontal flipping, and random erasing during training, following the strategy described in [11]. All experiments were trained end-to-end without any fine-tuning during the testing phase. During inference, we selected 15 query images per category in each episode for both 5-way 1-shot and 5-shot classification on the test set, the same as in the training stage. The final classification accuracy was calculated as the average over 600 test episodes, with 95% confidence intervals.

4.4. Results

4.4.1. Results of the Main Datasets

We compare our method with current metric-based methods [7,8,9,10,11,19,20,39,40,41,53] as well as several meta-learning-based methods [22,23,24,26,27,28]. All images in our experiments are resized to 84 × 84, and a feature extractor with four convolutional layers is employed.
The experimental results of our model on the main datasets are shown in Table 2 and Table 3, respectively. It can be observed from the tables that the proposed dual-branch multi-scale relation networks with tutoring learning outperform other metric-based methods and similar multi-scale approaches in both the 5-way 1-shot and 5-shot scenarios.
Our proposed method makes fuller use of image information at hierarchical layers, namely the deep multi-scale information in Branch 1 and the feature information at different layers in Branch 2. The rich knowledge learned gives the model robust recognition and classification ability. Compared with MSDN [16], which uses only features at different depths, our method improves by approximately 6% and 5.7% in the 5-way 1-shot and 5-way 5-shot scenarios, respectively, on the miniImageNet dataset. At the same time, compared with MsKPRN [11], which also adopts multi-scale features and relation network metrics, our method improves by 1.8% in 5-way 1-shot and 2.13% in 5-shot. Compared with ABNet [49], which splits the image into significant patches and incorporates an attention mechanism, our method neither heavily processes the original images nor introduces an attention mechanism, yet it improves by 0.7% and 2.17% in the 1-shot and 5-shot settings, respectively.
On the tieredImageNet dataset, our method achieves performance improvements of 2.37% and 6.58% in the 1-shot and 5-shot scenarios, respectively, compared with DCN [14], which employs multi-scale measurements at different depths. In comparison with MRN [54], which also utilizes relation networks for similarity metrics and incorporates external memory units to store information and class relationships, our model demonstrates only a marginal improvement of 0.3% in the 1-shot scenario but improves by 4.8% in the 5-shot scenario. These gains arise because our dual-branch relation network module conducts tutoring learning: the relation network containing more high-dimensional, global information tutors the relation networks containing more low-dimensional, local information, transferring the learned knowledge between relation networks at different layers. This further improves the task-adaptive measurement ability of the relation networks, which is more robust and yields higher classification accuracy than a single measurement method.

4.4.2. Results of the Fine-Grained Dataset

Fine-grained datasets pose greater challenges than standard datasets due to small inter-class differences and large intra-class variations [11]. Our method outperforms MsKPRN, DN4, GNN, Relation Nets, and other methods in both scenarios. Compared to the second-highest-performing method, MsKPRN, our accuracy is higher by 1.66% and 0.82% in the respective scenarios. MsKPRN processes the original image to a size of 84 × 84 and scales it to 64 × 64 and 48 × 48 as inputs to obtain deep multi-scale feature information, improving relation networks for classification tasks; in contrast, our approach employs an improved relation network as the metric network for Branch 1 while using Branch 2 to extract features at different depths from the image. The experimental results in Table 4 show the effectiveness of our dual-branch design and tutoring learning module.

5. Discussion

In this section, we perform ablation experiments on the miniImageNet dataset to objectively analyze the effectiveness of the proposed dual-branch multi-scale structure and the tutoring learning module.

5.1. The Effect of the Dual-Branch Structure

The classification effect of each branch in few-shot scenarios is investigated experimentally, and the results are presented in Table 5. From the table, it can be observed that the combined model of both branches outperforms either branch alone in few-shot classification tasks. In the 5-way 1-shot setting, there is a 0.37% improvement over Branch 1 and a 1.3% improvement over Branch 2. Similarly, in the 5-way 5-shot setting, there are improvements of 0.61% and 3.4% over Branch 1 and Branch 2, respectively. The experiments validate that deep features (Branch 1), which contain richer semantic information, contribute significantly to the final classification outcomes in multi-scale settings, while incorporating shallow information enhances overall performance. Notably, our approach employs a tutoring learning module T in which Branch 1 tutors the learning process of Branch 2 by capturing the similarity between support and query image pairs, yielding improvements in classification accuracy of approximately +0.4% in the 1-shot setting and +0.3% in the 5-shot setting. These experiments demonstrate the efficacy of our dual-branch architecture combined with a tutoring learning module, which also benefits from the adaptable learning of similarity metrics enabled by relation networks; our method harnesses the potential inherent in this type of measurement.

5.2. The Effect of Information Transfer between Layers

In this experiment, we validate, in both 5-way 1-shot and 5-shot scenarios, the classification effectiveness of selecting and processing intermediate features at different layers in Branch 2. The results are presented in Table 6. The baseline model is a dual-branch structure with a tutoring learning module, in which the shallowest layer of Branch 2 is taken from the output of the first layer of the feature extractor. '*' indicates that the second layer of the feature extractor is instead chosen to form the shallowest layer of Branch 2 while delaying its max-pooling layer. Compared to the baseline, selecting features from the second layer improves classification performance by 0.94% and 0.14% in the 5-way 1-shot and 5-shot settings, respectively. These findings indicate that further processing of shallow feature information contributes to classification accuracy and highlight the importance of how shallow features are selected. Additionally, Table 6 reveals that knowledge transfer between the internal layers of the relation network module in Branch 2 also improves accuracy (by approximately +0.41%) in the 5-way 1-shot scenario; however, such knowledge transfer has a negative effect in the more data-rich 5-shot setting (i.e., five labeled examples per class). Further analysis suggests that this occurs because, in such settings, the model learns relatively robust prototypes and higher-level features carrying semantic information exert greater influence on the final classification decision, so the information transferred from shallower layers instead interferes with overall decision-making.

5.3. The Effect of Hyperparameter t

The primary contribution of our approach lies in the tutoring learning module, which significantly enhances classification accuracy. Leveraging the learnability of relation networks, the tutoring learning module employs the relation network in Branch 1 to tutor the relation networks in Branch 2 in learning image pair similarities. Specifically, the knowledge acquired by Branch 1 is used to generate soft labels, which are then transferred to Branch 2. We design a hyperparameter $t$ to control the degree of softening; the experimental results are presented in Table 7. $t = 1$ means no softening, and larger values of $t$ soften the distribution more, shrinking the differences between categories. The results indicate that the optimal classification performance is achieved at $t = 20$ in the 5-way 1-shot scenario, whereas it occurs at $t = 7$ in the 5-shot scenario. This discrepancy can be attributed to the heightened scarcity of samples in the 1-shot case, which leads the model to exploit additional inter-class information, a need that is mitigated in the 5-shot case.

6. Conclusions

In this paper, we propose a novel few-shot learning method for effectively integrating deep multi-scale feature information and feature information from multiple hierarchical layers. The proposed method, DbMRNT, maximizes the utilization of limited data in the few-shot learning problem and enhances the accuracy of classification. The introduced tutoring learning approach capitalizes on the learnability of the relation network, leveraging the deep and comprehensive knowledge acquired by the relation network in Branch 1 to tutor the shallow relation networks in Branch 2 in learning simpler information, thereby further enhancing classification performance in the few-shot learning problem. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed method.
For future work, we intend to explore other backbone networks and utilize larger images, design ways to capture effective target regions in images to better leverage relation networks, and explore intra-class and inter-class information at multi-scale and hierarchical feature information.

Author Contributions

All authors significantly contributed to the research. Conceptualization, C.X. and H.W.; methodology, H.W.; software, C.X. and H.W.; validation, C.X., H.W., Y.Z. and G.L.; formal analysis, H.W.; investigation, H.W.; resources, C.X. and H.W.; data curation, H.W.; writing—original draft preparation, H.W.; writing—review and editing, C.X., H.W., Y.Z., Z.Z. and G.L.; visualization, H.W. and Z.Z.; supervision, C.X., Y.Z. and G.L.; project administration, C.X. and H.W.; funding acquisition, C.X., Y.Z. and G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China Chongqing Science and Technology Commission, grant number cstc2020jscx-msxmX0086; the Chongqing University of Technology graduate education high-quality development project, grant number gzlsz202304; the Chongqing University of Technology First-class undergraduate project; the Chongqing University of Technology undergraduate education and teaching reform research project, grant number 2023YB124; the Chongqing University of Technology—Chongqing LINGLUE Technology Co., Ltd. Electronic Information (artificial intelligence) graduate joint training base; the Postgraduate Education and Teaching Reform Research Project in Chongqing, grant number yjg213116; and the Chongqing University of Technology—CISDI Chongqing Information Technology Co., Ltd. Computer Technology graduate joint training base.

Data Availability Statement

The experiments are evaluated on publicly open datasets. The datasets can be accessed in their corresponding published papers. Our code is available at https://github.com/shepherd0/db-MRNT.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  3. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  4. Hu, J.; Shen, L.; Albanie, S.; Lin, Z.; Liu, J. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  5. Biederman, I. Recognition-by-components: A theory of human image understanding. Psychol. Rev. 1987, 94, 115–147. [Google Scholar] [CrossRef] [PubMed]
  6. Koch, G.; Zemel, R.S.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2041–2049. [Google Scholar]
  7. Snell, J.; Swersky, K.; Zemel, R.S. Prototypical networks for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4077–4087. [Google Scholar]
  8. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K. Matching networks for one shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3637–3645. [Google Scholar]
  9. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.S.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208. [Google Scholar]
  10. Wu, Z.; Li, Y.; Guo, L.; Jia, K. PARN: Position-Aware Relation Networks for Few-Shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  11. Abdelaziz, M.; Zhang, Z. Multi-scale Kronecker-product relation networks for few-shot learning. Multimed. Tools Appl. 2022, 81, 6703–6722. [Google Scholar] [CrossRef]
  12. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 818–833. [Google Scholar]
  13. Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–8 May 2019; pp. 1–21. [Google Scholar]
  14. Zhang, X.; Qiang, Y.; Sung, F.; Yang, Y.; Hospedales, T. RelationNet2: Deep Comparison Network for Few-Shot Learning. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar] [CrossRef]
  15. Jiang, W.; Huang, K.; Geng, J.; Deng, X. Multi-Scale Metric Learning for Few-Shot Learning. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 1091–1102. [Google Scholar] [CrossRef]
  16. Wang, X.; Ma, B.; Yu, Z.; Li, F.; Cai, Y. Multi-Scale Decision Network With Feature Fusion and Weighting for Few-Shot Learning. IEEE Access 2020, 8, 92172–92181. [Google Scholar] [CrossRef]
  17. Wu, Z.; Zhao, H. Hierarchical Few-Shot Learning Based on Coarse- and Fine-Grained Relation Network. Artif. Intell. Rev. 2022, 56, 2011–2030. [Google Scholar] [CrossRef]
  18. Wang, Y.; Li, Y.; Xu, C.; Liang, Y.; Pan, S.; Yan, S. Few-shot Learning with multi-scale self-supervision. arXiv 2020, arXiv:2001.01600. [Google Scholar]
  19. Chen, H.; Li, H.; Li, Y.; Chen, C. Multi-Scale Adaptive Task Attention Network for Few-Shot Learning. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 4765–4771. [Google Scholar] [CrossRef]
  20. Wang, H.; Tian, S.; Tang, Q.; Chen, D. Few-Shot Image Classification Based on Multi-Scale Label Propagation. Comput. Res. Dev. 2022, 59, 1486–1495. [Google Scholar]
  21. Thrun, S.; Pratt, L. Learning to learn: Introduction and overview. In Neural Networks for Machine Learning; Springer: Boston, MA, USA, 1998; pp. 3–17. [Google Scholar]
  22. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1126–1135. [Google Scholar]
  23. Munkhdalai, T.; Yu, H. Meta Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 2554–2563. [Google Scholar]
  24. Ravi, S.; Larochelle, H. Optimization as a Model for Few-Shot Learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  25. Oreshkin, B.N.; Rodriguez, P.; Lacoste, A. TADAM: Task Dependent Adaptive Metric for Improved Few-Shot Learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Montréal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 719–729. [Google Scholar]
  26. Cai, Q.; Pan, Y.; Yao, T.; Yan, C.; Mei, T. Memory Matching Networks for One-Shot Image Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4080–4088. [Google Scholar] [CrossRef]
  27. Ren, M.; Triantafillou, E.; Ravi, S.; Snell, J.; Swersky, K.; Batra, D.; Fergus, R. Meta-learning for semi-supervised few-shot classification. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  28. Mishra, N.; Rohaninejad, M.; Chen, X.; Abbeel, P. A Simple Neural Attentive Meta-Learner. Master’s Thesis, EECS Department, University of California, Berkeley, CA, USA, 2018. [Google Scholar]
29. Antoniou, A.; Storkey, A.; Edwards, H. Data Augmentation Generative Adversarial Networks. arXiv 2017, arXiv:1711.04340.
30. Zhang, R.; Che, T.; Ghahramani, Z.; Bengio, Y.; Song, Y. MetaGAN: An Adversarial Approach to Few-Shot Learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Montréal, QC, Canada, 3–8 December 2018; pp. 2371–2380.
31. Hariharan, B.; Girshick, R. Low-Shot Visual Recognition by Shrinking and Hallucinating Features. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3037–3046.
32. Wang, Y.X.; Girshick, R.; Hebert, M.; Hariharan, B. Low-Shot Learning from Imaginary Data. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7278–7286.
33. Zhang, H.; Zhang, J.; Koniusz, P. Few-Shot Learning via Saliency-Guided Hallucination of Samples. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2765–2774.
34. Chen, Z.; Fu, Y.; Wang, Y.X.; Ma, L.; Liu, W.; Hebert, M. Image Deformation Meta-Networks for One-Shot Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8672–8681.
35. Chen, Z.; Fu, Y.; Zhang, Y.; Jiang, Y.G.; Xue, X.; Sigal, L. Multi-Level Semantic Feature Augmentation for One-Shot Learning. IEEE Trans. Image Process. 2019, 28, 4594–4605.
36. Xing, C.; Rostamzadeh, N.; Oreshkin, B.; Pinheiro, P.O. Adaptive Cross-Modal Few-Shot Learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 4848–4858.
37. Schwartz, E.; Karlinsky, L.; Feris, R.; Giryes, R.; Bronstein, A. Baby Steps towards Few-Shot Learning with Multiple Semantics. Pattern Recogn. Lett. 2022, 160, 142–147.
38. Li, W.; Wang, L.; Xu, J.; Huo, J.; Gao, Y.; Luo, J. Revisiting Local Descriptor Based Image-to-Class Measure for Few-Shot Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7253–7260.
39. Xue, Z.; Duan, L.; Li, W.; Chen, L.; Luo, J. Region Comparison Network for Interpretable Few-Shot Image Classification. arXiv 2020, arXiv:2009.03558.
40. Zhang, H.; Koniusz, P. Power Normalizing Second-Order Similarity Network for Few-Shot Learning. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019; pp. 1185–1193.
41. Xue, Z.; Xie, Z.; Xing, Z.; Duan, L. Relative Position and Map Networks in Few-Shot Learning for Image Classification. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 4032–4036.
42. Han, M.; Wang, R.; Yang, J.; Xue, L.; Hu, M. Multi-Scale Feature Network for Few-Shot Learning. Multimed. Tools Appl. 2020, 79, 11617–11637.
43. Hui, B.; Zhu, P.; Hu, Q.; Wang, Q. Self-Attention Relation Network for Few-Shot Learning. In Proceedings of the 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shanghai, China, 8–12 July 2019; pp. 198–203.
44. Ma, X.; Yu, C.; Yang, X.; Chen, X. Few-Shot Learning Based on Attention Relation Compare Network. In Proceedings of the 2019 International Conference on Data Mining Workshops (ICDMW), Beijing, China, 8–11 November 2019; pp. 658–664.
45. Tong, Y.; Tian, H.; Jiang, X.; Yin, J. Dual Branch Relation Network with Feature Weighting for Few-Shot Learning. In Proceedings of the 2021 7th International Conference on Computer and Communications (ICCC), Chengdu, China, 10–13 December 2021; pp. 1743–1751.
46. Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cross Attention Network for Few-Shot Classification. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
47. Ke, L.; Pan, M.; Wen, W.; Li, D. Compare Learning: Bi-Attention Network for Few-Shot Learning. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2233–2237.
48. Qin, Z.; Wang, H.; Mawuli, C.B.; Han, W.; Zhang, R.; Yang, Q.; Shao, J. Multi-Instance Attention Network for Few-Shot Learning. Inf. Sci. 2022, 611, 464–475.
49. Yan, B.; Zhou, C.; Zhao, B.; Guo, K.; Yang, J.; Li, X.; Zhang, M.; Wang, Y. Augmented Bi-Path Network for Few-Shot Learning. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 8461–8468.
50. Li, H.; Eigen, D.; Dodge, S.; Zeiler, M.; Wang, X. Finding Task-Relevant Features for Few-Shot Learning by Category Traversal. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1–10.
51. Su, Y.; Zhao, H.; Lin, Y. Few-Shot Learning Based on Hierarchical Classification via Multi-Granularity Relation Networks. Int. J. Approx. Reason. 2022, 142, 417–429.
52. Jia, X.; Su, Y.; Zhao, H. Few-Shot Learning via Relation Network Based on Coarse-Grained Granulation. Appl. Intell. 2022, 53, 996–1008.
53. Satorras, V.G.; Estrach, J.B. Few-Shot Learning with Graph Neural Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
54. He, J.; Hong, R.; Liu, X.; Xu, M.; Zha, Z.J.; Wang, M. Memory-Augmented Relation Network for Few-Shot Learning. In Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, Seattle, WA, USA, 12–16 October 2020; pp. 1236–1244.
55. Shen, Y.; Xiao, T.; Li, H.; Yi, S.; Wang, X. End-to-End Deep Kronecker-Product Matching for Person Re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6886–6895.
56. Shen, Y.; Xiao, T.; Yi, S.; Chen, D.; Wang, X.; Li, H. Person Re-Identification with Deep Kronecker-Product Matching and Group-Shuffling Random Walk. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1649–1665.
57. Hinton, G.E.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
58. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
59. Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3D Object Representations for Fine-Grained Categorization. In Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops, Sydney, NSW, Australia, 2–8 December 2013; pp. 554–561.
Figure 1. Illustration of the few-shot episode mechanism (inference framework), showing a 2-way 1-shot episode setting in which each class includes 3 query images as a task example.
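To make the episode protocol in Figure 1 concrete, the following is a minimal sketch (not the authors' code) of how an N-way K-shot episode with Q query images per class could be sampled; the function and data names are illustrative.

```python
import random

def sample_episode(images_by_class, n_way=2, k_shot=1, n_query=3, seed=None):
    """Sample one N-way K-shot episode: K support images and Q query
    images per class, drawn without overlap from each sampled class."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(images_by_class), n_way)  # pick N classes
    support, query = [], []
    for label, cls in enumerate(classes):
        imgs = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in imgs[:k_shot]]
        query += [(img, label) for img in imgs[k_shot:]]
    return support, query

# A 2-way 1-shot episode with 3 query images per class, as in Figure 1:
toy = {c: [f"{c}_{i}.jpg" for i in range(10)] for c in ["cat", "dog", "bird"]}
support, query = sample_episode(toy, seed=0)
print(len(support), len(query))  # -> 2 6
```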
Figure 2. Illustration of the DbMRNT framework, showing a 5-way 1-shot episode setting as an example.
Figure 3. Illustration of the multi-scale feature generator.
Figure 4. Illustration of relation networks in Branch 2. The support image feature of a single category (red) is shown as an example; the other categories follow the same pattern, each concatenated with the query image feature.
Table 1. The splits of the datasets.

Dataset                Class-All    Class-Train    Class-Val    Class-Test
miniImageNet [8]       100          64             16           20
tieredImageNet [27]    608          351            97           160
StanfordCars [59]      196          130            17           49
Table 2. Few-shot classification results (accuracy, %) on miniImageNet.

Model                    Type      5-Way 1-Shot    5-Way 5-Shot
Meta-learn Nets [24]     Meta      43.44 ± 0.77    60.60 ± 0.71
MetaGAN [30]             Meta      46.13 ± 1.78    60.71 ± 0.89
MAML [22]                Meta      48.70 ± 1.84    63.11 ± 0.92
Meta Nets [23]           Meta      49.21 ± 0.96    -
Meta SSL [27]            Meta      50.41 ± 1.84    64.39 ± 0.92
MM-Net [26]              Meta      53.37 ± 0.48    66.97 ± 0.35
SNAIL [28]               Meta      55.71 ± 0.99    68.88 ± 0.92
Matching Nets [8]        Metric    43.56 ± 0.84    55.31 ± 0.73
Prototypical Nets [7]    Metric    49.42 ± 0.78    68.20 ± 0.66
GNN [53]                 Metric    50.33 ± 0.36    66.41 ± 0.63
Relation Nets [9]        Metric    50.44 ± 0.82    65.32 ± 0.70
DN4 [38]                 Metric    51.24 ± 0.74    71.02 ± 0.64
MSDN [16]                Metric    52.59 ± 0.81    68.51 ± 0.69
SoSN [40]                Metric    52.96 ± 0.83    68.63 ± 0.68
RPMN [41]                Metric    53.35 ± 0.78    69.35 ± 0.59
RCN [39]                 Metric    53.47 ± 0.84    71.63 ± 0.70
DCN [14]                 Metric    53.48 ± 0.78    67.63 ± 0.59
MATANet [19]             Metric    53.63 ± 0.83    72.67 ± 0.76
MSFN [42]                Metric    54.48 ± 1.23    69.06 ± 0.69
PARN [10]                Metric    55.22 ± 0.82    71.55 ± 0.66
MSLPN [20]               Metric    56.52 ± 0.92    73.45 ± 0.86
MsKPRN [11]              Metric    57.02 ± 0.88    72.06 ± 0.68
MRN [54]                 Metric    57.83 ± 0.69    71.13 ± 0.50
ABNet [49]               Metric    58.12 ± 0.94    72.02 ± 0.75
DbMRNT (Ours)            Metric    58.82 ± 0.88    74.19 ± 0.63
Table 3. Few-shot classification results (accuracy, %) on tieredImageNet.

Model                    Type      5-Way 1-Shot    5-Way 5-Shot
MAML [22]                Meta      51.67 ± 1.81    70.30 ± 1.75
Prototypical Nets [7]    Metric    53.31 ± 0.89    72.69 ± 0.74
Relation Nets [9]        Metric    54.48 ± 0.93    71.32 ± 0.78
CGRN [52]                Metric    55.07 ± 0.20    71.34 ± 0.30
HMRN [51]                Metric    57.98 ± 0.26    74.70 ± 0.24
MSLPN [20]               Metric    58.69 ± 0.96    74.12 ± 0.73
DCN [14]                 Metric    60.58 ± 0.72    72.42 ± 0.69
ABNet [49]               Metric    62.10 ± 0.96    75.11 ± 0.78
MRN [54]                 Metric    62.65 ± 0.84    74.20 ± 0.64
DbMRNT (Ours)            Metric    62.95 ± 0.92    79.00 ± 0.70
Table 4. Few-shot classification results (accuracy, %) on Stanford Cars.

Model                    5-Way 1-Shot    5-Way 5-Shot
Matching Nets [8]        34.80 ± 0.98    44.70 ± 1.03
Prototypical Nets [7]    40.90 ± 1.01    52.93 ± 1.03
Relation Nets [9]        47.67 ± 0.47    60.59 ± 0.40
GNN [53]                 55.85 ± 0.97    71.25 ± 0.89
DN4 [38]                 61.51 ± 0.44    89.60 ± 0.44
MsKPRN [11]              76.64 ± 0.84    89.88 ± 0.46
DbMRNT (Ours)            78.30 ± 0.83    90.70 ± 0.44
Table 5. The effect of the dual-branch structure on miniImageNet (accuracy, %).

Model          5-Way 1-Shot    5-Way 5-Shot
Branch 1       57.11 ± 0.84    72.85 ± 0.61
Branch 2       56.18 ± 0.85    70.06 ± 0.66
B1 + B2        57.48 ± 0.84    73.46 ± 0.64
B1 + B2 + T    57.88 ± 0.85    73.76 ± 0.62
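The ablation above compares each branch alone against the combined model, where the per-scale relation scores of Branch 1 and the per-layer relation scores of Branch 2 are integrated for the final prediction. A minimal sketch of such score integration, assuming simple summation (the excerpt does not specify the exact combination rule, and the names are illustrative):

```python
import torch

def integrate_relation_scores(b1_scores, b2_scores):
    """Combine per-scale (Branch 1) and per-layer (Branch 2) relation
    scores and predict the best-matching class for each query.
    b1_scores / b2_scores: lists of [n_query, n_way] score tensors."""
    total = torch.stack(b1_scores + b2_scores).sum(dim=0)  # [n_query, n_way]
    return total.argmax(dim=1)  # predicted class index per query

# Example: 3 scales from Branch 1, 2 layers from Branch 2,
# 6 queries in a 5-way episode.
b1 = [torch.rand(6, 5) for _ in range(3)]
b2 = [torch.rand(6, 5) for _ in range(2)]
print(integrate_relation_scores(b1, b2))
```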
Table 6. The effect of information transfer between layers on miniImageNet (accuracy, %).

Model               5-Way 1-Shot    5-Way 5-Shot
Baseline            57.88 ± 0.85    73.76 ± 0.62
B1 + B2* w/o Φd1    58.41 ± 0.84    74.19 ± 0.63
B1 + B2* + T        58.82 ± 0.88    73.90 ± 0.60
Table 7. The effect of hyperparameter t on miniImageNet (accuracy, %).

Model     5-Way 1-Shot    5-Way 5-Shot
t = 1     58.71 ± 0.88    73.72 ± 0.61
t = 7     58.73 ± 0.85    73.90 ± 0.60
t = 20    58.82 ± 0.88    73.66 ± 0.60
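For context, the sweep over t ∈ {1, 7, 20} resembles a temperature search in distillation-style training [57], which would be consistent with Branch 1 "tutoring" Branch 2. Below is a minimal sketch of a temperature-softened tutoring loss, assuming (the excerpt does not confirm this) that t acts as such a temperature; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def tutoring_loss(student_logits, teacher_logits, t=7.0):
    """KL divergence between temperature-softened teacher (tutor, Branch 1)
    and student (Branch 2) distributions, as in distillation [57].
    Higher t yields softer target distributions."""
    soft_targets = F.softmax(teacher_logits / t, dim=1)
    log_probs = F.log_softmax(student_logits / t, dim=1)
    # Scale by t**2 so gradient magnitudes stay comparable across t values.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * t * t

# Example: Branch 1 scores tutor Branch 2 scores for 6 queries, 5 ways.
teacher = torch.randn(6, 5)
student = torch.randn(6, 5, requires_grad=True)
loss = tutoring_loss(student, teacher, t=7.0)
loss.backward()
```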
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
