Article

Lightweight Super-Resolution Reconstruction Vision Transformers of Remote Sensing Image Based on Structural Re-Parameterization

School of Transportation Science and Engineering, Beihang University, Beijing 102206, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(2), 917; https://doi.org/10.3390/app14020917
Submission received: 1 December 2023 / Revised: 19 January 2024 / Accepted: 19 January 2024 / Published: 21 January 2024

Abstract

In recent years, remote sensing image super-resolution reconstruction technology based on deep learning has developed rapidly. However, most algorithms in this domain concentrate solely on enhancing the super-resolution network’s performance while neglecting the equally crucial aspect of inference speed. In this study, we propose a method for lightweight super-resolution reconstruction of remote sensing images, termed SRRepViT. This approach reduces model parameters and floating-point operations during inference through equivalent parameter transformation. Using the RSSOD remote sensing dataset as our benchmark, we compared the reconstruction performance, inference time, and model size of SRRepViT with those of other classical methods. Compared to the lightweight model ECBSR, SRRepViT exhibits slightly improved reconstruction performance while reducing inference time by 16% and model parameters by 34%. Moreover, compared to other classical super-resolution reconstruction methods, the SRRepViT model achieves similar reconstruction performance while reducing model parameters by 98% and increasing inference speed by 90% for a single remote sensing image.

1. Introduction

Remote sensing images serve as a crucial data source, offering rich texture details essential for various fields, like change detection [1], target recognition [2], and land cover classification [3]. Their significance in the transportation industry cannot be overstated since these images provide indispensable insights for planning, design, construction, and monitoring of infrastructure [4,5,6,7]. Remote sensing images provide essential ground information crucial for various applications such as infrastructure planning, traffic flow analysis, construction progress tracking, facility status monitoring, environmental impact assessment, and rapid response to disasters [8,9]. The accuracy of data obtained from these images is fundamental, offering a scientific basis and critical input for decision-makers. However, several factors can compromise the quality of remote sensing images [10]. These factors range from mechanical errors in the optical imaging system and relative motion between the sensing platform and the ground to atmospheric disturbances, noise, and other variables [11]. While improving the resolution of remote sensing images through hardware upgrades is an option, it is often time-consuming and expensive [12,13]. Nevertheless, the development of integrated positioning, navigation, and timing (PNT) systems has catalyzed research into new remote sensing algorithms, particularly for mobile terminals and edge devices [14]. Consequently, specialized algorithms have become essential for the super-resolution reconstruction of low-quality and low-resolution remote sensing images, enabling the acquisition of high-quality visuals.
The traditional super-resolution reconstruction methods encompass interpolation [15], reconstruction [16], and learning-based [17] techniques. Interpolation primarily operates on known grayscale information within low-resolution images. Utilizing interpolation formulas, it enhances grayscale data between pixels to achieve image magnification. The interpolation methods typically include linear and nonlinear interpolation algorithms, such as nearest neighbor interpolation, bicubic interpolation, and wavelet transform interpolation. Reconstruction methods involve utilizing multiple low-resolution images alongside unknown high-resolution images to extract image feature information and are typically categorized into frequency domain and spatial domain techniques. Notable algorithms within this realm include iterative back projection, maximum a posteriori probability, and MAP/POCS methods [18,19,20]. Learning-based methods acquire prior knowledge to guide image reconstruction by learning the mapping relationship between low-resolution and high-resolution images. These methods commonly encompass manifold learning and sparse coding. Although these traditional techniques have facilitated the progress of remote sensing image reconstruction, they heavily rely on constraint construction and image alignment accuracy to achieve reconstruction effects and tend to falter in super-resolution reconstruction tasks that involve significant magnification. As a result, the outcomes often manifest issues such as blurred edge textures and inadequate detailing.
In recent years, artificial intelligence techniques leveraging deep learning have seen extensive application in enhancing and rectifying both structured and unstructured data due to their proficiency in artificial feature extraction and screening. Within the domain of remote sensing images, numerous researchers have proposed super-resolution reconstruction models based on convolutional neural networks (CNNs) [21,22,23,24,25,26,27,28,29,30]. Despite their effectiveness in enhancing reconstruction performance, these super-resolution models introduce additional computational complexity, resulting in substantial computing costs and increased memory consumption. A significant issue is that most research on efficient super-resolution design is carried out on GPU servers, which does not accurately reflect performance on mobile devices [31]. This discrepancy presents considerable challenges in remote sensing image processing, where high resource consumption and slow inferencing speeds are critical concerns [32]. Remote sensing image data typically includes extensive ground feature information, demanding significant computing resources and storage space. This can lead to bottlenecks in large-scale processing of remote sensing images, potentially making the technology economically and practically unfeasible [33]. Delays in accessing and analyzing image data can hinder timely access to vital information, such as crop data for farmers and agricultural scientists [34]. It also impacts planning and decision-making in urban planning and land management, reduces efficiency in environmental monitoring and resource management, and adversely affects resource exploration and development [35]. In critical situations like disaster monitoring and emergency rescue, the swift and accurate acquisition of high-resolution images is crucial for informed decision-making [36]. If the super-resolution reconstruction process entails prolonged inferencing times, it could significantly hamper the efficiency of information acquisition, thereby diminishing the effectiveness of decision support.
Hence, to effectively implement the super-resolution reconstruction model in engineering practice, it becomes crucial to explore lightweight network models for achieving rapid and efficient reconstruction of remote sensing images. A lightweight network entails transforming deep learning algorithms to minimize both size and speed while preserving accuracy to the greatest extent possible. Over the past decade, researchers have concentrated on designing lightweight convolutional neural networks, making significant strides in various effective design principles. These include separable convolutions [37], inverted residual bottleneck [38], channel shuffle [39], mixed depth-wise convolution [40], and structural re-parameterization [41]. Consequently, classical models like MobileNets [37], ShuffleNets [39], and RepVGG [41] have emerged.
Concurrently, transformer models have gained increasing attention. Transformers, based on the self-attention mechanism, establish a feature extraction network structure capable of calculating attention across each position in the input sequence to derive global context information. Leveraging the self-attention mechanism empowers transformer models to extract contextual feature representations efficiently during training while adapting to input sequences of varying lengths. Furthermore, vision transformers (ViTs) have been developed with numerous efficient design principles that notably enhance network computing efficiency on mobile devices. These lightweight ViT models include EfficientFormer [42], MobileFormer [43], and the MobileViTs [44,45], exhibiting superior performance and lower latency compared to CNNs when deployed on mobile devices.
To realize the lightweight learning model for the efficient and accurate reconstruction of remote sensing images, a super-resolution reconstruction vision transformer based on structural re-parameterization (SRRepViT) is proposed in this work. The main contributions of our research are as follows:
  • A transformer is used instead of a CNN, and the self-attention mechanism of the transformer is used to extract the complex spatial and spectral features of remote sensing images.
  • The neat topology and structural re-parameterization are adopted to reduce model parameters and speed up model inferencing.
The paper is organized as follows. Section 2 introduces the background and methodology of image super-resolution reconstruction. Section 3 provides the specific architecture of the network model. Section 4 shows the experimental preparation. Section 5 shows the experimental results and analysis. Finally, Section 6 draws conclusions.

2. Related Work

2.1. Image Super-Resolution Reconstruction Based on Deep Learning

The pioneering advancements in image super-resolution reconstruction were spearheaded by Dong et al. [21]. They introduced the super-resolution convolutional neural network (SRCNN), a model that reconstructed high-resolution images by conducting feature extraction and non-linear mapping of input images. Building upon this foundation, Dong et al. [22] expanded the SRCNN model by integrating deconvolution and shared mapping layers, altering the original feature dimension, thus creating the fast super-resolution convolutional neural network (FSRCNN). This modification notably accelerated network training. With the evolution of convolutional neural network architectures, Kim et al. [23] introduced residual learning into super-resolution, unveiling the very deep super-resolution network (VDSR). Tai et al. [24] introduced the deep recursive residual network (DRRN), implementing a multi-path model that comprised local residual learning, global residual learning, and multi-weight recursive learning. This construction led to a deeper network structure that significantly enhanced the super-resolution reconstruction effect. Ledig et al. [25] introduced the super-resolution generative adversarial network (SRGAN), incorporating a generative adversarial framework into the super-resolution reconstruction process. Wang et al. [26] introduced residual dense blocks and eliminated the batch normalization layer, thereby proposing the enhanced super-resolution generative adversarial network (ESRGAN). Lim et al. [27] introduced the enhanced deep residual network (EDSR), removing the BN layer from SRResNet and utilizing the saved space to augment the model’s depth. Zhang et al. [28] designed the residual dense network (RDN), incorporating dense connection blocks based on SRDenseNet and ResNet. Furthermore, Zhang et al. [29] proposed the residual channel attention network (RCAN), integrating a channel attention mechanism to dynamically readjust channel characteristics, thereby achieving enhanced accuracy and visual effects. Additionally, Khattab et al. [30] devised a novel method for multi-frame super-resolution image reconstruction. Their approach focused on increasing visual information and enhancing automatic machine perception to bolster the effectiveness of multi-frame super-resolution image reconstruction.
The transformer has recently garnered attention within the computer vision community due to its significant success in natural language processing [46]. A range of transformer-based methodologies have emerged, excelling in advanced visual tasks and showcasing strengths in modeling long-range dependencies [47,48]. Capitalizing on this impressive performance, the vision transformer was also adapted for low-level visual tasks. For instance, the pre-trained image-processing transformer (IPT) introduced a ViT-style network and implemented multitask pre-training specifically for image processing [49]. Likewise, SwinIR proposed an image restoration transformer based on a hierarchical vision transformer using shifted windows [50,51]. The encoder–decoder-based transformer (EDT) furthered this line of work by adopting a self-attention mechanism and a multi-task pre-training strategy, advancing super-resolution research [52]. Additionally, the hybrid attention transformer (HAT) devised by Chen et al. [53] combined channel attention and self-attention and established an overlapping cross-attention module to augment cross-window information interaction.
However, the aforementioned models, aiming to enhance super-resolution reconstruction performance, introduce intricate deep network structures. Consequently, these complexities escalate the computational load during network training and inference, rendering them unsuitable for deployment in environments constrained by limited computing resources.

2.2. Network Lightweight

In response to the limitations posed by computing resources, researchers have directed their efforts toward crafting lightweight networks without significantly compromising model performance. One prominent contribution was MobileNets, pioneered by Howard et al. [37,38], which leveraged depth-wise separable convolutions to construct lightweight neural networks, drastically reducing computational complexity and the number of model parameters. Similarly, Zhang et al. [39,54] introduced ShuffleNet, employing point-wise group convolution and channel shuffling to notably curtail computational overhead while enhancing operational efficiency. In the realm of super-resolution reconstruction, the focus has been on designing lightweight feature extraction modules to improve efficiency. Ahn et al. [55] devised the cascading residual network (CARN), utilizing group convolutions to minimize floating-point operations. Additionally, Ding et al. [41] proposed RepVGG, decomposing normal convolutions into multi-branch blocks and enhancing traditional VGG performance across multiple high-level visual tasks to rival the ResNet series. Addressing real-time super-resolution, Zhang et al. [31] designed a streamlined topology and a re-parameterized edge-oriented convolution block for super-resolution (ECBSR) to expedite inferencing. Bhardwaj et al. [56] employed linear convolutions and collapsible linear blocks to design super-efficient super-resolution (SESR) networks, which can be folded and merged during inferencing to balance reconstruction quality and inference time. Moreover, Zhang et al. [57] introduced the reparameterization-based lightweight image super-resolution network (RepSCN), incorporating re-parameterized distillation blocks (RepDB), self-calibrated distillation blocks (SCDB), and a lightweight coordinate attention mechanism (CAM) to enhance spatial and channel-level feature representation.
In recent years, researchers have dedicated themselves to making ViTs lightweight enough for mobile devices [58]. Notable efforts include MobileViT, which adopts a hybrid architecture merging lightweight MobileNet blocks with MHSA blocks [44]. MobileViTv2 went a step further by proposing a separable self-attention method, curbing the quadratic computational complexity of MHSA [45]. EfficientFormer introduced a dimensionally consistent design paradigm, pushing the boundaries of the latency–performance trade-off for pure ViTs [42]. Meanwhile, MobileFormer devised an architecture that runs MobileNet and a transformer in parallel, connected via a two-way bridge [43]. Integrating these efficient architectural choices from lightweight ViTs, Wang et al. [59] progressively modernized a standard lightweight CNN, MobileNetV3, culminating in the development of RepViT. Despite adopting a MetaFormer structure [60], RepViT is composed solely of convolutional layers and shows superior performance and efficiency compared to most advanced lightweight ViTs across various computer vision tasks. Furthermore, Wang et al. [32] proposed a novel lightweight super-resolution architecture called DCTA, tailored for remote sensing applications; it introduces a distillation CNN–transformer module (DCTB) that combines the advantages of CNN and transformer structures in a lightweight manner.

3. Methodology and Model

To achieve lightweight super-resolution reconstruction of remote sensing images, this paper proposes a ViT-based network model (SRRepViT) employing structurally re-parameterized convolutional blocks. The overall architecture and main blocks of the proposed SRRepViT model are depicted in Figure 1; the design targets high efficiency and flexibility in lightweight scenarios. The edge-oriented convolution block (ECB) effectively captures image edge details and texture features at both the input and output ends of the network. The ECB employs multiple paths for feature extraction, encompassing a normal 3 × 3 convolution, channel expansion and compression convolutions, and first- and second-order spatial derivatives of intermediate features. The core network utilizes RepViT, which displays an excellent balance between performance and computational efficiency compared to advanced lightweight ViTs across diverse visual tasks. The inclusion of a channel attention block (CAB) enhances feature representation by extracting inter-channel correlations within the input feature map through attention mechanisms, elevating network expressiveness and model performance while maintaining computational efficiency. For the final reconstruction module, the pixel-shuffle convolution method is employed to upsample the fused features.
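For orientation, the PyTorch sketch below illustrates how these four components are composed. The channel width, block count, and the plain-convolution stand-ins for the ECB, RepViT, and CAB internals are illustrative assumptions; the authors' TensorFlow implementation is not reproduced here.

```python
# Illustrative sketch of the overall SRRepViT data flow in Figure 1 (assumed hyperparameters).
import torch
import torch.nn as nn

class SRRepViTSketch(nn.Module):
    def __init__(self, channels=64, num_blocks=6, scale=4):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)   # ECB at the input end (Section 3.1)
        self.body = nn.Sequential(*[                       # stand-in for RepViT blocks + CAB (Sections 3.2-3.3)
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU())
            for _ in range(num_blocks)])
        self.tail = nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1)  # ECB at the output end
        self.upsample = nn.PixelShuffle(scale)             # sub-pixel reconstruction (Section 3.4)

    def forward(self, x):
        feat = self.head(x)
        feat = self.body(feat) + feat                      # global residual over the backbone
        return self.upsample(self.tail(feat))

print(SRRepViTSketch()(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 3, 256, 256])
```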

3.1. Edge-Oriented Convolutional Block

The ECB consists of four modules. First, the conventional 3 × 3 convolution is used to ensure the basic functionality. The batch normalization (BN) layer is removed because it hinders the performance of super-resolution reconstruction and may lead to artifacts on the reconstructed high-resolution images. The normal convolution is expressed as:
$F_n = K_n * X + B_n$
where $F_n$, $X$, $K_n$, and $B_n$ represent the output feature, input feature, weight, and bias of the normal convolution, respectively, and $*$ denotes the convolution operation.
Second, expansion and compression convolutions are used to extract richer features:
$F_{es} = K_s * (K_e * X + B_e) + B_s$
where $K_e$, $B_e$ and $K_s$, $B_s$ represent the weights and biases of the 1 × 1 expansion convolution and the 3 × 3 compression convolution, respectively.
Then, predefined edge filters with learnable scaling factors are applied to extract horizontal and vertical edge information:
$F_{sob} = [(S_{D_x} \cdot D_x) \circledast (K_x * X + B_x) + B_{D_x}] + [(S_{D_y} \cdot D_y) \circledast (K_y * X + B_y) + B_{D_y}]$
where $D_x$ and $D_y$ denote the horizontal and vertical Sobel filters; $K_x$, $B_x$, $K_y$, and $B_y$ are the weights and biases of the 1 × 1 convolutions in the horizontal and vertical branches; $S_{D_x}$, $B_{D_x}$, $S_{D_y}$, and $B_{D_y}$ are scaling parameters and biases with the shape C × 1 × 1 × 1; $\circledast$ and $*$ denote depth-wise convolution (DWConv) and normal convolution, respectively; and $\cdot$ indicates channel-wise broadcast multiplication.
Finally, a Laplacian filter is used to extract second-order edge information:
$F_{lap} = (S_{lap} \cdot D_{lap}) \circledast (K_l * X + B_l) + B_{lap}$
where $K_l$ and $B_l$ are the weight and bias of the 1 × 1 convolution, $D_{lap}$ is the Laplacian filter, and $S_{lap}$ and $B_{lap}$ are the scaling factor and bias of the DWConv, respectively.
The output of ECB consists of four components:
$F = F_n + F_{es} + F_{sob} + F_{lap}$
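As a concrete illustration of the four branches summed above, the PyTorch sketch below builds a training-time ECB-style block. The channel count, expansion ratio, and filter initializations are assumptions for readability, not the authors' implementation.

```python
# Minimal training-time sketch of the four ECB branches (assumed sizes and initializations).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECBSketch(nn.Module):
    def __init__(self, c=64, expand=2):
        super().__init__()
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)              # F_n: normal 3x3 convolution
        self.expand = nn.Conv2d(c, expand * c, 1)               # F_es: 1x1 expansion ...
        self.compress = nn.Conv2d(expand * c, c, 3, padding=1)  # ... followed by 3x3 compression
        self.pre_x = nn.Conv2d(c, c, 1)                         # 1x1 convs ahead of the edge filters
        self.pre_y = nn.Conv2d(c, c, 1)
        self.pre_l = nn.Conv2d(c, c, 1)
        sobel = torch.tensor([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        # Fixed depth-wise edge filters D with learnable per-channel scales S and biases B
        self.register_buffer('dx', sobel.repeat(c, 1, 1, 1))
        self.register_buffer('dy', sobel.t().repeat(c, 1, 1, 1))
        self.register_buffer('dl', lap.repeat(c, 1, 1, 1))
        self.sx, self.sy, self.sl = (nn.Parameter(torch.ones(c, 1, 1, 1)) for _ in range(3))
        self.bx, self.by, self.bl = (nn.Parameter(torch.zeros(c)) for _ in range(3))

    def forward(self, x):
        f_n = self.conv3(x)
        f_es = self.compress(self.expand(x))
        f_sob = (F.conv2d(self.pre_x(x), self.sx * self.dx, self.bx, padding=1, groups=self.dx.shape[0])
                 + F.conv2d(self.pre_y(x), self.sy * self.dy, self.by, padding=1, groups=self.dy.shape[0]))
        f_lap = F.conv2d(self.pre_l(x), self.sl * self.dl, self.bl, padding=1, groups=self.dl.shape[0])
        return f_n + f_es + f_sob + f_lap
```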

3.2. RepViT

The primary framework of the network adopts RepViT, which notably enhances the block structure of MobileNetV3-L by separating the token mixer and the channel mixer. In the original MobileNetV3-L block, a 1 × 1 expansion convolution precedes a depth-wise convolution and a subsequent 1 × 1 projection layer, and these components are connected through a residual connection. The 1 × 1 expansion convolution and the 1 × 1 projection layer handle channel interaction, while the depth-wise convolution mixes spatial information; in the MobileNetV3 block, the former corresponds to the channel mixer, the latter to the token mixer, and the two are intertwined. Building on this foundation, RepViT moves the depth-wise convolution up and relocates the squeeze-and-excitation (SE) module so that it follows the depth-wise filters. This restructuring is pivotal, as the SE module relies on spatial information interaction. Consequently, separating the channel mixer and token mixer and integrating the SE layer in a cross-block manner across all stages yields accuracy gains with minimal additional latency. To enhance performance, a multi-branch topology is introduced for the depth-wise filter during training. The structural layout of the RepViT model is illustrated in Figure 2.
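A minimal PyTorch sketch of this token mixer/channel mixer separation is given below. The channel width, expansion ratio, SE reduction ratio, and the particular branch set are assumptions, and BN layers are omitted to stay consistent with the BN-free design used for super-resolution in this paper.

```python
# Sketch of a RepViT-style block: multi-branch depth-wise token mixer + SE, then channel mixer.
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, c, r=4):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(c // r, c, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.gate(x)

class RepViTBlockSketch(nn.Module):
    def __init__(self, c=64, expand=2):
        super().__init__()
        # Token mixer: multi-branch depth-wise convolution (3x3 + 1x1 + identity), then SE
        self.dw3 = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.dw1 = nn.Conv2d(c, c, 1, groups=c)
        self.se = SqueezeExcite(c)
        # Channel mixer: 1x1 expansion -> GELU -> 1x1 projection, with its own residual
        self.mixer = nn.Sequential(nn.Conv2d(c, expand * c, 1), nn.GELU(),
                                   nn.Conv2d(expand * c, c, 1))

    def forward(self, x):
        x = self.se(self.dw3(x) + self.dw1(x) + x)   # spatial mixing (re-parameterizable branches)
        return x + self.mixer(x)                     # channel mixing

print(RepViTBlockSketch()(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```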

3.3. Channel Attention Block

The utilization of channel attention activates more pixels by involving global information in the computation of channel attention weights. Furthermore, research has demonstrated that convolution helps transformers achieve better visual representations and easier optimization. Consequently, we introduce a convolution block integrated with channel attention to augment the network’s representational capability. The channel attention block (CAB) comprises two standard convolution layers with GELU activation alongside a channel attention (CA) module. Within RepViT, there exist two principal processing layers: one dedicated to spatial feature processing, encompassing operations like convolution and pooling, and the other responsible for channel feature processing, managing interactions among feature channels. The CAB is placed within the channel mixer, at the final step of the processing layer sequence, and is responsible for incorporating the channel attention mechanism. Within the CAB, a constant β is used to compress the channel count of the two convolution layers: for an input feature with C channels, the channel count after the first convolution layer is compressed to C/β channels, and the feature is then expanded back to C channels through the second layer. Following this, the channel-wise features are adaptively rescaled using the standard CA module. Finally, the weighted feature map, multiplied by a scaling factor and added to the input feature map, forms a component of RepViT’s output.
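The PyTorch sketch below shows one way to realize the CAB just described. The compression factor β, the CA reduction ratio, and the residual scaling value are assumptions rather than the paper's exact settings.

```python
# Sketch of a channel attention block (compress to C/beta, expand back, rescale channels).
import torch
import torch.nn as nn

class CABSketch(nn.Module):
    def __init__(self, c=64, beta=3, scale=0.1):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(
            nn.Conv2d(c, c // beta, 3, padding=1), nn.GELU(),   # first conv: compress channels to C/beta
            nn.Conv2d(c // beta, c, 3, padding=1))              # second conv: expand back to C
        # Standard channel attention: global pooling -> bottleneck MLP -> sigmoid weights
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(c, c // 16, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c // 16, c, 1), nn.Sigmoid())

    def forward(self, x):
        feat = self.body(x)
        feat = feat * self.ca(feat)          # adaptively rescale channel-wise features
        return x + self.scale * feat         # weighted features added back to the input

print(CABSketch()(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```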

3.4. Pixel Shuffle

In the image reconstruction phase, the pixel-shuffle convolution up-sampling method is employed. Initially, the low-resolution feature image undergoes convolutional layer processing. This processed feature image is then partitioned into multiple subsets using sub-pixel convolutional layers, with each subset corresponding to a specific pixel position in the output image. These subsets are subsequently rearranged and merged to generate high-resolution feature images. The ReLU activation function is applied to enhance the expressive capabilities of these feature images. Finally, the feature image, processed through sub-pixel convolution and the activation function, represents the predicted output of the required high-resolution image.
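A minimal PyTorch sketch of this reconstruction step is shown below, assuming a 4× scale factor and a 64-channel feature input; nn.PixelShuffle performs the sub-pixel rearrangement described above, and the trailing ReLU mirrors the activation mentioned in the text.

```python
# Sketch of the pixel-shuffle reconstruction head (assumed scale factor and channel width).
import torch
import torch.nn as nn

scale, c = 4, 64
upsampler = nn.Sequential(
    nn.Conv2d(c, 3 * scale ** 2, 3, padding=1),  # produce r^2 sub-pixel channels per output channel
    nn.PixelShuffle(scale),                      # rearrange (C*r^2, H, W) -> (C, H*r, W*r)
    nn.ReLU(inplace=True))                       # activation applied after the sub-pixel convolution

print(upsampler(torch.randn(1, c, 64, 64)).shape)   # torch.Size([1, 3, 256, 256])
```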

3.5. Structural Re-Parameterization

In the inferencing process, both the ECB and RepViT can be structurally re-parameterized. In the ECB module, each 1 × 1 convolution can be expanded to a 3 × 3 convolution and merged with the subsequent 3 × 3 convolution into a single normal convolution, while the predefined edge filters and the Laplacian filter can be regarded as 3 × 3 convolutions with special sparse constraints on the channels. By the additivity of convolution, the 3 × 3 convolutions obtained from the individual branches can then be summed into a single normal 3 × 3 convolution. Each structurally re-parameterized 3 × 3 convolution block can be represented as:
$F = K_{rep} * X + B_{rep}$
where $K_{rep}$ and $B_{rep}$ denote the merged weight and bias.
The structural re-parameterization process of the edge-oriented convolution block is shown in Figure 3.
During training, RepViT introduces a multi-branch topology for the depth-wise convolution to improve performance. In the inferencing process, this multi-branch structure can be merged into a single branch, eliminating the extra computation and memory costs incurred by the multiple branches. The structural re-parameterization process of RepViT is shown in Figure 4.
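The PyTorch sketch below illustrates the branch-merging idea for a multi-branch depth-wise convolution: the 1 × 1 kernel is zero-padded to 3 × 3, the identity mapping is written as a 3 × 3 kernel, and the branches are summed into a single kernel and bias. The channel count is arbitrary, and the check only demonstrates the equivalence, not the authors' exact merging code.

```python
# Folding a parallel 3x3 DW conv + 1x1 DW conv + identity into one 3x3 DW conv at inference.
import torch
import torch.nn as nn
import torch.nn.functional as F

c = 8
dw3 = nn.Conv2d(c, c, 3, padding=1, groups=c)
dw1 = nn.Conv2d(c, c, 1, groups=c)

# K_rep = K_3x3 + pad(K_1x1) + I,  B_rep = B_3x3 + B_1x1
k_rep = dw3.weight + F.pad(dw1.weight, [1, 1, 1, 1])   # place the 1x1 kernel at the 3x3 centre
identity = torch.zeros_like(dw3.weight)
identity[:, :, 1, 1] = 1.0                              # per-channel identity kernel for DW conv
k_rep = k_rep + identity
b_rep = dw3.bias + dw1.bias

x = torch.randn(1, c, 16, 16)
y_multi = dw3(x) + dw1(x) + x                                 # training-time multi-branch output
y_single = F.conv2d(x, k_rep, b_rep, padding=1, groups=c)     # re-parameterized single conv
print(torch.allclose(y_multi, y_single, atol=1e-6))           # True (up to floating-point error)
```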

4. Experimental Preparation

4.1. Dataset

In our experiments, we utilized the widely recognized remote sensing dataset RSSOD [61], comprising 1759 manually annotated images. These images have an average resolution of 856 × 853 pixels, with spatial resolutions ranging from 0.05 m to 0.8 m. Following standard practice, we randomly partitioned the dataset into training, validation, and test sets with a ratio of 7:2:1. The comprehensive breakdown of image allocation within the RSSOD dataset is summarized in Table 1.

4.2. Evaluation Index

To evaluate the reconstruction performance of the proposed method, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as evaluation indices in this study; both indices are defined below. PSNR focuses on pixel-level differences, while SSIM accounts for the structural and perceptual similarity of the image. Both metrics are interpretable, intuitive, widely used, and standardized, which makes the experimental results easy to compare and verify against other studies. PSNR is expressed in decibels (dB), where a higher value indicates better reconstruction performance. SSIM ranges between 0 and 1, with a value closer to 1 indicating higher similarity between the reconstructed image and the original one.
The PSNR is the ratio between the maximum possible power of an image and the power of corrupting noise that affects the quality of its representation. Given a reference image I and a test image K , both of size m × n , the PSNR between I and K is defined by:
$MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2$
$PSNR = 10 \cdot \log_{10}\left( \frac{MAX_I^2}{MSE} \right) = 20 \cdot \log_{10}\left( \frac{MAX_I}{\sqrt{MSE}} \right)$
where MSE is the mean squared error between the reconstructed image and the reference image, and $MAX_I$ is the maximum possible pixel value of the image (255 for 8-bit images).
The SSIM is designed by modeling any image distortion as a combination of three factors, which are loss of correlation, luminance distortion, and contrast distortion. The SSIM is defined as:
$SSIM(H, \hat{H}) = l(H, \hat{H}) \, c(H, \hat{H}) \, s(H, \hat{H}) = \frac{(2 \mu_H \mu_{\hat{H}} + C_1)(2 \sigma_{H\hat{H}} + C_2)}{(\mu_H^2 + \mu_{\hat{H}}^2 + C_1)(\sigma_H^2 + \sigma_{\hat{H}}^2 + C_2)}$
where $l$, $c$, and $s$ represent the luminance, contrast, and structure comparison functions, respectively; $H$ and $\hat{H}$ represent the original and reconstructed images, respectively; $C_1$ and $C_2$ are small constants that stabilize the division; and $\mu$ and $\sigma$ denote the mean and standard deviation of the image pixel values, with $\sigma_{H\hat{H}}$ the covariance between $H$ and $\hat{H}$. The detailed derivation of the PSNR and SSIM definitions can be found in Reference [62].
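As a worked example of the two indices, the Python sketch below computes PSNR directly from the MSE-based formula and uses scikit-image's structural_similarity for SSIM; the random test images are placeholders for a high-resolution patch and its reconstruction.

```python
# PSNR from the MSE formula above; SSIM via scikit-image's reference implementation.
import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Two random 8-bit grayscale images stand in for an HR patch and its reconstruction
rng = np.random.default_rng(0)
hr = rng.integers(0, 256, (128, 128), dtype=np.uint8)
sr = np.clip(hr.astype(int) + rng.integers(-5, 6, hr.shape), 0, 255).astype(np.uint8)
print(psnr(hr, sr), structural_similarity(hr, sr, data_range=255))
```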

4.3. Parameter Setting and Experimental Environment

In our experiment, we utilized low-resolution images obtained by downscaling and blurring the original remote sensing images as inputs for reconstructing high-resolution images at a 4× magnification factor. We compared the reconstruction performance and inference speed of our proposed model against several established super-resolution reconstruction methods, including bicubic, EDSR, RCAN, SwinIR, ECBSR, and HAT. Bicubic is a classical interpolation method, EDSR and RCAN are CNN-based models, SwinIR and HAT are transformer models, and ECBSR is a typical lightweight re-parameterized model; together, they cover a spectrum of traditional and contemporary techniques for remote sensing image reconstruction.
The experiment comprised four training rounds, each spanning 30,000 epochs. The best-performing model from the previous round was employed as the pre-training model for the subsequent round. The {gt_size, learning rate} pair for each training round was set to $\{144^2, 5 \times 10^{-4}\}$, $\{192^2, 1 \times 10^{-4}\}$, $\{208^2, 5 \times 10^{-5}\}$, and $\{208^2, 5 \times 10^{-5}\}$, respectively.
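For clarity, the schedule can be written as a simple loop; only the {gt_size, learning rate} pairs and the 30,000 epochs per round come from the text, while the train_round helper below is a hypothetical placeholder rather than the authors' training API.

```python
# Hypothetical sketch of the four-round progressive training schedule.
def train_round(gt_size: int, lr: float, epochs: int, init_weights=None) -> str:
    """Hypothetical stand-in for one training round; returns a checkpoint identifier."""
    print(f"round: gt_size={gt_size}x{gt_size}, lr={lr}, epochs={epochs}, init={init_weights}")
    return f"best_gt{gt_size}_lr{lr}.ckpt"

# {gt_size, learning rate} pairs per round, taken from the text; 30,000 epochs per round
schedule = [(144, 5e-4), (192, 1e-4), (208, 5e-5), (208, 5e-5)]

best_ckpt = None
for gt_size, lr in schedule:
    # The best model of the previous round initializes the next round (progressive training)
    best_ckpt = train_round(gt_size=gt_size, lr=lr, epochs=30_000, init_weights=best_ckpt)
```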
The deep learning framework used in these experiments was TensorFlow 2.7. The hardware training platform was a single-GPU computer (CPU: Intel Core [email protected] GHz, 16 cores; RAM: 64 GB; GPU: NVIDIA GeForce RTX 3090).

5. Results and Analysis

5.1. Progressive Training

The progressive training strategy adopted in this study is tailored specifically for remote sensing images. Its iterative approach helps the model learn image features more effectively, enhancing both generalization performance and training stability when handling large-scale images. The gradual increase in input image size means dealing with larger and more intricate images; larger images often contain additional structural details, which can lead to challenges such as exploding or vanishing gradients during training. Progressive training mitigates these complexities by allowing the model to gradually acclimate to different input image sizes. This gradual adaptation avoids overwhelming the model with the intricacies of large images at the initial stages, thus enhancing training stability. Furthermore, the progressive increase in input image size compels the model to assimilate richer and higher-level features, fostering improved feature learning and enhancing the model’s generalization capability. Moreover, progressive training mitigates the risk of the model converging to a suboptimal solution early in training, facilitating the exploration of the global optimal solution or a solution close to it.
Figure 5 illustrates the variation in PSNR and SSIM across the four-round training process. The model improves consistently not only within each training round but also across successive rounds. These outcomes underscore the efficacy of the progressive training strategy in steadily enhancing the model’s reconstruction performance.

5.2. Visual Effects Evaluation

In this paper, two groups of representative complex remote sensing images were selected to showcase the testing results of the reconstruction experiment, as shown in Figure 6.
Figure 6 depicts the outcomes of various reconstruction methods applied to remote sensing images. The bicubic method yields images with roughly restored content but lacks intricate texture details, resulting in a blurry appearance. Leveraging the robust feature extraction abilities of CNN and transformer networks, full-size models like EDSR, SwinIR, RCAN, and HAT offer superior reconstruction by capturing finer details and texture changes in the original remote sensing images. In contrast, while the ECBSR model significantly enhances texture features and restores ground object details compared to bicubic, its reconstruction effect does not match the full-size models due to its lightweight network structure. Notably, our proposed SRRepViT model, optimized through structural re-parameterization as a lightweight model, showcases enhanced reconstruction performance compared to ECBSR, presenting finer edge and texture information in the images. Upon visual comparison with other super-resolution reconstruction models, SRRepViT stands on par with EDSR, SwinIR, RCAN, and HAT, and slightly outperforms ECBSR in capturing image details.

5.3. Quantitative Evaluation

Table 2 presents the PSNR and SSIM results across different models, providing a quantitative assessment of their reconstruction performance. Additionally, Table 3 showcases a comprehensive comparison of model parameters and inferencing speed, offering a holistic view of the models’ overall capabilities, considering both reconstruction performance and efficiency. PSNR and SSIM evaluation metrics represent the average values across all images.
Table 2 indicates that despite being a lightweight network, the SRRepViT model outperforms the ECBSR model in PSNR and SSIM by 0.3 dB and 0.015, respectively. Notably, the SRRepViT model exhibits slightly superior reconstruction performance compared to full-size models like EDSR, SwinIR, and RCAN, showing a PSNR lead of 0.005–0.02 dB and an SSIM lead of up to 0.01 on the same dataset.
In Table 3, SRRepViT demonstrates significant reductions in FLOPs and parameters compared to ECBSR, with inferencing time notably shorter than all full-size models. The FLOPs, parameters, and inferencing time of SRRepViT are 48%, 21%, and 52% of ECBSR’s, respectively, while maintaining better reconstruction performance. Moreover, when compared to other models with similar reconstruction accuracy, SRRepViT remarkably reduces FLOPs and parameters. Despite a slightly lower super-resolution accuracy than the HAT model, SRRepViT’s FLOPs and parameters stand at a mere 0.7% and 0.6% of HAT’s, aligning well with lightweight model and mobile deployment requirements.
In Table 4, the SRRepViT ablation experiment reveals the ViT module’s substantial impact on super-resolution reconstruction, increasing PSNR by 0.2 dB and SSIM by 0.01 while simultaneously reducing FLOPs and parameters. Incorporating the subsequent channel attention module and refining the pixel shuffle further improves reconstruction accuracy but increases FLOPs, parameters, and inference time. However, structural re-parameterization slashes FLOPs and parameters by roughly 50% and reduces the inference time for a single image to 62% of its previous value while maintaining reconstruction accuracy.
In Figure 7, a comprehensive comparison of network performance in terms of model inference speed, complexity, and super-resolution reconstruction capability is presented. The vertical axis represents super-resolution reconstruction performance (PSNR), where SRRepViT ranks at the forefront among the existing networks, slightly trailing HAT. Compared to the HAT model, our proposed SRRepViT model has only 0.7% of the FLOPs and 0.6% of the parameters, with the inference time per single image being only 0.9% of HAT’s. The horizontal axis denotes model efficiency, showing SRRepViT’s lead over all other networks in the experiment in inferencing speed. Additionally, the size of each model marker represents its memory footprint, showing that SRRepViT possesses the fewest parameters among all experimental networks. Consequently, the proposed SRRepViT network emerges as the most balanced solution, excelling simultaneously in inferencing speed, complexity, and super-resolution reconstruction performance, which substantiates its efficiency for effective remote sensing image super-resolution reconstruction.

6. Conclusions

In this research, we aim to address the challenges of high memory usage and prolonged inferencing times in super-resolution reconstruction for remote sensing images. To tackle this, we propose the SRRepViT model by combining ECB and RepViT modules using structural re-parameterization techniques while introducing an additional channel attention module. The objective is to achieve superior super-resolution reconstruction performance while significantly reducing the memory footprint and speeding up the inferencing process. Comparing SRRepViT with classic models like EDSR, SwinIR, RCAN, and HAT, our results reveal that SRRepViT maintains similar levels of reconstruction performance, as evidenced by classic evaluation metrics like PSNR and SSIM, yet achieves substantial reductions in model parameters and inferencing time while providing clear texture details in the reconstructed high-resolution remote sensing images. Additionally, compared to the lightweight ECBSR model, SRRepViT reduces inferencing time by 16% and model parameters by 34% while slightly improving reconstruction performance. These findings underline that SRRepViT strikes a favorable balance between performance and efficiency in reconstructing remote sensing images, making it promising for deployment in resource-constrained environments. In the future, we will explore re-parameterization strategies for more efficient feature extraction modules, enabling our model to be embedded in edge devices.

Author Contributions

Conceptualization, J.B. and Y.L.; methodology, J.B. and Y.L.; software, J.B. and Y.L.; validation, J.B. and Y.L.; formal analysis, J.B. and Y.L.; investigation, J.B. and Y.L.; resources, J.B. and Y.L.; data curation, J.B. and Y.L.; writing—original draft, J.B.; writing—review & editing, J.B. and J.C.; visualization, J.B. and Y.L.; supervision, J.C.; project administration, J.B., Y.L. and J.C.; funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This project is supported by the National Key R&D Program of China (Grant. No.2021YFB2600300).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, [J.C.], upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bai, T.; Wang, L.; Yin, D.; Sun, K.; Chen, Y.; Li, W.; Li, D. Deep learning for change detection in remote sensing: A review. Geo-Spat. Inf. Sci. 2023, 26, 262–288. [Google Scholar] [CrossRef]
  2. Wang, J.; Liu, H.; Jiang, P.; Wang, Z.; Sui, Q.; Zhang, F. GPRI2Net: A Deep-Neural-Network-Based Ground Penetrating Radar Data Inversion and Object Identification Framework for Consecutive and Long Survey Lines. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5106320. [Google Scholar] [CrossRef]
  3. Xu, Y.; Gong, J.; Huang, X.; Hu, X.; Li, J.; Peng, M. Luojia-HSSR: A high spatial-spectral resolution remote sensing dataset for land-cover classification with a new 3D-HRNet. Geo-Spat. Inf. Sci. 2023, 26, 289–301. [Google Scholar] [CrossRef]
  4. Zhou, G.; Wei, D. Survey and Analysis of Land Satellite Remote Sensing Applied in Highway Transportations Infrastructure and System Engineering. In Proceedings of the IGARSS 2008—2008 IEEE International Geoscience and Remote Sensing Symposium, Boston, MA, USA, 8–11 July 2008; pp. 479–482. [Google Scholar] [CrossRef]
  5. Bridgelall, R.; Rafert, J.B.; Tolliver, D. Hyperspectral applications in the global transportation infrastructure. In Proceedings of the 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015; pp. 739–743. [Google Scholar] [CrossRef]
  6. Yang, L.; Siddiqi, A.; Weck, O.L. Urban Roads Network Detection from High Resolution Remote Sensing. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 7431–7434. [Google Scholar] [CrossRef]
  7. Zheng, S.; Dai, H.; Wang, G.; Miao, L.; Zhang, W. Application of Transportation Superiority in Beijing-Tianjin-Hebei Region Based on High-Resolution Satellite Remote Sensing Data. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 6964–6967. [Google Scholar] [CrossRef]
  8. Gagliardi, V.; Tosti, F.; Ciampoli, L.B.; Battagliere, M.L.; Tapete, D.; D’Amico, F.; Threader, S.; Alani, A.M.; Benedetto, A. Spaceborne Remote Sensing for Transport Infrastructure Monitoring: A Case Study of the Rochester Bridge, UK. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 4762–4765. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Dong, X.; Shang, L.; Zhang, D.; Wang, D. A Multi-modal Graph Neural Network Approach to Traffic Risk Forecasting in Smart Urban Sensing. In Proceedings of the 2020 17th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Como, Italy, 22–25 June 2020; pp. 1–9. [Google Scholar] [CrossRef]
  10. Duan, Y.; He, J.; Lu, Y.; Yu, X. Analysis of the Factors Affecting Airborne Digital Sensor Image Quality. IEEE Access 2019, 7, 8018–8027. [Google Scholar] [CrossRef]
  11. Xu, H.; Sun, R.; Zhang, L.; Tang, Y.; Liu, S.; Wang, Z. Influence on Image Interpretation of Band to Band Registration Error in High Resolution Satellite Remote Sensing Imagery. In Proceedings of the 2012 2nd International Conference on Remote Sensing, Environment and Transportation Engineering, Nanjing, China, 1–3 June 2012; pp. 1–4. [Google Scholar] [CrossRef]
  12. Shaw, G.A.; Burke, H.H.K. Spectral imaging for remote sensing. Lincoln Lab. J. 2003, 14, 3–28. [Google Scholar]
  13. Da Silva, E.; Woolliams, E.R.; Picot, N.; Poisson, J.-C.; Skourup, H.; Moholdt, G.; Fleury, S.; Behnia, S.; Favier, V.; Arnaud, L.; et al. Towards Operational Fiducial Reference Measurement (FRM) Data for the Calibration and Validation of the Sentinel-3 Surface Topography Mission over Inland Waters, Sea Ice, and Land Ice. Remote Sens. 2023, 15, 4826. [Google Scholar] [CrossRef]
  14. Prol, F.S.; Ferre, R.M.; Saleem, Z.; Valisuo, P.; Pinell, C.; Lohan, E.S.; Elsanhoury, M.; Elmusrati, M.; Islam, S.; Celikbilek, K.; et al. Position, Navigation, and Timing (PNT) Through Low Earth Orbit (LEO) Satellites: A Survey on Current Status, Challenges, and Opportunities. IEEE Access 2022, 10, 83971–84002. [Google Scholar] [CrossRef]
  15. Zhang, L.; Wu, X. An edge-guided image interpolation algorithm via directional filtering and data fusion. IEEE Trans. Image Process 2006, 15, 2226–2238. [Google Scholar] [CrossRef]
  16. Li, X.; Hu, Y.; Gao, X.; Tao, D.; Ning, B. A Multi-frame Image Super-resolution Method. Signal Process. 2010, 90, 405–414. [Google Scholar] [CrossRef]
  17. Zeng, K.; Lu, T.; Liang, X.; Liu, K.; Chen, H.; Zhang, Y. Face Super-Resolution Via Bilayer Contextual Representation. Signal Process. Image Commun. 2019, 75, 147–157. [Google Scholar] [CrossRef]
  18. Qiu, D.; Cheng, Y.; Wang, X. Gradual Back-Projection Residual Attention Network for Magnetic Resonance Image Super-Resolution. Comput. Methods Programs Biomed. 2021, 208, 106252. [Google Scholar] [CrossRef] [PubMed]
  19. Wang, B.; Chen, X.; Li, J.; Cao, J. An Improved Weighted Projection Onto Convex Sets Method for Seismic Data Interpolation and Denoising. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 228–235. [Google Scholar] [CrossRef]
  20. Jakhetiya, V.; Lin, W.; Jaiswal, S.P.; Guntuku, S.C.; Au, O.C. Maximum a Posterior and Perceptually Motivated Reconstruction Algorithm: A Generic Framework. IEEE Trans. Multimed. 2017, 19, 93–106. [Google Scholar] [CrossRef]
  21. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 184–199. [Google Scholar] [CrossRef]
  22. Dong, C.; Chen, C.; Tang, X. Accelerating the Super-resolution Convolutional Neural Network. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 391–407. [Google Scholar]
  23. Kim, J.; Junk, K.; Kyoung, M. Accurate Image Super-resolution Using very Deep Convolutional Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar] [CrossRef]
  24. Tai, Y.; Yang, J.; Liu, X. Image Super-Resolution via Deep Recursive Residual Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 2790–2798. [Google Scholar] [CrossRef]
  25. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic Single Image Super-resolution Using a Generative Adversarial Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 4681–4690. [Google Scholar] [CrossRef]
  26. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. Esrgan: Enhanced Super-resolution Generative Adversarial Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 63–69. [Google Scholar] [CrossRef]
  27. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 1132–1140. [Google Scholar] [CrossRef]
  28. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-resolution. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2472–2481. [Google Scholar] [CrossRef]
  29. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar] [CrossRef]
  30. Khattab, M.M.; Zeki, A.M.; Alwan, A.A.; Bouallegue, B.; Matter, S.S.; Ahmed, A.M. A hybrid regularization-based multi-frame super-resolution using bayesian framework. Comput. Syst. Sci. Eng. 2023, 44, 35–54. [Google Scholar] [CrossRef]
  31. Zhang, X.; Zeng, H.; Zhang, L. Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4034–4043. [Google Scholar] [CrossRef]
  32. Wang, Y.; Shao, Z.; Lu, T.; Liu, L.; Huang, X.; Wang, J.; Jiang, K.; Zeng, K. A lightweight distillation CNN-transformer architecture for remote sensing image super-resolution. Int. J. Digit. Earth 2023, 16, 3560–3579. [Google Scholar] [CrossRef]
  33. Xiao, Z.; Liu, Y. Remote sensing image database based on NOSQL database. In Proceedings of the 2011 19th International Conference on Geoinformatics, Shanghai, China, 24–26 June 2011; pp. 1–5. [Google Scholar] [CrossRef]
  34. Zhang, X.; Liu, R.; Gan, F.; Wang, W.; Ding, L.; Yan, B. Evaluation of Spatial-Temporal Variation of Vegetation Restoration in Dexing Copper Mine Area Using Remote Sensing Data. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 2013–2016. [Google Scholar] [CrossRef]
  35. Zhang, F.; Chen, J. Ningxia Integrative Geological Information System Based on SQL Server 2008. Geomat. Spat. Inf. Technol. 2011, 34, 83–85. [Google Scholar]
  36. Li, C.; Yuan, X.; Zhang, J.; Du, P.; Mi, L.; Li, Z. Earthquake Damage Monitoring and Assessment Based on High-Resolution Remote Sensing Images-Take Lushan Earthquake as an Example. In Proceedings of the 2018 26th International Conference on Geoinformatics, Kunming, China, 28–30 June 2018; pp. 1–4. [Google Scholar] [CrossRef]
  37. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  38. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  39. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar] [CrossRef]
  40. Tan, M.; Le, Q. Mixconv: Mixed depthwise convolutional kernels. arXiv 2019. [Google Scholar] [CrossRef]
  41. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar] [CrossRef]
  42. Li, Y.; Yuan, G.; Wen, Y.; Hu, J.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. Efficientformer: Vision transformers at mobilenet speed. Adv. Neural Inf. Process Syst. 2022, 35, 12934–12949. [Google Scholar] [CrossRef]
  43. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobileformer: Bridging mobilenet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5270–5279. [Google Scholar] [CrossRef]
  44. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021. [Google Scholar] [CrossRef]
  45. Mehta, S.; Rastegari, M. Separable self attention for mobile vision transformers. arXiv 2022. [Google Scholar] [CrossRef]
  46. Ashish, V.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process Syst. 2017, 30, 6000–6010. [Google Scholar] [CrossRef]
  47. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020. [Google Scholar] [CrossRef]
  48. Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Adv. Neural Inf. Process Syst. 2021, 34, 08810. [Google Scholar] [CrossRef]
  49. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12294–12305. [Google Scholar] [CrossRef]
  50. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  51. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  52. Li, W.; Lu, X.; Qian, S.; Lu, J. On efficient transformer and image pre-training for low-level vision. arXiv 2021. [Google Scholar] [CrossRef]
  53. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating More Pixels in Image Super-Resolution Transformer. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 22367–22377. [Google Scholar] [CrossRef]
  54. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  55. Ahn, N.; Kang, B.; Sohn, K.A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 59–64. [Google Scholar] [CrossRef]
  56. Bhardwaj, K.; Milosavljevic, M.; Chalfin, A.; O’Neil, L.; Gope, D.; Matas, R.; Chalfin, A.; Suda, N.; Meng, L.; Loh, D. Collapsible Linear Blocks for Super-Efficient Super Resolution. arXiv 2021. [Google Scholar] [CrossRef]
  57. Zhang, S.; Chen, X.; Huang, X. Lightweight Image Super-Resolution Based on Re-Parameterization and Self-Calibrated Convolution. Comput. Intell. Neurosci. 2022, 2022, 8628402. [Google Scholar] [CrossRef]
  58. Pan, J.; Bulat, A.; Tan, F.; Zhu, X.; Dudziak, L.; Li, H.; Tzimiropoulos, G.; Martinez, B. Edgevits: Competing light-weight CNNS on mobile devices with vision transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin, Germany, 2022; pp. 294–311. [Google Scholar] [CrossRef]
  59. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting Mobile CNN From ViT Perspective. arXiv 2023. [Google Scholar] [CrossRef]
  60. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10809–10819. [Google Scholar] [CrossRef]
  61. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote Sensing Image Super-resolution and Object Detection: Benchmark and State of the Art. Expert Syst. Appl. 2022, 197, 116793. [Google Scholar] [CrossRef]
  62. Horé, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar] [CrossRef]
Figure 1. SRRepViT model.
Figure 2. RepViT model.
Figure 3. The structural re-parameterization of ECB.
Figure 4. The structural re-parameterization of RepViT.
Figure 5. Variation of PSNR and SSIM during the training.
Figure 6. Comparison of the results of super-resolution reconstruction generated by different methods with real remote sensing images.
Figure 7. Comparison of performance indicators between SRRepViT and other networks.
Table 1. Details of image selection for RSSOD dataset.

| Dataset Name | Extracted Patches | Size | Spatial Resolution (m) |
|---|---|---|---|
| ISPRS Potsdam | 1368 | 1000 × 1000 | 0.05 |
| UC Merced Land-Use | 99 | 256 × 256 | 0.3048 |
| NWPU-RESISC45 | 101 | 256 × 256 | 0.8 |
| Draper Satellite Image Chronology | 11 | 1000 × 1000 | 0.2 |
| Ship Images from Google | 180 | ~421 × 388.5 | 0.8 |
| Total | 1759 | Avg: 856 × 853 | |
Table 2. PSNR and SSIM results of different methods.

| Metric | ECBSR | EDSR | SwinIR | RCAN | SRRepViT | HAT |
|---|---|---|---|---|---|---|
| PSNR (dB) | 30.6891 | 30.9186 | 30.9453 | 30.9379 | 30.9528 | 31.1809 |
| SSIM | 0.8291 | 0.8361 | 0.8377 | 0.8366 | 0.8429 | 0.8434 |
Table 3. FLOPs, parameters, and inferencing speeds of different methods.

| Metric | ECBSR | EDSR | SwinIR | RCAN | SRRepViT | HAT |
|---|---|---|---|---|---|---|
| Params | 0.78 M | 1.52 M | 11.82 M | 15.59 M | 0.25 M | 40.26 M |
| FLOPs | 36.32 G | 130.27 G | 0.77 T | 1.05 T | 17.39 G | 2.53 T |
| Speed | 0.025 s | 0.083 s | 0.345 s | 0.223 s | 0.013 s | 1.443 s |
Table 4. The ablation study of SRRepViT.

| Configuration | PSNR (dB) | SSIM | FLOPs | Params | Speed |
|---|---|---|---|---|---|
| ECBSR | 30.6891 | 0.8291 | 36.32 G | 0.78 M | 0.025 s |
| + ViT | 30.8917 | 0.8382 | 17.56 G | 0.29 M | 0.015 s |
| + ViT + attention | 30.9176 | 0.8401 | 30.02 G | 0.49 M | 0.020 s |
| + ViT + attention + pixel shuffle | 30.9528 | 0.8429 | 31.44 G | 0.51 M | 0.021 s |
| + ViT + attention + pixel shuffle + re-parameterization | 30.9528 | 0.8429 | 17.39 G | 0.25 M | 0.013 s |
