Article

No-Reference Image Quality Assessment Based on a Multitask Image Restoration Network

1 Department of Artificial Intelligence, Shenzhen University, Shenzhen 518060, China
2 Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong, China
3 Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA 01854, USA
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(11), 6802; https://doi.org/10.3390/app13116802
Submission received: 19 April 2023 / Revised: 30 May 2023 / Accepted: 2 June 2023 / Published: 3 June 2023
(This article belongs to the Special Issue Artificial Neural Network Applications in Pattern Recognition)

Abstract:
When image quality is evaluated, the human visual system (HVS) infers the details in the image through its internal generative mechanism. In this process, the HVS integrates both local and global information about the image, utilizes contextual information to restore the original image, and compares the restored information with the distorted image to evaluate quality. Inspired by this mechanism, a no-reference image quality assessment method based on a multitask image restoration network is proposed. The multitask image restoration network generates a pseudo-reference image as its main task and a structural similarity index measure (SSIM) map as an auxiliary task; the two tasks mutually promote each other to generate a higher-quality pseudo-reference image. In addition, when predicting the image quality score, both the quality restoration features and the difference features between the distorted and pseudo-reference images are used, thereby fully exploiting the information in the pseudo-reference image. To enable the model to extract both global and local features, a multi-scale feature fusion module is introduced. Experimental results demonstrate that the proposed method achieves excellent performance on both synthetically and authentically distorted databases.

1. Introduction

Images play a pivotal role in conveying valuable information and have gained significant prominence across diverse domains, including advertising, entertainment, medicine, education, and numerous other areas. However, factors such as lighting conditions during image acquisition and the stability of the transmission path may cause the loss of original image information, which can seriously degrade users’ visual experience and the effectiveness of image usage [1]. The goal of image quality assessment (IQA) is to assess the quality of images using objective methods and provide valuable evaluation results. Accurate and efficient IQA methods are of great significance for improving user experience and optimizing the accuracy of vision-related applications. IQA methods can be categorized into three types based on reference image usage: full-reference IQA (FR-IQA) [2,3,4], reduced-reference IQA (RR-IQA) [5,6], and no-reference IQA (NR-IQA) [7,8,9,10]. NR-IQA is the most practical of the three, especially when reference images are difficult to obtain in real applications. Therefore, NR-IQA has become one of the research hotspots in the field of IQA and has important practical research value [11].
The primary difficulty in NR-IQA lies in the absence of reference images, which makes it infeasible to assess quality by comparing the features of distorted images with those of reference images. However, when perceiving a distorted image, the human visual system (HVS) draws on prior knowledge in the brain and utilizes an internal generative mechanism (IGM) [12,13,14] to reconstruct the reference image as faithfully as possible. The evaluation is then performed by analyzing the differences between the reconstructed image and the distorted image [15], with greater differences indicating more severe distortion. To acquire prior knowledge of reference images, some researchers have proposed methods based on generative adversarial networks (GANs) that generate pseudo-reference images [16,17]; image quality is then assessed by evaluating the difference between the distorted image and the pseudo-reference image. However, GAN-based methods typically face two challenges. First, the training process of GANs is often unstable, making it difficult to achieve satisfactory image restoration performance. Second, when dealing with severely distorted images, GANs may struggle to effectively restore image quality.
To address the aforementioned problems, an NR-IQA method based on a multitask image restoration network (MT-IRN) is proposed in this paper. MT-IRN includes a multitask image restoration sub-network and a score prediction sub-network. The multitask image restoration sub-network is used to generate pseudo-reference images and structural similarity index measure (SSIM) maps between distorted images and reference images, as well as to extract quality restoration features. The score prediction sub-network extracts high-level features and multi-scale content features of the distorted image, together with difference features between the distorted and pseudo-reference images, and maps these features to quality scores after concatenation. Specifically, the contributions of this paper are as follows:
  • First, a multitask image restoration network is introduced to generate high-quality surrogate reference images. Its main task is to generate pseudo-reference images and its auxiliary task is to generate SSIM maps, with the two tasks mutually reinforcing each other to produce higher-quality pseudo-reference images.
  • Second, in addition to the feature differences between distorted and pseudo-reference images, quality restoration features are also employed, enabling the model to exploit not only the differences between pseudo-reference and distorted images but also the rich semantic information within the quality restoration features.
  • Third, a multi-scale feature fusion module is proposed to fully fuse quality restoration features and multi-scale content features of distorted images, enabling the model to extract both global and local features simultaneously.
The rest of this paper is organized as follows. Section 2 introduces related studies for NR-IQA. Section 3 provides a detailed description of MT-IRN. Section 4 reports the experimental results. Section 5 concludes this paper.

2. Related Studies

Owing to the absence of reference images, numerous conventional NR-IQA approaches concentrate on particular distortion types present in images. For example, filtering-based methods are used to estimate noise in images [18], and sharpness and blurriness estimation algorithms are used to evaluate the quality of blurry images [19]. These methods can achieve higher accuracy when the distortion type is identifiable. In addition, some methods [7,8,9] do not target specific distortions but instead extract general quality features that can describe multiple types of distortion. The focus and difficulty of such methods lie in selecting which features best measure the level of distortion: these features are generally extracted manually through natural scene statistics (NSS) in traditional methods, whereas they can be learned automatically in deep learning-based methods [20,21,22,23].
Moorthy and Bovik [7] first proposed a blind image quality index (BIQI) for general distortion types. This method fits the wavelet decomposition coefficients of images with a Generalized Gaussian Distribution (GGD) and uses the parameters of the GGD model as features. Mittal et al. [8] proposed a blind/referenceless image spatial quality evaluator (BRISQUE) to utilize NSS in the spatial domain. This approach first computes the multi-scale mean-subtracted contrast-normalized (MSCN) coefficients of distorted images. Then, it fits the MSCN coefficients and their related coefficients to predict quality scores. Ghadiyaram et al. [9] proposed a feature map-based referenceless image quality evaluation engine (FRIQUEE) for authentic distortion assessment. The aim of this method is to capture the statistical consistency or deviations from consistency in authentically distorted images without assuming the presence of a specific distortion type.
With the development and improvement of deep learning, its application in IQA has received increasing attention. Deep learning has powerful fitting and generalization capabilities, enabling it to learn feature representations from large amounts of training data and associate these features with image quality. Therefore, more and more researchers have explored the use of deep learning algorithms to improve the accuracy and reliability of IQA. Kang et al. [20] first proposed IQA-CNN, a method that employs convolutional neural networks (CNNs). IQA-CNN utilizes a single convolutional layer for feature extraction and maps the extracted features to quality scores through two fully connected layers. Bosse et al. [21] employed a deeper CNN to extract high-level features from images. The method involves a modified VGG16 and two fully connected layers for feature extraction and score prediction. The results indicate that using high-level image features can significantly improve model performance. Su et al. [22] proposed HyperIQA for authentically distorted images, which predicts quality scores based on the perception of content. Zhang et al. [23] proposed a deep bilinear CNN (DB-CNN), which utilizes two streams to extract synthetic distortion features and authentic distortion features from images. Pan et al. [24] proposed an NR-IQA method called blind predicting similar quality map (BPSQM), which consists of a fully convolutional neural network and a pooling network. The global convolutional network is trained using the similarity maps from traditional FR-IQA methods, enabling the network to predict the corresponding similarity quality map for distorted images. The pooling network is then employed to regress the quality map into a quality score. Pan et al. [25] proposed a distortion-aware CNN (DACNN) for NR-IQA. DACNN employs a pretraining strategy to extract distinctive features for both synthetic and authentic distortions. These features are then fused using a feature fusion module and mapped to quality scores using a quality prediction module. Liang et al. [26] introduced a context-based approach, employing a graphical model to capture the influence of context on image quality. The model describes the relationship between content and background, allowing for the extraction of context features used for quality score prediction. Zhou et al. [27] proposed an NR-IQA method that utilizes attention mechanisms for feature fusion. They employed VGG19, VGG16, and ResNet-50 to extract the texture, local, and global information of the images, respectively. By incorporating attention mechanisms, they fused the multilevel features to enable perception of different types of distortion. Li et al. [28] proposed MMMNet, which fuses multi-scale and multi-level features. MMMNet treats the extraction of saliency information as a subtask and leverages a multitask learning mechanism to enhance model performance. Finally, it fuses the multi-scale saliency features with content features to predict quality scores.
In recent years, some novel deep learning methods have achieved significant success in various visual domains, and researchers have also applied them to IQA. Zhu et al. applied meta-learning to IQA and proposed MetaIQA [29] and MetaIQA+ [30]. They treated the handling of different distortion types as different tasks and introduced a task selection strategy to enhance the model’s robustness when encountering unseen distortions, thereby improving its generalization capability. Zhang et al. [31] introduced continual learning into IQA, enabling the model to learn continuously from a series of datasets. They used three metrics to measure prediction accuracy, adaptability, and robustness, introduced a prediction head for each new dataset, and incorporated a regularization mechanism that enables the evolution of all prediction heads while safeguarding against catastrophic forgetting. Madhusudana et al. [32] proposed CONTRIQUE, a framework that leverages contrastive learning for training on datasets without quality scores. CONTRIQUE incorporates the prediction of distortion types and levels as auxiliary tasks and employs contrastive learning for pretraining on datasets without quality scores. Subsequently, the feature extraction parameters are frozen, and the score prediction part is trained on the target dataset. Experimental results demonstrate that CONTRIQUE exhibits outstanding generalization performance.
Several GAN-based approaches have been proposed to tackle the lack of reference images by generating pseudo-reference images. Ren et al. [16] proposed restorative adversarial nets for NR-IQA (RAN4IQA), which include a restorer, a discriminator, and a predictor. The restorer and discriminator collaborate to generate the pseudo-reference images; the predictor then extracts features from the distorted and pseudo-reference images and maps them to quality scores. Similarly, Lin et al. [17] proposed Hall-IQA, which also generates pseudo-reference images using a GAN and feeds the difference image between the distorted and pseudo-reference images into a regression network for quality score prediction. Both RAN4IQA and Hall-IQA have achieved good results, but GAN-based methods suffer from an unstable training process and struggle to achieve good image restoration performance, especially when faced with severely distorted images. Pan et al. [33] proposed a method based on visual compensation restoration (VCR), named VCRNet, which uses restoration features to avoid the performance degradation caused by the suboptimal quality of the pseudo-reference image. However, it does not use the difference features between the pseudo-reference and distorted images.
Despite the notable achievements of the aforementioned approaches, there are still opportunities for further enhancement. Inspired by BPSQM, this paper proposes an NR-IQA method based on a multitask image restoration network, in which the main task is to generate a pseudo-reference image and the auxiliary task is to generate the SSIM map. By leveraging the mutual promotion between the two tasks, a high-quality pseudo-reference image can be generated. Furthermore, when predicting quality scores, not only the quality restoration features but also the difference features between the distorted and pseudo-reference images are utilized, which fully exploits the information from the pseudo-reference image. Lastly, a feature fusion module is devised to incorporate multi-scale information, enabling the model to effectively capture both global and local features.

3. Proposed Method

This paper presents a novel NR-IQA method called MT-IRN that utilizes a multitask image restoration network. MT-IRN is comprised of a multitask image restoration sub-network and a score prediction sub-network, as shown in Figure 1. The multitask image restoration sub-network is used to generate pseudo-reference images and SSIM maps between the reference images and distorted images and to extract quality restoration features. This enables the model to not only use the differences between the pseudo-reference and distorted images but also leverage rich semantic information from the quality restoration features. The score prediction sub-network extracts high-level and multi-scale features from the distorted images and difference features between the distorted images and pseudo-reference images.

3.1. Multitask Image Restoration Sub-Network

The quality of the pseudo-reference images is crucial for the performance of the model. MT-IRN utilizes a multitask image restoration network to generate higher-quality pseudo-reference images, thereby improving the model’s performance. Inspired by BPSQM, the multitask image restoration sub-network takes the generation of pseudo-reference images as the main task and the generation of SSIM maps as an auxiliary task, producing higher-quality pseudo-reference images through the mutual promotion of the two goals. The SSIM map provides rich structural information about the reference image; generating it as the auxiliary task allows the image restoration sub-network to learn the structural information of the reference image and thus improves the structural similarity of the generated pseudo-reference image.
MT-IRN employs a U-Net architecture [34,35,36] to construct the image restoration sub-network. U-Net is an encoder–decoder structure widely used in the fields of image segmentation and restoration. Its skip connection structure concatenates low-level features with high-level features, allowing the decoder to retain more detailed information during the upsampling process. The structure of the multitask image restoration sub-network is illustrated in Figure 2. The multitask learning mode adopts a hard sharing mode [37,38], in which the two tasks share the parameters of a shallow layer, and then different convolutional layers are set for each task to achieve their respective goals.
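To make the hard parameter-sharing scheme concrete, the following PyTorch sketch shows the general idea under stated assumptions: a single shared trunk (standing in for the shared encoder–decoder layers, here collapsed to one convolution for brevity) feeds two task-specific heads that output the 3-channel pseudo-reference image and the 1-channel SSIM map. The sigmoid output activations and channel widths are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SharedTrunk(nn.Module):
    """Stand-in for the shared shallow layers; collapsed to one conv in this sketch."""
    def __init__(self, out_ch=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class MultitaskRestorer(nn.Module):
    """Hard parameter sharing: one shared trunk feeding two task-specific heads."""
    def __init__(self):
        super().__init__()
        self.trunk = SharedTrunk(out_ch=16)
        self.ref_head = nn.Conv2d(16, 3, kernel_size=3, padding=1)   # main task: pseudo-reference image
        self.ssim_head = nn.Conv2d(16, 1, kernel_size=3, padding=1)  # auxiliary task: SSIM map

    def forward(self, distorted):
        shared = self.trunk(distorted)                 # features shared by both tasks
        pseudo_ref = torch.sigmoid(self.ref_head(shared))
        ssim_map = torch.sigmoid(self.ssim_head(shared))
        return pseudo_ref, ssim_map

x = torch.rand(2, 3, 224, 224)
pseudo_ref, ssim_map = MultitaskRestorer()(x)
print(pseudo_ref.shape, ssim_map.shape)  # (2, 3, 224, 224) and (2, 1, 224, 224)
```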
The multitask image restoration sub-network accepts the distorted image as input and generates the pseudo-reference image, the SSIM map, and the quality restoration features $F_1$, $F_2$, and $F_3$ as its outputs. The encoder consists of six convolutional modules, E1–E6. E1 is a single convolutional layer composed of 3 × 3 convolution kernels with a stride of 1. To avoid gradient vanishing [39] when deepening the model and to reuse low-level features, E2–E6 use residual convolutional blocks for downsampling. The specific structures of the two types of residual blocks are shown in Figure 3. E2 employs the residual block 1 architecture, comprising two 3 × 3 convolutional layers with stride 1; notably, this architecture preserves the size of the input. E3–E6 employ the residual block 2 architecture, which comprises a 3 × 3 convolutional layer with stride 2 and another 3 × 3 convolutional layer with stride 1, so the output size is reduced by half compared to the input size. Since the sizes of the input and output feature maps are inconsistent, a 1 × 1 convolution with a stride of 2 is applied to the input in the residual connection to match their sizes.
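A minimal PyTorch sketch of the two residual blocks described above; the ReLU placement is an assumption, while the kernel sizes, strides, and the 1 × 1 shortcut follow the description and Table 1.

```python
import torch
import torch.nn as nn

class ResidualBlock1(nn.Module):
    """E2: two 3x3 convolutions with stride 1 and an identity shortcut; size preserved."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)

class ResidualBlock2(nn.Module):
    """E3-E6: a 3x3 conv with stride 2 followed by a 3x3 conv with stride 1.
    A 1x1 conv with stride 2 on the shortcut matches the halved feature-map size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + self.shortcut(x))

x = torch.rand(1, 16, 224, 224)
print(ResidualBlock1(16)(x).shape)      # torch.Size([1, 16, 224, 224])  (E2)
print(ResidualBlock2(16, 32)(x).shape)  # torch.Size([1, 32, 112, 112])  (E3)
```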
The decoder consists of six deconvolutional layers, D1–D6, which perform upsampling on the high-level features of distorted images to generate the pseudo-reference image and SSIM map. Through multi-level skip connections, the decoder effectively preserves the detailed information in the input image while avoiding the loss of feature details caused by pooling layers, thus improving the image restoration performance. D1–D6 are all composed of 3 × 3 deconvolutional layers, where the stride of D1–D4 is 2, and the stride of D5 and D6 is 1. The specific structural parameters of the image restoration sub-network are summarized in Table 1.

3.2. Score Prediction Sub-Network

A score prediction sub-network is used to extract features from distorted and pseudo-reference images and to predict image quality scores, as shown in Figure 4. The score prediction sub-network takes the distorted image, the pseudo-reference image, and the quality restoration features $F_1$, $F_2$, and $F_3$ as inputs, and a ResNet-50 [40] pre-trained on ImageNet [41] is used as a feature extractor for the content features of the image. First, the outputs of Conv2_10, Conv3_12, and Conv4_18 in ResNet-50 are used as multi-scale features of the distorted image and are fully fused, respectively, with the quality restoration features $F_1$, $F_2$, and $F_3$ extracted by the image restoration sub-network through the multi-scale feature fusion module. Then, the high-level features of the distorted image, namely the output of Conv5_9 in ResNet-50, are subtracted from the high-level features of the pseudo-reference image; after dimension reduction by a 1 × 1 convolution, the difference feature between the distorted and pseudo-reference images is obtained. Finally, the high-level feature, the multi-scale feature, and the difference feature are globally average-pooled, concatenated, and mapped to quality scores by fully connected layers.
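The difference-feature branch can be sketched as follows in PyTorch, using the torchvision ResNet-50 as the pre-trained backbone; the reduced channel width after the 1 × 1 convolution (512 here) is an assumption for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DifferenceFeature(nn.Module):
    """High-level difference feature between a distorted image and its pseudo-reference."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")                     # pre-trained on ImageNet
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # up to the last conv stage
        self.reduce = nn.Conv2d(2048, 512, kernel_size=1)                # 1x1 dimension reduction
        self.gap = nn.AdaptiveAvgPool2d(1)                               # global average pooling

    def forward(self, distorted, pseudo_ref):
        f_dist = self.features(distorted)       # (B, 2048, H/32, W/32)
        f_ref = self.features(pseudo_ref)
        diff = self.reduce(f_dist - f_ref)      # subtract, then reduce dimensions
        return self.gap(diff).flatten(1)        # (B, 512), ready for concatenation

x_d, x_r = torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224)
print(DifferenceFeature()(x_d, x_r).shape)  # torch.Size([2, 512])
```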
A multi-scale feature fusion module is used to effectively fuse the multi-scale content features and the quality restoration features, as shown in Figure 5. Considering that these two types of features come from different network structures, their feature scales may differ. Therefore, multi-scale convolutions are applied to each feature separately to extract features at various scales, as shown in Equation (1). The multi-scale convolution uses two cascaded 3 × 3 convolutions to achieve a receptive field of 5 × 5 while reducing the number of parameters.
$$MC(F_i) = \mathrm{Conv}_{3\times 3}(F_i) \oplus \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{3\times 3}(F_i)\big) \oplus \mathrm{Conv}_{1\times 1}(F_i), \tag{1}$$
where $MC(\cdot)$ denotes the multi-scale convolution operation, $F_i$ represents the input feature map, $\mathrm{Conv}_{3\times 3}$ represents the 3 × 3 convolution layer with stride 1 and padding 1, $\mathrm{Conv}_{1\times 1}$ represents the 1 × 1 convolution layer with stride 1, and $\oplus$ indicates the concatenation operation.
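A small PyTorch sketch of the multi-scale convolution $MC(\cdot)$ in Equation (1); keeping the channel width of each branch equal to the input width is an assumption, so the concatenated output has three times the input channels.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """MC(.) of Equation (1): a 3x3 branch, a cascaded 3x3+3x3 branch (5x5 receptive
    field with fewer parameters), and a 1x1 branch, concatenated along channels."""
    def __init__(self, channels):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
        )
        self.branch1 = nn.Conv2d(channels, channels, 1, stride=1)

    def forward(self, x):
        return torch.cat([self.branch3(x), self.branch5(x), self.branch1(x)], dim=1)

x = torch.rand(1, 64, 56, 56)
print(MultiScaleConv(64)(x).shape)  # torch.Size([1, 192, 56, 56])
```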
Subsequently, the spatial attention module is utilized to extract prominent spatial features from two different features. The spatial attention module [42] improves the model’s performance by retaining critical information while ignoring unimportant regions. Specifically, we first perform max-pooling and average-pooling operations on the multi-scale convolutional feature maps to generate two 2D feature maps. We concatenate these two feature maps and learn spatial weights through a 5 × 5 convolution operation. Subsequently, the final spatial attention feature map is obtained by performing element-wise multiplication with the input. Finally, the two spatial attention feature maps are concatenated, as shown in Equation (2):
$$\begin{aligned}
F_M^R &= MC(F^R), \\
F_M^C &= MC(F^C), \\
F_W^R &= \mathrm{Conv}_{5\times 5}\big(\mathrm{MaxPool}(F_M^R) \oplus \mathrm{AvgPool}(F_M^R)\big), \\
F_W^C &= \mathrm{Conv}_{5\times 5}\big(\mathrm{MaxPool}(F_M^C) \oplus \mathrm{AvgPool}(F_M^C)\big), \\
F_S &= \big(F^R \otimes F_W^R\big) \oplus \big(F^C \otimes F_W^C\big),
\end{aligned} \tag{2}$$
where $F^R$ denotes the quality restoration feature maps and $F^C$ denotes the content feature maps, $\mathrm{Conv}_{5\times 5}$ represents the 5 × 5 convolution layer, $\mathrm{MaxPool}$ and $\mathrm{AvgPool}$ represent the max-pooling and average-pooling operations, respectively, $\oplus$ indicates the concatenation operation, and $\otimes$ indicates element-wise multiplication.
To further fuse the spatial attention feature maps, a channel attention module [43] is utilized to learn the importance of the concatenated feature maps across different channels to better capture the relationships between different channels. The concatenated feature maps are processed using global average pooling to obtain a one-dimensional vector, and then a fully connected layer is applied to learn the weights of each channel, generating a weight vector. The weight vector is multiplied with the concatenated feature maps to obtain a fully fused output feature, as shown in Equation (3):
$$\begin{aligned}
F_W^S &= \mathrm{GAP}(F_S), \\
W &= \mathrm{FC}(F_W^S), \\
F_M &= F_S \otimes W,
\end{aligned} \tag{3}$$
where $\mathrm{GAP}$ denotes global average pooling, $\mathrm{FC}$ denotes the fully connected layer, and $\otimes$ denotes element-wise multiplication.
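The two attention steps of Equations (2) and (3) can be sketched as follows in PyTorch (the preceding multi-scale convolution is omitted for brevity, and the sigmoid gating in both attention modules is an assumption):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Equation (2): concatenate channel-wise max- and average-pooled maps,
    learn a spatial weight map with a 5x5 convolution, and re-weight the input."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=5, padding=2)

    def forward(self, f):
        max_map, _ = f.max(dim=1, keepdim=True)   # (B, 1, H, W)
        avg_map = f.mean(dim=1, keepdim=True)     # (B, 1, H, W)
        weight = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return f * weight                         # element-wise multiplication

class ChannelAttention(nn.Module):
    """Equation (3): global average pooling and a fully connected layer
    produce per-channel weights for the concatenated feature."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, f):
        w = torch.sigmoid(self.fc(f.mean(dim=(2, 3))))  # (B, C)
        return f * w.unsqueeze(-1).unsqueeze(-1)        # broadcast over H and W

# Fuse restoration features F_R and content features F_C (assumed to share one shape).
f_r, f_c = torch.rand(2, 64, 56, 56), torch.rand(2, 64, 56, 56)
f_s = torch.cat([SpatialAttention()(f_r), SpatialAttention()(f_c)], dim=1)
f_m = ChannelAttention(128)(f_s)     # fully fused output feature F_M
print(f_m.shape)                     # torch.Size([2, 128, 56, 56])
```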

3.3. Network Training

The training stage is comprised of two parts: pre-training of the image restoration sub-network and overall model training. First, we pre-train the multitask image restoration sub-network with the auxiliary task of generating an SSIM map using the Waterloo Exploration Database [44]. We randomly crop 224 × 224 image patches from the distorted images to expand the training data, set the learning rate to 0.001 and the batch size to 64, and train for 100 epochs using the Adam [45] optimizer. The loss function is the $L_1$ loss between the generated SSIM map and the ground-truth SSIM map, as shown in Equation (4):
$$l_s = \frac{1}{N}\sum_{i=1}^{N}\left\| I_{SSIM}^{(i)} - \hat{I}_{SSIM}^{(i)} \right\|_{1}, \tag{4}$$
where $N$ denotes the number of training image patches, $I_{SSIM}^{(i)}$ denotes the SSIM map between the $i$th image patch and its corresponding reference image patch, $\hat{I}_{SSIM}^{(i)}$ denotes the SSIM map of the $i$th image patch predicted by the model, and $\|\cdot\|_{1}$ denotes the $l_1$-norm.
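For reference, the ground-truth SSIM map required by this auxiliary loss can be computed with scikit-image, as in the sketch below; whether the authors compute the map per channel or on grayscale is not stated, so averaging over the color channels here is an assumption.

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_map_target(reference_patch, distorted_patch):
    """Ground-truth SSIM map between a reference patch and its distorted patch
    (HxWx3 uint8 arrays), used as the target of the auxiliary task."""
    _, ssim_map = structural_similarity(
        reference_patch, distorted_patch,
        channel_axis=2, data_range=255, full=True,   # full=True returns the per-pixel map
    )
    return ssim_map.mean(axis=2).astype(np.float32)  # average color channels (assumption)

ref = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
dist = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
print(ssim_map_target(ref, dist).shape)  # (224, 224)
```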
After the training of the auxiliary task, the image restoration sub-network has learned the structural information in the reference images well. On this basis, the main task of generating pseudo-reference images can be trained. Training is conducted by randomly cropping 224 × 224 image patches from distorted images in the Waterloo Exploration Database. The learning rate is set to 0.0001, the batch size is set to 64, and the Adam optimizer is used to optimize the network for 50 epochs. The loss function is the $L_1$ loss between the pseudo-reference and the reference images, as shown in Equation (5):
$$l_r = \frac{1}{N}\sum_{i=1}^{N}\left\| I_r^{(i)} - \hat{I}_r^{(i)} \right\|_{1}, \tag{5}$$
where $N$ denotes the number of training image patches, $I_r^{(i)}$ denotes the $i$th reference image patch, $\hat{I}_r^{(i)}$ denotes the $i$th pseudo-reference image patch generated by the network, and $\|\cdot\|_{1}$ denotes the $l_1$-norm.
After training the image restoration sub-network, the entire model needs to be trained on the target database. During the first 10 epochs of training, the parameters of the image restoration sub-network are frozen, and only the score prediction sub-network is trained. Then, the entire network is finely trained for another 40 epochs. To perform data augmentation, following the strategy from [17,46], the images are randomly horizontally flipped during training, and 5 randomly sampled 224 × 224 image patches are extracted from each image to increase the number of training samples. The quality scores of the image patches are the same as those of the corresponding distorted images. During testing, 5 randomly sampled 224 × 224 image patches are also extracted from each testing image, and their quality scores are predicted. The mean of these scores is used as the quality score of the testing image.
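A short sketch of this patch-based protocol (the model interface, returning one score per patch, is an assumption):

```python
import torch

def random_patches(image, num_patches=5, size=224):
    """Randomly crop square patches from a CxHxW image tensor; during training,
    each patch inherits the quality score of its source image."""
    _, h, w = image.shape
    patches = []
    for _ in range(num_patches):
        top = torch.randint(0, h - size + 1, (1,)).item()
        left = torch.randint(0, w - size + 1, (1,)).item()
        patches.append(image[:, top:top + size, left:left + size])
    return torch.stack(patches)            # (num_patches, C, size, size)

def predict_image_score(model, image):
    """Test-time protocol: score 5 random 224x224 patches and average the results."""
    model.eval()
    with torch.no_grad():
        scores = model(random_patches(image))
    return scores.mean().item()
```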
The $L_1$ loss is used to train the entire model, as shown in Equation (6):
$$l = \frac{1}{N}\sum_{i=1}^{N}\left\| q_i - \hat{q}_i \right\|_{1}, \tag{6}$$
where $N$ denotes the number of image patches, $q_i$ denotes the score of the $i$th image patch predicted by the model, and $\hat{q}_i$ denotes the ground-truth score of the $i$th image patch.
The Adam optimizer is used to optimize the parameters with a weight decay rate of 5 × 10−4. The model is trained for 50 epochs with a batch size of 48 and an initial learning rate of 5 × 10−5. The learning rate is multiplied by 0.9 every 10 epochs during training. MT-IRN is implemented in PyTorch, and the experiments are conducted on an NVIDIA 3080 Ti GPU.
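The overall training schedule can be sketched as follows; `model`, its `restoration_subnet` attribute, and `train_loader` are placeholder names introduced for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

# Hyperparameters from the paper: Adam with weight decay 5e-4, initial lr 5e-5,
# lr multiplied by 0.9 every 10 epochs, 50 epochs in total, batch size 48.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

for epoch in range(50):
    # Freeze the image restoration sub-network for the first 10 epochs,
    # then fine-tune the entire network for the remaining 40 epochs.
    freeze = epoch < 10
    for p in model.restoration_subnet.parameters():
        p.requires_grad = not freeze

    for distorted, target_score in train_loader:
        optimizer.zero_grad()
        pred_score = model(distorted)
        loss = F.l1_loss(pred_score, target_score)  # Equation (6)
        loss.backward()
        optimizer.step()
    scheduler.step()
```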

4. Experiments

4.1. Databases and Experimental Protocols

To assess the performance of MT-IRN, comprehensive experiments are performed on both synthetic and authentic databases containing various types of distortions. The synthetically distorted databases include LIVE [47], CSIQ [48], TID2013 [49], and KADID-10k [50], while the authentically distorted databases include LIVEC [51] and KonIQ-10k [52]. The details of the databases are summarized in Table 2.
For synthetically distorted databases, 80% of images are used as a training set and the remaining 20% as a testing set, divided by the reference images, to avoid image content overlap. For authentically distorted databases, the training and testing sets are directly divided into 80% and 20% proportions. For each dataset, the random partitioning process is repeated 10 times according to the aforementioned rules, and the median of the results from the 10 experiments is taken as the final result. Spearman’s rank correlation coefficient (SROCC) and Pearson’s linear correlation coefficient (PLCC) are used to evaluate the performance of MT-IRN. The SROCC measures the monotonicity between the predicted scores and the ground truth scores, while the PLCC measures the linear correlation between them [10]. Both SROCC and PLCC have a range of [−1, 1], with a larger absolute value indicating better performance of the model.
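Both criteria are readily computed with SciPy; the sketch below uses made-up scores purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(predicted, ground_truth):
    """SROCC measures prediction monotonicity; PLCC measures linear correlation."""
    srocc = spearmanr(predicted, ground_truth)[0]
    plcc = pearsonr(predicted, ground_truth)[0]
    return srocc, plcc

pred = np.array([30.2, 55.1, 62.0, 80.5, 41.3])   # illustrative predicted scores
mos = np.array([28.0, 57.5, 60.1, 83.2, 45.0])    # illustrative subjective scores
print(evaluate(pred, mos))  # both values approach 1 for a well-performing model
```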

4.2. Performance on Individual Database

The experiments on individual databases utilize four synthetically distorted databases, namely LIVE, CSIQ, TID2013, and KADID, as well as two authentically distorted databases, LIVEC and KonIQ. The results are summarized in Table 3 and Table 4. The methods compared with MT-IRN include three traditional methods (PSNR, SSIM [2], and BRISQUE [8]), seven deep learning-based methods (IQA-CNN [20], BIECON [53], MEON [54], DIQaM-NR [21], HyperIQA [22], DB-CNN [23], and TS-CNN [55]), two GAN-based methods (RAN4IQA [16] and Hall-IQA [17]), and a visual compensation restoration-based method, namely VCRNet [33].
From the experimental results in Table 3 and Table 4, it can be observed that MT-IRN outperforms the traditional methods on all six databases. This is mainly due to the powerful learning ability of deep learning, which enables the model to extract richer features. Compared with deep learning-based methods, our method performs better than most methods on the synthetically distorted databases, except DB-CNN on the CSIQ dataset. On the authentically distorted databases, our method performs slightly lower than HyperIQA, but it still achieves better performance than the GAN-based RAN4IQA. In particular, on LIVEC, our method’s SROCC is approximately 45.9% higher than that of RAN4IQA. This is mainly because RAN4IQA is pre-trained on synthetically distorted databases, while our method’s score prediction sub-network uses ResNet-50, which is pre-trained on ImageNet and has a stronger ability to extract features from authentic distortions. Compared with the visual compensation restoration-based VCRNet, our method lags behind it on LIVE and CSIQ but maintains a leading performance on the other four databases. This is because our method not only uses quality restoration features but also utilizes the difference features between the distorted and pseudo-reference images. Additionally, the quality of the pseudo-reference image is improved by the multitask restoration network, which further enhances the model’s performance.

4.3. Performance on Individual Distortion Types

To compare the performance of MT-IRN with the state-of-the-art methods on specific types of distortions, experiments are conducted on LIVE, CSIQ and TID2013. The SROCC results of the experiments are summarized in Table 5.
From Table 5, it can be observed that MT-IRN performs well in the experiments conducted on individual distortion types of the LIVE database. MT-IRN utilizes a multitask image restoration sub-network, which generates higher-quality pseudo-reference images. It achieves the best performance among all compared methods on three distortion types, namely JP2K, JPEG, and FF, and obtains the second-best performance on the GB distortion. Although MT-IRN does not achieve a top-two result on the WN distortion, it still outperforms the RAN4IQA algorithm and other deep learning-based algorithms. Overall, MT-IRN demonstrates outstanding performance on all five distortion types in the LIVE database and exhibits consistent performance across distortion types without any apparent weaknesses. In particular, it excels in handling JP2K, JPEG, and FF distortions, surpassing all other methods. These results indicate that MT-IRN is not only effective in handling various distortion types but also robust and stable.
Regarding the CSIQ, MT-IRN achieves the best performance on JP2K, JPEG, and PN. Most methods struggle to achieve an SROCC of 0.900 on the CC distortion, whereas MT-IRN achieves an SROCC of 0.906, second only to VCRNet. Considering the individual distortion types on the CSIQ dataset, MT-IRN achieves the top two results on five out of the six distortion types, with three of them attaining the best performance, surpassing other methods. MT-IRN consistently achieves the top two performances in addressing the PN and CC distortion types, which are known to be challenging and not included in the Waterloo Exploration dataset.
For the TID2013 dataset, MT-IRN achieves the top two performances in 13 out of the 24 distortion types. Remarkably, even in complex distortion types such as NPN, BW, MS, CC, and CCS, where most methods struggle to produce satisfactory results, MT-IRN consistently delivers excellent outcomes. Specifically, MT-IRN achieves SROCCs of 0.596, 0.728, 0.542, 0.786, and 0.719 on these distortion types. In contrast, GAN-based RAN4IQA and Hall-IQA exhibit poor performance on these distortion types, failing to achieve an SROCC of 0.500. This is primarily attributed to the challenges faced by GAN-based methods in generating high-quality pseudo-reference images for severely distorted images, thereby affecting model performance. In contrast, MT-IRN not only extracts features from the pseudo-reference images but also leverages the image restoration features obtained during the restoration process. This enables the model to extract more useful prior information from both the image restoration features and the differential features when confronted with heavily distorted images. Additionally, the utilization of multiple scales of distortion features enhances the model’s performance.
Overall, MT-IRN achieves the top two performances on 22 out of 35 distortion types, outperforming other methods. This demonstrates good performance for specific distortion types, even when facing relatively complex distortions. This is mainly due to the multitask restoration network used in our method, which improves the quality of the generated pseudo-reference images through the mutual promotion of the two tasks. In addition, our method uses not only differential features but also quality restoration features, enabling the score prediction sub-network to make more comprehensive and accurate predictions of image quality scores.

4.4. Performance across Different Databases

Table 6 presents the SROCC results of the cross-database test on LIVE, CSIQ, TID2013, and LIVEC to test the generalization performance of MT-IRN and compare it with the state-of-the-art methods.
Overall, in 12 tests, MT-IRN achieves the top two performances in 11 of them, outperforming the other methods. When a cross-database test is conducted on the synthetically distorted databases of LIVE, CSIQ, and TID2013, most methods achieve good performance as the distortion types are relatively similar among the databases. However, TID2013 contains more complex distortion types, and the performance of the model will decline when it is tested on this dataset. Nevertheless, MT-IRN still achieves the highest SROCC, demonstrating its good generalization performance. When a cross-database test between synthetically and authentically distorted databases is conducted, many methods struggle to achieve good performance. However, MT-IRN achieves the top two performances in all the tests, surpassing other deep learning-based and GAN-based methods.

4.5. Ablation Experiments

To evaluate the impact of each module on the performance of MT-IRN, ablation experiments are conducted on the LIVE, CSIQ, and LIVEC databases, and the results are summarized in Table 7.
First, the score prediction sub-network with only distorted images as input is used as the baseline, and its performance is the worst, with SROCCs of 0.950, 0.894, and 0.820 on the three databases, respectively. Then, the single-task image quality restoration sub-network is added, and the image restoration features are directly concatenated with the multi-scale content features of the distorted images. At this point, the model is able to utilize some information from pseudo-reference images, resulting in an improvement in performance, with SROCC increasing by 0.008, 0.013, and 0.013, respectively. Next, the multitask image quality restoration sub-network is used, but only the image restoration features are used. The quality of the pseudo-reference images is improved, resulting in an improvement in model performance, with SROCC increasing by 0.003, 0.013, and 0.009, respectively. Then, the image difference feature is introduced, allowing the model to more fully utilize the information from the pseudo-reference images, resulting in further improvements in SROCCs of 0.004, 0.006, and 0.005, respectively. Finally, the multi-scale feature fusion module is introduced, allowing for the full fusion of image multi-scale content features and restoration features, and the model’s performance reaches its best, with SROCCs improved by 0.004, 0.008, and 0.005, respectively.
Figure 6 illustrates the performance of the model on three datasets as different modules are progressively integrated into the baseline. The figure demonstrates that as the multitask restoration network, difference feature, and multi-scale feature fusion module are successively incorporated into the model, its performance consistently improves. Notably, the inclusion of the multitask restoration network results in the most significant enhancement in the model’s performance.
In summary, the multitask image restoration sub-network, image restoration features, image difference feature, and multi-scale feature fusion module proposed in this paper can effectively improve the model’s performance, as evidenced by the results of the above experiments.

4.6. Performance of Image Restoration

To evaluate the performance of the image restoration subnetwork, image restoration experiments are conducted on the LIVE, CSIQ, and TID2013 datasets. The average PSNR and average SSIM of the distorted and pseudo-reference images are used to assess the restoration performance. The performance of single-task and multitask image restoration networks is tested separately, and the experimental results are summarized in Table 8.
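The two restoration metrics can be computed per image with scikit-image as sketched below (8-bit RGB inputs assumed) and then averaged over a dataset.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def restoration_quality(reference, restored):
    """PSNR and SSIM of a restored (pseudo-reference) image against its reference;
    averaging these over a dataset yields values of the kind reported in Table 8."""
    psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
    ssim = structural_similarity(reference, restored, channel_axis=2, data_range=255)
    return psnr, ssim

ref = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
noisy = np.clip(ref.astype(int) + np.random.randint(-5, 6, ref.shape), 0, 255).astype(np.uint8)
print(restoration_quality(ref, noisy))
```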
From Table 8, it can be observed that the average PSNR and average SSIM of the multitask-generated reference images are higher than those of the single-task-generated reference images and distorted images on all three datasets. This suggests that the multitask image restoration network exhibits superior quality restoration performance, and the introduction of multitasking effectively enhances the quality of generated reference images.
Figure 7 shows a comparison between single-task-generated pseudo-reference images and multitask-generated pseudo-reference images. From Figure 7, it can be visually observed that the multitask-generated pseudo-reference images exhibit significant advantages in terms of visual perceptual quality. Compared with the single-task-generated pseudo-reference images and distorted images, the image quality of the multitask-generated pseudo-reference images is closer to the reference image. This suggests that in the framework of multitask learning, the generation of reference images can better restore the visual quality and perceptual details of images, thereby improving the quality and usability of pseudo-reference images.

4.7. Computational Complexity and Cost

To validate the complexity and computational costs of MT-IRN, we compared its number of parameters and computation time with other methods, as shown in Table 9. It can be observed that MT-IRN has 35.30 M parameters, which is higher than that of other methods. This is primarily because we employed a pre-trained ResNet-50, which enhances the model’s feature extraction capability but also introduces more parameters. The average processing time of MT-IRN is 0.31 s, which is competitive compared to other methods and lower than the GAN-based Hall-IQA. In summary, MT-IRN demonstrates competitive performance in terms of complexity and computation time.

5. Conclusions

Inspired by the IGM, this paper proposes a novel NR-IQA method called MT-IRN. MT-IRN is comprised of a multitask image restoration sub-network and a score prediction sub-network. First, a multitask image restoration sub-network is employed to generate a higher-quality pseudo-reference image and extract quality restoration features. Second, a score prediction sub-network is used to extract the high-level features and difference features, and then fuse the multi-scale features with the quality restoration features in the image restoration sub-network. Finally, the high-level features, the fused multi-scale features, and the difference features are utilized for quality score prediction. Experimental results on commonly used datasets demonstrate that MT-IRN has achieved performance comparable to the state-of-the-art methods.
MT-IRN has achieved excellent performance, but there is still room for further improvement. For instance, MT-IRN is pre-trained on the Waterloo Exploration dataset, which only contains four types of distortions. This limitation can result in decreased performance when the model encounters other distortions. Additionally, the use of ResNet-50 as the backbone network enhances performance but also increases computational complexity. Therefore, future research may focus on constructing larger-scale datasets that encompass a broader range of distortion types to support model pre-training. Moreover, considering the utilization of other, more efficient models to reduce computational complexity while achieving similar performance is worth exploring. Furthermore, exploring the simultaneous utilization of both synthetic and authentic distortion datasets for pre-training is a meaningful research direction.

Author Contributions

Conceptualization, F.C.; methodology, F.C.; software, F.C.; validation, F.C.; formal analysis, F.C.; investigation, F.C.; resources, F.C.; data curation, F.C.; writing—original draft preparation, F.C.; writing—review and editing, H.F., H.Y. and Y.C.; visualization, F.C.; supervision, H.F., H.Y. and Y.C.; project administration, Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Stabilization Support Plan for Shenzhen Higher Education Institutions, grant number 20200812165210001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rehman, A.; Zeng, K.; Wang, Z. Display device-adapted video quality-of-experience assessment. In Human Vision and Electronic Imaging XX; SPIE: Washington, DC, USA, 2015; Volume 9394.
  2. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  3. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; IEEE: New York, NY, USA, 2003; Volume 2, pp. 1398–1402.
  4. Zhang, L.; Zhang, L.; Mou, X.; Zhang, D. FSIM: A feature similarity index for image quality assessment. IEEE Trans. Image Process. 2011, 20, 2378–2386.
  5. Liu, M.; Gu, K.; Zhai, G.; Le Callet, P.; Zhang, W. Perceptual reduced-reference visual quality assessment for contrast alteration. IEEE Trans. Broadcast. 2016, 63, 71–81.
  6. Wu, J.; Liu, Y.; Shi, G.; Lin, W. Saliency change based reduced reference image quality assessment. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; IEEE: New York, NY, USA, 2017; pp. 1–4.
  7. Moorthy, A.K.; Bovik, A.C. A two-step framework for constructing blind image quality indices. IEEE Signal Process. Lett. 2010, 17, 513–516.
  8. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708.
  9. Ghadiyaram, D.; Bovik, A.C. Perceptual quality prediction on authentically distorted images using a bag of features approach. J. Vis. 2017, 17, 32.
  10. Zhang, L.; Zhang, L.; Bovik, A.C. A feature-enriched completely blind image quality evaluator. IEEE Trans. Image Process. 2015, 24, 2579–2591.
  11. Wang, Z.; Bovik, A.C. Reduced- and no-reference image quality assessment. IEEE Signal Process. Mag. 2011, 28, 29–40.
  12. Friston, K.; Kilner, J.; Harrison, L. A free energy principle for the brain. J. Physiol. Paris 2006, 100, 70–87.
  13. Friston, K. The free-energy principle: A unified brain theory? Nat. Rev. Neurosci. 2010, 11, 127–138.
  14. Knill, D.C.; Pouget, A. The Bayesian brain: The role of uncertainty in neural coding and computation. Trends Neurosci. 2004, 27, 712–719.
  15. Xu, L.; Lin, W.; Ma, L.; Zhang, Y.; Fang, Y.; Ngan, K.N.; Yan, Y. Free-energy principle inspired video quality metric and its use in video coding. IEEE Trans. Multimed. 2016, 18, 590–602.
  16. Ren, H.; Chen, D.; Wang, Y. RAN4IQA: Restorative adversarial nets for no-reference image quality assessment. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–3 February 2018; Volume 32.
  17. Lin, K.Y.; Wang, G. Hallucinated-IQA: No-reference image quality assessment via adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 732–741.
  18. Joshi, P.; Prakash, S. Continuous wavelet transform based no-reference image quality assessment for blur and noise distortions. IEEE Access 2018, 6, 33871–33882.
  19. Li, L.; Yan, Y.; Lu, Z.; Wu, J.; Gu, K.; Wang, S. No-reference quality assessment of deblurred images based on natural scene statistics. IEEE Access 2017, 5, 2163–2171.
  20. Kang, L.; Ye, P.; Li, Y.; Doermann, D. Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1733–1740.
  21. Bosse, S.; Maniry, D.; Müller, K.R.; Wiegand, T.; Samek, W. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Process. 2017, 27, 206–219.
  22. Su, S.; Yan, Q.; Zhu, Y.; Zhang, C.; Ge, X.; Sun, J.; Zhang, Y. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3667–3676.
  23. Zhang, W.; Ma, K.; Yan, J.; Deng, D.; Wang, Z. Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Trans. Circuits Syst. Video Technol. 2018, 30, 36–47.
  24. Pan, D.; Shi, P.; Hou, M.; Ying, Z.; Fu, S.; Zhang, Y. Blind predicting similar quality map for image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6373–6382.
  25. Pan, Z.; Zhang, H.; Lei, J.; Fang, Y.; Shao, X.; Ling, N.; Kwong, S. DACNN: Blind image quality assessment via a distortion-aware convolutional neural network. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7518–7531.
  26. Liang, Z.; Lu, W.; Zheng, Y.; He, W.; Yang, J. The context effect for blind image quality assessment. Neurocomputing 2023, 521, 172–180.
  27. Zhou, M.; Lang, S.; Zhang, T.; Liao, X.; Shang, Z.; Xiang, T.; Fang, B. Attentional feature fusion for end-to-end blind image quality assessment. IEEE Trans. Broadcast. 2022, 69, 144–152.
  28. Li, F.; Zhang, Y.; Cosman, P.C. MMMNet: An end-to-end multi-task deep convolution neural network with multi-scale and multi-hierarchy fusion for blind image quality assessment. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4798–4811.
  29. Zhu, H.; Li, L.; Wu, J.; Dong, W.; Shi, G. MetaIQA: Deep meta-learning for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14143–14152.
  30. Zhu, H.; Li, L.; Wu, J.; Dong, W.; Shi, G. Generalizable no-reference image quality assessment via deep meta-learning. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1048–1060.
  31. Zhang, W.; Li, D.; Ma, C.; Zhai, G.; Yang, X.; Ma, K. Continual learning for blind image quality assessment. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2864–2878.
  32. Madhusudana, P.C.; Birkbeck, N.; Wang, Y.; Adsumilli, B.; Bovik, A.C. Image quality assessment using contrastive learning. IEEE Trans. Image Process. 2022, 31, 4149–4161.
  33. Pan, Z.; Yuan, F.; Lei, J.; Fang, Y.; Shao, X.; Kwong, S. VCRNet: Visual compensation restoration network for no-reference image quality assessment. IEEE Trans. Image Process. 2022, 31, 1613–1627.
  34. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III; Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
  35. Ren, W.; Liu, S.; Ma, L.; Xu, Q.; Xu, X.; Cao, X.; Yang, M.H. Low-light image enhancement via a deep hybrid network. IEEE Trans. Image Process. 2019, 28, 4364–4375.
  36. Pan, Z.; Yuan, F.; Lei, J.; Li, W.; Ling, N.; Kwong, S. MIEGAN: Mobile image enhancement via a multi-module cascade neural network. IEEE Trans. Multimed. 2021, 24, 519–533.
  37. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586–5609.
  38. Liu, S.; Johns, E.; Davison, A.J. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1871–1880.
  39. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 249–256.
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  41. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  42. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  43. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  44. Ma, K.; Duanmu, Z.; Wu, Q.; Wang, Z.; Yong, H.; Li, H.; Zhang, L. Waterloo Exploration Database: New challenges for image quality assessment models. IEEE Trans. Image Process. 2016, 26, 1004–1016.
  45. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  46. Kim, J.; Zeng, H.; Ghadiyaram, D.; Lee, S.; Zhang, L.; Bovik, A.C. Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment. IEEE Signal Process. Mag. 2017, 34, 130–141.
  47. Sheikh, H.R.; Sabir, M.F.; Bovik, A.C. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 2006, 15, 3440–3451.
  48. Larson, E.C.; Chandler, D.M. Most apparent distortion: Full-reference image quality assessment and the role of strategy. J. Electron. Imaging 2010, 19, 011006.
  49. Ponomarenko, N.; Jin, L.; Ieremeiev, O.; Lukin, V.; Egiazarian, K.; Astola, J.; Kuo, C.C.J. Image database TID2013: Peculiarities, results and perspectives. Signal Process. Image Commun. 2015, 30, 57–77.
  50. Lin, H.; Hosu, V.; Saupe, D. KADID-10k: A large-scale artificially distorted IQA database. In Proceedings of the 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), Berlin, Germany, 5–7 June 2019; IEEE: New York, NY, USA, 2019; pp. 1–3.
  51. Ghadiyaram, D.; Bovik, A.C. Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Process. 2015, 25, 372–387.
  52. Hosu, V.; Lin, H.; Sziranyi, T.; Saupe, D. KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Trans. Image Process. 2020, 29, 4041–4056.
  53. Kim, J.; Lee, S. Fully deep blind image quality predictor. IEEE J. Sel. Top. Signal Process. 2016, 11, 206–220.
  54. Ma, K.; Liu, W.; Zhang, K.; Duanmu, Z.; Wang, Z.; Zuo, W. End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 2017, 27, 1202–1213.
  55. Yan, Q.; Gong, D.; Zhang, Y. Two-stream convolutional networks for blind image quality assessment. IEEE Trans. Image Process. 2018, 28, 2200–2211.
  56. Kim, J.; Nguyen, A.D.; Lee, S. Deep CNN-based blind image quality predictor. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 11–24.
  57. Moorthy, A.K.; Bovik, A.C. Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE Trans. Image Process. 2011, 20, 3350–3364.
  58. Ye, P.; Kumar, J.; Kang, L.; Doermann, D. Unsupervised feature learning framework for no-reference image quality assessment. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: New York, NY, USA, 2012; pp. 1098–1105.
  59. Xu, J.; Ye, P.; Li, Q.; Du, H.; Liu, Y.; Doermann, D. Blind image quality assessment based on high order statistics aggregation. IEEE Trans. Image Process. 2016, 25, 4444–4457.
Figure 1. The architecture of the MT-IRN.
Figure 2. The architecture of the multitask image restoration sub-network.
Figure 3. The structures of the residual blocks.
Figure 4. The architecture of the score prediction sub-network.
Figure 5. The structure of the multi-scale feature fusion module.
Figure 6. Bar chart of the ablation experiments, where MTRN denotes the multitask restoration network, DF denotes the difference feature, and MSFFM denotes the multi-scale feature fusion module.
Figure 7. Comparison of pseudo-reference image quality. From top to bottom, each row represents the reference image, distorted image, single-task restoration image, and multitask restoration image, respectively.
Table 1. The structural parameters of the image restoration sub-network, where W × H × C denotes the width, height, and channels of the feature map, respectively.
Module | Layer | Input Size | Output Size
Encoder | E1: Conv3 × 3, s1 | W × H × 3 | W × H × 16
Encoder | E2: Conv3 × 3, s1; Conv3 × 3, s1 | W × H × 16 | W × H × 16
Encoder | E3: Conv3 × 3, s2; Conv3 × 3, s1 | W × H × 16 | W/2 × H/2 × 32
Encoder | E4: Conv3 × 3, s2; Conv3 × 3, s1 | W/2 × H/2 × 32 | W/4 × H/4 × 64
Encoder | E5: Conv3 × 3, s2; Conv3 × 3, s1 | W/4 × H/4 × 64 | W/8 × H/8 × 128
Encoder | E6: Conv3 × 3, s2; Conv3 × 3, s1 | W/8 × H/8 × 128 | W/16 × H/16 × 256
Decoder | D1: Deconv3 × 3, s2 | W/16 × H/16 × 256 | W/8 × H/8 × 128
Decoder | D2: Deconv3 × 3, s2 | W/8 × H/8 × 256 | W/4 × H/4 × 64
Decoder | D3: Deconv3 × 3, s2 | W/4 × H/4 × 128 | W/2 × H/2 × 32
Decoder | D4: Deconv3 × 3, s2 | W/2 × H/2 × 64 | W × H × 16
Decoder | D5: Deconv3 × 3, s1 | W × H × 16 | W × H × 3
Decoder | D6: Deconv3 × 3, s1 | W × H × 16 | W × H × 1
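For concreteness, the layer sizes in Table 1 can be expressed as the following PyTorch sketch. It is a minimal illustration rather than the authors' implementation: the class and function names are ours, activations and residual blocks are reduced to plain ReLUs, and the skip connections are inferred from the doubled input channels of D2, D3, and D4.

import torch
import torch.nn as nn


def enc_block(in_ch, out_ch, stride):
    # Two 3x3 convolutions; the first may downsample (stride 2), as in E2-E6 of Table 1.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
    )


def dec_block(in_ch, out_ch):
    # 3x3 transposed convolution with stride 2: doubles the spatial size (D1-D4 of Table 1).
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.ReLU(inplace=True),
    )


class RestorationSketch(nn.Module):
    """Encoder-decoder sketch following the feature-map sizes of Table 1."""

    def __init__(self):
        super().__init__()
        self.e1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=1, padding=1), nn.ReLU(inplace=True))
        self.e2 = enc_block(16, 16, stride=1)
        self.e3 = enc_block(16, 32, stride=2)
        self.e4 = enc_block(32, 64, stride=2)
        self.e5 = enc_block(64, 128, stride=2)
        self.e6 = enc_block(128, 256, stride=2)
        self.d1 = dec_block(256, 128)
        self.d2 = dec_block(256, 64)   # input: d1 (128) + e5 (128) channels
        self.d3 = dec_block(128, 32)   # input: d2 (64) + e4 (64) channels
        self.d4 = dec_block(64, 16)    # input: d3 (32) + e3 (32) channels
        self.d5 = nn.ConvTranspose2d(16, 3, 3, stride=1, padding=1)  # pseudo-reference image
        self.d6 = nn.ConvTranspose2d(16, 1, 3, stride=1, padding=1)  # SSIM map (auxiliary task)

    def forward(self, x):              # x: N x 3 x H x W, with H and W divisible by 16
        e2 = self.e2(self.e1(x))
        e3 = self.e3(e2)
        e4 = self.e4(e3)
        e5 = self.e5(e4)
        e6 = self.e6(e5)
        d1 = self.d1(e6)
        d2 = self.d2(torch.cat([d1, e5], dim=1))
        d3 = self.d3(torch.cat([d2, e4], dim=1))
        d4 = self.d4(torch.cat([d3, e3], dim=1))
        return self.d5(d4), self.d6(d4)  # (pseudo-reference, SSIM map)


# Quick shape check with a random 224 x 224 patch.
pseudo_ref, ssim_map = RestorationSketch()(torch.randn(1, 3, 224, 224))
print(pseudo_ref.shape, ssim_map.shape)  # [1, 3, 224, 224] and [1, 1, 224, 224]

The shape check at the bottom confirms that an input patch whose height and width are divisible by 16 yields a 3-channel pseudo-reference image and a 1-channel SSIM map of the same spatial size, matching the last two rows of Table 1.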
Table 2. Details of the IQA databases.
Database | Ref. Imgs | Dist. Imgs | Dist. Types | Score Type
LIVE [39] | 29 | 779 | 5 | DMOS
CSIQ [40] | 30 | 866 | 6 | DMOS
TID2013 [41] | 25 | 3000 | 24 | MOS
KADID-10k [42] | 81 | 10,125 | 25 | DMOS
LIVEC [43] | / | 1162 | / | MOS
KonIQ-10k [44] | / | 10,073 | / | MOS
Table 3. The SROCC and PLCC results on synthetically distorted databases. The top two results are shown in bold font.
Method | LIVE SROCC | LIVE PLCC | CSIQ SROCC | CSIQ PLCC | TID2013 SROCC | TID2013 PLCC | KADID SROCC | KADID PLCC
PSNR | 0.866 | 0.856 | 0.806 | 0.800 | 0.636 | 0.706 | 0.674 | 0.681
SSIM [2] | 0.913 | 0.931 | 0.876 | 0.861 | 0.637 | 0.691 | 0.783 | 0.780
BRISQUE [13] | 0.940 | 0.942 | 0.746 | 0.829 | 0.604 | 0.694 | 0.519 | 0.554
IQA-CNN [20] | 0.956 | 0.953 | 0.876 | 0.905 | 0.701 | 0.752 | 0.651 | 0.607
BIECON [53] | 0.961 | 0.962 | 0.825 | 0.838 | 0.717 | 0.762 | 0.685 | 0.691
MEON [54] | 0.943 | 0.954 | 0.839 | 0.850 | 0.828 | 0.811 | 0.813 | 0.822
DIQaM-NR [21] | 0.960 | 0.972 | 0.901 | 0.908 | 0.835 | 0.855 | 0.840 | 0.843
HyperIQA [22] | 0.962 | 0.966 | 0.923 | 0.942 | 0.840 | 0.858 | 0.852 | 0.845
DB-CNN [23] | 0.968 | 0.971 | 0.946 | 0.959 | 0.816 | 0.865 | 0.801 | 0.806
TS-CNN [55] | 0.969 | 0.978 | 0.892 | 0.905 | 0.779 | 0.784 | 0.745 | 0.744
RAN4IQA [16] | 0.961 | 0.962 | 0.914 | 0.938 | 0.820 | 0.859 | / | /
Hall-IQA [17] | 0.976 | 0.978 | 0.892 | 0.906 | 0.879 | 0.880 | / | /
VCRNet [33] | 0.973 | 0.974 | 0.943 | 0.955 | 0.846 | 0.846 | 0.850 | 0.857
MT-IRN | 0.969 | 0.970 | 0.928 | 0.943 | 0.852 | 0.877 | 0.877 | 0.878
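The SROCC and PLCC values reported in Tables 3-6 measure, respectively, the rank-order and linear agreement between predicted and subjective quality scores. A minimal sketch of how they are computed is given below; the score arrays are made-up placeholders, and the nonlinear logistic remapping that is commonly applied before PLCC in IQA evaluation is omitted.

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder predicted scores and subjective MOS values for five images.
predicted = np.array([72.1, 55.3, 88.0, 40.7, 63.2])
mos = np.array([70.0, 58.0, 90.0, 35.0, 60.0])

srocc, _ = spearmanr(predicted, mos)   # monotonic (rank-order) agreement
plcc, _ = pearsonr(predicted, mos)     # linear agreement
print(f"SROCC = {srocc:.3f}, PLCC = {plcc:.3f}")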
Table 4. The SROCC and PLCC results on authentically distorted databases. The top two results are shown in bold font.
Method | LIVEC SROCC | LIVEC PLCC | KonIQ SROCC | KonIQ PLCC
BRISQUE [13] | 0.607 | 0.585 | 0.673 | 0.692
IQA-CNN [20] | 0.516 | 0.536 | 0.655 | 0.671
BIECON [53] | 0.595 | 0.613 | 0.618 | 0.651
MEON [54] | 0.693 | 0.688 | 0.754 | 0.760
DIQaM-NR [21] | 0.606 | 0.601 | 0.722 | 0.736
HyperIQA [22] | 0.859 | 0.882 | 0.906 | 0.917
DB-CNN [23] | 0.851 | 0.869 | 0.875 | 0.884
TS-CNN [55] | 0.655 | 0.667 | 0.722 | 0.729
RAN4IQA [16] | 0.586 | 0.612 | 0.752 | 0.763
VCRNet [33] | 0.856 | 0.865 | 0.894 | 0.909
MT-IRN | 0.865 | 0.872 | 0.899 | 0.912
Table 5. The SROCC results of the individual distortion types on LIVE, CSIQ, and TID2013. The top two results are shown in bold font. “Count” refers to the number of times a method achieves the top two results.
Database | Dist. Type | IQA-CNN [20] | DIQA [56] | HyperIQA [22] | RAN4IQA [16] | Hall-IQA [17] | VCRNet [33] | MT-IRN
LIVE | JP2K | 0.936 | 0.961 | 0.949 | 0.958 | 0.969 | 0.975 | 0.977
LIVE | JPEG | 0.965 | 0.976 | 0.961 | 0.923 | 0.975 | 0.979 | 0.980
LIVE | WN | 0.974 | 0.986 | 0.982 | 0.973 | 0.992 | 0.988 | 0.985
LIVE | GB | 0.952 | 0.962 | 0.926 | 0.964 | 0.973 | 0.978 | 0.973
LIVE | FF | 0.906 | 0.912 | 0.934 | 0.893 | 0.953 | 0.962 | 0.965
CSIQ | JP2K | 0.930 | 0.927 | 0.960 | 0.927 | 0.924 | 0.962 | 0.963
CSIQ | JPEG | 0.915 | 0.931 | 0.934 | 0.904 | 0.933 | 0.956 | 0.958
CSIQ | WN | 0.919 | 0.835 | 0.927 | 0.923 | 0.942 | 0.939 | 0.934
CSIQ | GB | 0.918 | 0.870 | 0.915 | 0.889 | 0.901 | 0.950 | 0.942
CSIQ | PN | 0.900 | 0.893 | 0.931 | 0.844 | 0.842 | 0.899 | 0.946
CSIQ | CC | 0.786 | 0.718 | 0.874 | 0.860 | 0.861 | 0.919 | 0.906
TID2013 | AGN | 0.784 | 0.916 | 0.942 | 0.866 | 0.923 | 0.844 | 0.892
TID2013 | ANC | 0.758 | 0.755 | 0.916 | 0.753 | 0.880 | 0.785 | 0.768
TID2013 | SCN | 0.762 | 0.878 | 0.947 | 0.842 | 0.945 | 0.787 | 0.961
TID2013 | MN | 0.776 | 0.734 | 0.801 | 0.462 | 0.673 | 0.795 | 0.781
TID2013 | HFN | 0.816 | 0.939 | 0.955 | 0.908 | 0.955 | 0.942 | 0.894
TID2013 | IN | 0.807 | 0.844 | 0.855 | 0.855 | 0.810 | 0.876 | 0.892
TID2013 | QN | 0.616 | 0.858 | 0.726 | 0.849 | 0.831 | 0.847 | 0.875
TID2013 | GB | 0.921 | 0.920 | 0.969 | 0.833 | 0.832 | 0.906 | 0.899
TID2013 | DEN | 0.872 | 0.788 | 0.941 | 0.839 | 0.957 | 0.937 | 0.880
TID2013 | JPEG | 0.874 | 0.892 | 0.898 | 0.939 | 0.914 | 0.934 | 0.897
TID2013 | JP2K | 0.910 | 0.812 | 0.947 | 0.912 | 0.624 | 0.906 | 0.918
TID2013 | JGTE | 0.686 | 0.862 | 0.934 | 0.566 | 0.460 | 0.762 | 0.852
TID2013 | J2TE | 0.678 | 0.813 | 0.892 | 0.778 | 0.782 | 0.865 | 0.892
TID2013 | NPN | 0.286 | 0.160 | 0.808 | 0.234 | 0.664 | 0.457 | 0.596
TID2013 | BW | 0.219 | 0.408 | 0.361 | 0.339 | 0.122 | 0.601 | 0.728
TID2013 | MS | 0.565 | 0.300 | 0.374 | 0.135 | 0.182 | 0.509 | 0.542
TID2013 | CC | 0.182 | 0.447 | 0.753 | 0.578 | 0.376 | 0.595 | 0.786
TID2013 | CCS | 0.081 | 0.151 | 0.857 | 0.484 | 0.156 | 0.855 | 0.719
TID2013 | MGN | 0.644 | 0.904 | 0.899 | 0.787 | 0.850 | 0.845 | 0.900
TID2013 | CN | 0.534 | 0.656 | 0.960 | 0.819 | 0.614 | 0.804 | 0.840
TID2013 | LCNI | 0.810 | 0.830 | 0.897 | 0.895 | 0.852 | 0.816 | 0.913
TID2013 | ICQD | 0.272 | 0.937 | 0.901 | 0.822 | 0.911 | 0.945 | 0.867
TID2013 | CHA | 0.892 | 0.757 | 0.870 | 0.762 | 0.381 | 0.932 | 0.828
TID2013 | SSR | 0.910 | 0.909 | 0.910 | 0.917 | 0.616 | 0.948 | 0.922
Count | / | 2 | 3 | 17 | 1 | 8 | 18 | 22
Table 6. The SROCC results of the cross-database test. The top two results are shown in bold font.
Method | LIVE → CSIQ | LIVE → TID2013 | LIVE → LIVEC | CSIQ → LIVE | CSIQ → TID2013 | CSIQ → LIVEC
DIIVINE [57] | 0.582 | 0.373 | 0.300 | 0.815 | 0.419 | 0.366
CORNIA [58] | 0.620 | 0.382 | 0.431 | 0.843 | 0.331 | 0.393
HOSA [59] | 0.598 | 0.470 | 0.455 | 0.770 | 0.341 | 0.309
DB-CNN [23] | 0.758 | 0.524 | 0.567 | 0.877 | 0.540 | 0.452
RAN4IQA [16] | 0.632 | 0.462 | 0.157 | 0.806 | 0.471 | 0.116
Hall-IQA [17] | 0.668 | 0.486 | 0.126 | 0.833 | 0.491 | 0.107
VCRNet [33] | 0.768 | 0.502 | 0.615 | 0.886 | 0.542 | 0.463
MT-IRN | 0.783 | 0.565 | 0.600 | 0.892 | 0.573 | 0.467
Method | TID2013 → LIVE | TID2013 → CSIQ | TID2013 → LIVEC | LIVEC → LIVE | LIVEC → CSIQ | LIVEC → TID2013
DIIVINE [57] | 0.714 | 0.585 | 0.230 | 0.362 | 0.417 | 0.337
CORNIA [58] | 0.829 | 0.662 | 0.267 | 0.578 | 0.456 | 0.403
HOSA [59] | 0.844 | 0.609 | 0.253 | 0.537 | 0.336 | 0.399
DB-CNN [23] | 0.891 | 0.807 | 0.457 | 0.746 | 0.697 | 0.424
RAN4IQA [16] | 0.795 | 0.673 | 0.101 | 0.297 | 0.286 | 0.153
Hall-IQA [17] | 0.786 | 0.683 | 0.116 | - | - | -
VCRNet [33] | 0.822 | 0.721 | 0.307 | 0.746 | 0.566 | 0.416
MT-IRN | 0.897 | 0.739 | 0.375 | 0.758 | 0.546 | 0.419
Table 7. The SROCC results of the ablation experiments. The top results are shown in bold font.
Baseline
Single-Task Restoration Sub-Network
Multitask Restoration Sub-Network
Image Restoration Feature
Difference Feature
Multi-Scale Feature Fusion Module
LIVE | 0.950 | 0.958 | 0.961 | 0.965 | 0.969
CSIQ | 0.894 | 0.907 | 0.914 | 0.920 | 0.928
LIVEC | 0.820 | 0.833 | 0.842 | 0.847 | 0.852
Table 8. Average PSNR and SSIM of pseudo-reference images and distorted images. The top results are shown in bold font.
Images | LIVE PSNR | LIVE SSIM | CSIQ PSNR | CSIQ SSIM | TID2013 PSNR | TID2013 SSIM
Distorted Images | 27.499 | 0.715 | 27.433 | 0.770 | 26.864 | 0.773
Single-Task Pseudo-Reference Images | 28.707 | 0.732 | 29.238 | 0.788 | 27.992 | 0.816
Multitask Pseudo-Reference Images | 29.085 | 0.751 | 29.607 | 0.800 | 29.009 | 0.827
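The PSNR and SSIM values in Table 8 are computed between each distorted or pseudo-reference image and its pristine reference. A minimal sketch using scikit-image (version 0.19 or later for the channel_axis argument); the file names are placeholders:

from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder file names for an 8-bit reference image and a restored pseudo-reference.
reference = io.imread("reference.png")
restored = io.imread("pseudo_reference.png")

psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
ssim = structural_similarity(reference, restored, channel_axis=-1, data_range=255)
print(f"PSNR = {psnr:.3f} dB, SSIM = {ssim:.3f}")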
Table 9. Comparison of computation time and the number of parameters.
Method | Time (s) | Parameters (M)
IQA-CNN | 0.06 | 0.17
MEON | 0.18 | 4.28
DB-CNN | 0.25 | 9.19
MetaIQA | 0.56 | 33.18
Hall-IQA | 0.45 | 23.67
VCRNet | 0.28 | 16.66
MT-IRN | 0.31 | 35.30
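Comparable parameter counts and forward-pass timings can be obtained for any network with a few lines of PyTorch. The model below is a toy stand-in, and the measured time naturally depends on the hardware and input size.

import time
import torch
import torch.nn as nn

# Toy stand-in model; replace with the network to be profiled.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1)
).eval()
x = torch.randn(1, 3, 224, 224)

params_m = sum(p.numel() for p in model.parameters()) / 1e6   # parameter count in millions
with torch.no_grad():
    start = time.perf_counter()
    model(x)
    elapsed = time.perf_counter() - start
print(f"Parameters: {params_m:.2f} M, forward time: {elapsed:.3f} s")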