Article

YOLOv6-ESG: A Lightweight Seafood Detection Method

1 College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
2 Key Laboratory of Fishery Information, Ministry of Agriculture, Shanghai 201306, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2023, 11(8), 1623; https://doi.org/10.3390/jmse11081623
Submission received: 10 July 2023 / Revised: 17 August 2023 / Accepted: 18 August 2023 / Published: 20 August 2023
(This article belongs to the Section Ocean Engineering)

Abstract:
The rapid development of convolutional neural networks has significant implications for automated underwater fishing operations. Among these, object detection algorithms based on underwater robots have become a hot topic in both academic and applied research. Due to the complexity of underwater imaging environments, many studies have employed large network structures to enhance the model’s detection accuracy. However, such models contain many parameters and consume substantial memory, making them less suitable for small devices with limited memory and computing capabilities. To address these issues, a YOLOv6-based lightweight underwater object detection model, YOLOv6-ESG, is proposed to detect seafood, such as echinus, holothurian, starfish, and scallop. First, a more lightweight backbone network is designed by rebuilding the EfficientNetv2 with a lightweight ODConv module to reduce the number of parameters and floating-point operations. Then, this study improves the neck layer using lightweight GSConv and VoVGSCSP modules to enhance the network’s ability to detect small objects. Meanwhile, to improve the detection accuracy of small underwater objects with poor image quality and low resolution, the SPD-Conv module is also integrated into the two parts of the model. Finally, the Adan optimizer is utilized to speed up model convergence and further improve detection accuracy. To address the issue of interference objects in the URPC2022 dataset, data cleaning has been conducted, followed by experiments on the cleaned dataset. The proposed model achieves 86.6% mAP while the detection speed (batch size = 1) reaches 50.66 FPS. Compared to YOLOv6, the proposed model not only maintains almost the same level of detection accuracy but also achieves faster detection speed. Moreover, the number of parameters and floating-point operations reaches the minimum levels, with reductions of 75.44% and 79.64%, respectively. These results indicate the feasibility of the proposed model in the application of underwater detection tasks.

1. Introduction

With the booming development of computer vision, underwater object detection technology based on optical images is widely used and plays a significant role in marine fisheries [1], aquaculture, marine pollution protection, underwater unexploded ordnance detection [2], and so on. In the field of marine fisheries, most traditional seafood collection methods rely on manual diving, which is not only inefficient but also requires workers to have sufficient diving and fishing experience. Because long-term underwater operations are hazardous, the available labor force has shrunk, leading to a continuous increase in the cost of manual fishing operations. On the other hand, underwater scenes are inherently more complex than land scenes, so the images obtained by underwater cameras tend to have lower quality. Fishing operations therefore face great challenges due to time constraints and the impact of the marine environment. Consequently, the development of underwater autonomous fishing equipment is of great significance for replacing traditional manual work.
Nowadays, automated operations are gradually becoming a popular research field, and underwater object detection algorithms based on underwater robots have become a hot topic in academic and applied research. Xu et al. [3] applied the object detection algorithm DSOD [4] to underwater scenes. The model was trained on an underwater image dataset to detect seafood such as holothurian, echinus, and scallop, and the robot was trained to catch seafood autonomously with 81.2% mAP, but the detection speed was slow, reaching only 17 FPS. Due to the limitations of the underwater environment, most underwater images exhibit blue-green tones, blurring, and low contrast, which pose great challenges to the detection and recognition capabilities of underwater robots. Improving the accuracy and speed of underwater object detection algorithms remains a thorny problem.
In recent years, object detection based on deep learning has become popular in the field of computer vision and is now gradually extending to the study of underwater scenes, which can be divided into one-stage and two-stage object detection frameworks. The two-stage algorithm first generates candidate regions of interest and then classifies them using convolutional neural networks (CNNs). Representative algorithms include Faster R-CNN [5], Cascade R-CNN [6], Mask R-CNN [7], etc. Liu et al. [8] introduced an optimized Faster R-CNN algorithm. This enhancement improved the feature extraction capability of the detection network, thereby enhancing the localization and recognition performance of small underwater objects such as holothurian, echinus, starfish, and scallops. Consequently, detection accuracy was improved. Similarly, Lin et al. [9] made improvements to the Cascade R-CNN algorithm to detect four types of underwater objects, including holothurian. With the same parameters, the method significantly improved the object detection accuracy compared to the basic Fast R-CNN [10]. However, the two-stage algorithm has a huge number of parameters and floating-point operations, and the detection speed is relatively slow.
The one-stage object detection algorithm has a simpler structure and faster speed, and the performance of recent models such as the YOLO series [11,12,13,14,15] is comparable to that of two-stage ones. Such algorithms usually identify the category and location of objects directly from the extracted network features. Recently, improved underwater object detection algorithms based on the YOLO series have been widely studied. In 2020, Wang et al. [16] utilized the YOLO model to train and predict on a clownfish dataset. By applying 3D rotations and scaling to objects in different backgrounds, the number of underwater images was expanded by over 1000-fold. The results revealed that the model trained with this data augmentation exhibited a significant enhancement in detection accuracy, rising from 56.7% to 70%. In 2021, Chou et al. [17] proposed an autonomous underwater vehicle (AUV) integrating YOLOv3-tiny to detect working divers. Diver-following experiments were conducted under surge-yaw following and heave-yaw following scenes, resulting in detection accuracies of 90.39% and 78.29%, respectively. In the same year, Shao et al. [18] fused shallow features with deep features based on the original YOLOv3 to improve the network’s ability to detect four types of small underwater objects, including mines, pipelines, base arrays, and submerged marks. This improved algorithm demonstrated 97% mAP on underwater sonar images, but it was not evaluated on underwater optical images. In 2022, Shi et al. [19] added a convolutional block attention module (CBAM) to the backbone network of YOLOv4 and optimized the PANet network to improve its feature extraction and feature fusion ability. They also combined the data augmentation method PredMix (prediction-mix) to better handle overlapping and occluded underwater seafood and enhance the robustness of the model. The detection accuracy of the model on the URPC2018 dataset increased to 78.39%. Liu et al. [20] also added a CBAM module to the backbone feature extraction network of YOLOv5 and evaluated it on the URPC2021 dataset, which improved the recognition performance of the model to 79.2% mAP. Although the above-mentioned models for detecting marine organisms have improved accuracy to some extent, the overall precision remains relatively low, not exceeding 80%.
In 2023, Zhang et al. [21] proposed an optimized underwater object detection model based on the YOLOv4 algorithm. Expanding upon the original network, they introduced an additional prediction head to facilitate the detection of objects of different sizes. Additionally, they integrated a channel attention mechanism into the network. Furthermore, the K-means++ algorithm was applied to cluster anchor boxes, and different activation functions were used to improve the model’s performance. Through these integrated modules, the detection accuracy was elevated to 91.1%. The enhancement came at the cost of a larger model size, which reached 182.7 MB. In the same year, Liu et al. [22] introduced a model based on YOLOv7 in this field. This model utilized the ACmixBlock module to replace the 3 × 3 convolutional block of the E-ELAN structure in the base model, while integrating skip connections and 1 × 1 convolutional blocks to enhance feature extraction capabilities. Simultaneously, they introduced the ResNet-ACmix module to prevent feature information loss and incorporated the Global Attention Mechanism (GAM) module to further amplify feature extraction. They also employed the K-means++ algorithm to obtain anchor boxes. This series of enhancements led to the model demonstrating 89.6% mAP on the URPC2021 dataset and 97.4% mAP on the Brackish dataset, but it also increased the number of model parameters to 177.08 M. These two recent YOLO-based studies have greatly improved the detection performance on four types of seafood: holothurian, echinus, starfish, and scallop. On the other hand, they have also made the network architecture more complex, increasing both the number of model parameters and the floating-point operations.
Most previous studies have mainly concentrated on incorporating extra feature extraction modules to improve detection accuracy. However, such networks tend to have a larger number of parameters and slow detection speeds, which makes them challenging to deploy on embedded and mobile devices. Therefore, reducing model complexity and speeding up detection while maintaining accuracy has gained extensive attention. In 2021, Zhang et al. [23] utilized the lightweight MobileNetV2 and the depthwise separable convolution method to enhance the backbone network of YOLOv4, combined with an attention feature fusion module. This approach achieved a favorable balance between detection time and accuracy on the PASCAL VOC dataset and the Brackish dataset, and it achieved 79.54% mAP on the URPC2020 dataset. In 2022, Yeh et al. [24] proposed a deep model for jointly learning color conversion and object detection in underwater images. They first converted images into grayscale to counter underwater color absorption and to improve subsequent detection performance with lower computational complexity. Their dataset primarily included three classes of objects: fish, debris, and divers. An improved model based on the Feature Pyramid Network (FPN) was utilized, achieving 89.56% mAP. On the Brackish dataset, the detection accuracy reached 80.12%, with a computational complexity of only 5.06 GFLOPs, allowing the model to be easily deployed on small-scale computing devices. However, testing results on the URPC dataset were not reported. Han et al. [25] used the CenterNet model with the EfficientNet-B3 network as the backbone to reduce the model’s parameters. They also combined it with a scene feature fusion method, which effectively improved detection accuracy on the holothurian subset of URPC, although it was not verified on other types of seafood. Wang et al. [26] proposed an improved lightweight underwater object detection method based on the YOLOX model. They combined BIFPN-S and FPN and effectively fused them with the features obtained from the backbone layer. The model was evaluated on the URPC dataset, and its detection accuracy increased to 82.69%. Similarly, in 2023, Shi et al. [27] incorporated ShuffleNet and attention mechanisms into the backbone network of YOLOv4 to reduce the number of model parameters, using deep convolution to decrease the model size and the RFB-s module in the neck layer. The results on the holothurian dataset showed that these improvements reduced the model size to 49.2 M, but the detection accuracy also decreased from 93.12% to 92.01%; other types of seafood were not validated either. Generally, optimizations for lightweight models focus on improving the backbone and neck layers. One of the most common approaches is to replace the original large backbone network with a lightweight one, which can effectively reduce the number of model parameters and floating-point operations, although this usually comes with some loss of accuracy. Thus, other methods are further tailored to the dataset and object characteristics to retain or enhance the model’s detection accuracy. These efforts aim to make the proposed models easier to deploy on small devices, such as underwater robots.
This paper builds upon the basic model structure of the recent YOLOv6 [28] and optimizes its backbone and neck layers to better suit the characteristics of underwater datasets, which is named YOLOv6-ESG. The main contributions of this paper are as follows:
  • This paper employs a more efficient lightweight network, EfficientNetv2 [29], and further integrates it with lightweight convolution (ODConv [30]) to rebuild the backbone network. This approach significantly reduces the number of parameters and floating-point operations of the model.
  • In response to the problems of poor underwater image quality, low resolution, and difficulty in detecting small seafood, the SPD-Conv [31] (space-to-depth and non-strided convolution) module is utilized to further improve the detection accuracy of underwater objects.
  • This paper incorporates the lightweight GSConv [32] and VoVGSCSP [32] modules as basic building blocks for the neck layer of YOLOv6-ESG. The experimental results demonstrate that this approach can effectively balance the accuracy and speed of the model, highlighting the effectiveness of these modules in the proposed model.
  • During the training phase, a more efficient Adan [33] optimizer is used, which requires only half of the computing resources to achieve the optimal performance of the current model. This approach can further improve the detection accuracy of the model under the same computing resources.
  • Through the analysis of experimental results before and after model improvement, as well as comparisons with some of the current mainstream object detection algorithms, the effectiveness of the proposed model has been verified.

2. The YOLOv6 Model

The YOLOv6 model is a one-stage object detection model proposed by the Meituan Visual Intelligence Department in 2022. Compared with the previous YOLO series, it has certain advantages in detection accuracy and inference efficiency. It builds on the previous YOLO series networks by redesigning the backbone and neck layers, as well as modifying the head layer. Along with the optimization of the network structure, the model also adopts a more streamlined anchor-free detection method and the SimOTA label allocation strategy during training. The YOLOv6 family consists of five basic models: YOLOv6s, YOLOv6t, YOLOv6m, YOLOv6n, and YOLOv6l. Considering the detection performance on the dataset studied in this paper, the YOLOv6l model is selected as the basis for optimization.
The basic structure of YOLOv6l includes input layer, backbone layer, neck layer, and head layer. The input layer of YOLOv6l uses an input image resolution of 640 × 640 × 3 (with R, G, and B channels) and performs mosaic and mix-up data augmentation techniques to the original input image. This helps create more balanced object samples for underwater images. In the backbone layer, YOLOv6l adopts a redesigned and more efficient CSP (Cross Stage Partial) module based on the previous YOLO series, known as the CSPStackRep Block. Also known as the BepC3 module, this module contains three 1 × 1 convolutions and N/2 double RepVGG blocks, with additional features including residual connections and concatenation operations. CSP connections are employed within this module to enhance performance without excessive computational costs. Compared with CSPRepResStage [34], it is more compact and considers the balance between accuracy and speed. The neck layer of YOLOv6l follows the PAN (Path Aggregation Network) topology used in previous models like YOLOv5 but replaces the CSPBlock with the CSPStackRep Block. The width and depth of the block are adjusted accordingly to obtain the Rep-PAN structure in YOLOv6l, which enhances the feature extraction capability of the neck layer. In terms of the head layer, YOLOv6l still follows more of the structure in the previous YOLO series. But it uses a mixed-channel strategy to build the detection head, which further reduces computing costs and makes it more efficient.
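To make the BepC3 (CSPStackRep) structure described above more concrete, the following PyTorch sketch shows a simplified version of the block: two 1 × 1 convolutions split the features, a stack of N/2 double RepVGG units processes one branch, and a third 1 × 1 convolution fuses the concatenated result. This is a minimal illustration written for this article, not the official YOLOv6 implementation; channel counts, activations, and the omission of the re-parameterized inference form are simplifications.

```python
import torch
import torch.nn as nn

class RepVGGBlock(nn.Module):
    """Simplified RepVGG block (training form): parallel 3x3, 1x1, and identity branches."""
    def __init__(self, channels):
        super().__init__()
        self.conv3x3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                     nn.BatchNorm2d(channels))
        self.conv1x1 = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                     nn.BatchNorm2d(channels))
        self.bn_identity = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv3x3(x) + self.conv1x1(x) + self.bn_identity(x))

class CSPStackRep(nn.Module):
    """Simplified CSPStackRep (BepC3): three 1x1 convs plus a stack of N/2 double RepVGG units
    on one branch, concatenated with a CSP shortcut branch."""
    def __init__(self, in_ch, out_ch, n=4):
        super().__init__()
        hidden = out_ch // 2
        self.cv1 = nn.Conv2d(in_ch, hidden, 1)
        self.cv2 = nn.Conv2d(in_ch, hidden, 1)
        self.cv3 = nn.Conv2d(2 * hidden, out_ch, 1)
        self.blocks = nn.Sequential(*[
            nn.Sequential(RepVGGBlock(hidden), RepVGGBlock(hidden))  # one "double" RepVGG unit
            for _ in range(max(n // 2, 1))
        ])

    def forward(self, x):
        return self.cv3(torch.cat((self.blocks(self.cv1(x)), self.cv2(x)), dim=1))
```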

3. Materials and Methods

This study focuses on optimizing the YOLOv6l model for application in underwater equipment with limited memory. The resulting YOLOv6-ESG model mainly improves the backbone and neck layers, while the structure of the head layer is preserved, as shown in Figure 1. Firstly, the cleaned dataset is used as the input, and then mosaic data augmentation is applied. The processed images are then resized and fed into the OD-E2 backbone layer, which extracts the input images’ features at three different scales. These features are subsequently transferred to the neck layer for further feature extraction. Multiscale feature fusion is then performed after upsampling or downsampling to obtain new fusion features of three different scales. Finally, the fused features are passed to the network’s head layer for prediction, where decoupled heads are employed to perform classification and regression prediction, enabling the detection of objects of various sizes.

3.1. OD-E2 Backbone Layer

The backbone network of YOLOv6 has strong feature extraction capabilities, but its complex structure and huge number of parameters have a certain impact on detection speed. Therefore, this paper compared several popular lightweight backbone networks and chose the EfficientNetv2 network for further improvement. A more lightweight network, named OD-E2, is designed as the backbone of YOLOv6-ESG by integrating EfficientNetv2 with the ODConv and SPD-Conv modules.

3.1.1. EfficientNetv2

EfficientNetv2 is a convolutional neural network proposed by the Google Brain team in 2021. It is an improvement of the previous EfficientNet [35] network, with the goal of improving accuracy and efficiency. EfficientNetv2 introduces the FusedMBConv [36] module based on the previous network, which replaces the 3 × 3 depthwise convolution and 1 × 1 expansion convolution in MBConv [37] with a 3 × 3 regular convolution (Conv 3 × 3), as shown in Figure 2a.
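The difference between the two blocks can be illustrated with a short PyTorch sketch. This is a simplified version written for this article: the squeeze-and-excitation attention and stochastic depth used in the original EfficientNetv2 blocks are omitted, and the expansion ratio is illustrative.

```python
import torch.nn as nn

class MBConv(nn.Module):
    """Simplified MBConv: 1x1 expansion -> 3x3 depthwise conv -> 1x1 projection."""
    def __init__(self, in_ch, out_ch, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return out + x if out.shape == x.shape else out  # residual when shapes match

class FusedMBConv(nn.Module):
    """Simplified FusedMBConv: the 1x1 expansion and 3x3 depthwise conv are fused into one regular 3x3 conv."""
    def __init__(self, in_ch, out_ch, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return out + x if out.shape == x.shape else out
```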

3.1.2. OD-FusedMBConv

The ODConv (Omni-dimensional Dynamic Convolution), as shown in Figure 2b, was proposed by Intel Labs in 2022. As a “plug-and-play” convolution, it can be easily embedded into existing CNN networks. ODConv utilizes a novel multidimensional attention mechanism and a parallel strategy to learn the attention of convolutional kernels along four dimensions of the kernel space at any convolutional layer. Therefore, it can use fewer convolution kernels to greatly improve the feature extraction ability of convolution. In this study, the ODConv module is used to replace the normal 3 × 3 convolution in the FusedMBConv module, further reducing the model’s computational complexity and decreasing memory access overhead. The improved module is called OD-FusedMBConv, as shown in Figure 2c.
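A simplified sketch of the idea behind ODConv is given below. It is not the official Intel Labs implementation: the attention branches, reduction ratio, and number of candidate kernels are illustrative assumptions, and details such as temperature annealing are omitted. The sketch only shows how attentions along the four kernel-space dimensions can modulate a small bank of convolution kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2d(nn.Module):
    """Simplified omni-dimensional dynamic convolution: four attentions (input channel,
    output channel, spatial position, kernel index) predicted from global context
    modulate a bank of candidate kernels before a per-sample convolution."""
    def __init__(self, in_ch, out_ch, k=3, num_kernels=4, reduction=16):
        super().__init__()
        self.k, self.num_kernels = k, num_kernels
        self.weight = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.01)
        hidden = max(in_ch // reduction, 8)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Conv2d(in_ch, hidden, 1), nn.ReLU())
        self.att_in = nn.Conv2d(hidden, in_ch, 1)             # input-channel attention
        self.att_out = nn.Conv2d(hidden, out_ch, 1)           # output-channel attention
        self.att_spatial = nn.Conv2d(hidden, k * k, 1)        # spatial (kernel position) attention
        self.att_kernel = nn.Conv2d(hidden, num_kernels, 1)   # kernel-wise attention

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = self.fc(self.gap(x))                                               # (b, hidden, 1, 1)
        a_in = torch.sigmoid(self.att_in(ctx)).view(b, 1, 1, c, 1, 1)
        a_out = torch.sigmoid(self.att_out(ctx)).view(b, 1, -1, 1, 1, 1)
        a_sp = torch.sigmoid(self.att_spatial(ctx)).view(b, 1, 1, 1, self.k, self.k)
        a_k = torch.softmax(self.att_kernel(ctx).view(b, -1), dim=1).view(b, self.num_kernels, 1, 1, 1, 1)
        # Aggregate the kernel bank with all four attentions, one kernel set per sample.
        w_agg = (self.weight.unsqueeze(0) * a_in * a_out * a_sp * a_k).sum(dim=1)  # (b, out, in, k, k)
        # Grouped-convolution trick: fold the batch into groups so each sample uses its own kernel.
        x = x.reshape(1, b * c, h, w)
        w_agg = w_agg.reshape(b * w_agg.shape[1], c, self.k, self.k)
        out = F.conv2d(x, w_agg, padding=self.k // 2, groups=b)
        return out.reshape(b, -1, h, w)
```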

3.1.3. SPD-Conv

In general, images captured in land-based scenes have good resolution and moderate object sizes. In such cases, object detection models employ designs such as stride convolution and pooling layers to skip redundant pixel information while still being able to learn object features effectively. However, in the challenging task of detecting small and blurry underwater objects, the assumption of redundant information becomes invalid, leading to the loss of fine-grained image details and insufficient learning of object features, resulting in decreased detection performance. To address this issue, the SPD-Conv module is introduced to solve the problem of low-resolution images and small objects.
The SPD-Conv module structure is shown in Figure 3. It was proposed in 2022 and consists of a space-to-depth (SPD) layer and a non-strided convolution layer. The SPD layer downsamples the extracted feature map but retains all pixel information in the channel dimension, thereby avoiding information loss. After each SPD layer, a non-strided convolution is added to reduce the number of channels using learnable parameters. By incorporating this module into our model, we can enhance the feature extraction capability for small underwater objects and further improve the model detection accuracy.
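A minimal PyTorch sketch of the SPD-Conv idea (scale factor 2) is shown below. It is written for illustration rather than being the authors' original implementation; the kernel size, normalization, and activation of the non-strided convolution are assumptions.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Simplified SPD-Conv: a space-to-depth rearrangement (scale 2) followed by a
    non-strided convolution, so downsampling keeps all pixel information in the channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # After space-to-depth the channel count is 4x; a stride-1 conv then reduces it.
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        # Space-to-depth: split each 2x2 neighbourhood across the channel dimension.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)
```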
Based on EfficientNetv2, the OD-E2 backbone network is proposed to further reduce model computational complexity. Table 1 presents the architecture of the OD-E2 network. There are several main differences compared to the EfficientNetv2 network: First, the OD-FusedMBConv module is proposed to further lighten the basic EfficientNetv2. Then, the SPD-Conv module is introduced to enhance the model’s feature extraction ability for low-resolution underwater images and small objects. Furthermore, the last two feature extraction stages of the original EfficientNetv2 are completely removed, which suits subsequent processing and reduces memory access overhead. Meanwhile, the final classification layer is removed so that the network serves only as the backbone feature extractor.

3.2. The Integrated Neck Layer

In this study, the GSConv module is introduced to replace the SimConv module in the neck layer of the original model. Meanwhile, the VoVGSCSP module replaces the original BepC3 module, and the SPD-Conv module is also integrated into the neck layer to enhance feature extraction. This section mainly describes the lightweight GSConv and VoVGSCSP modules.
Common lightweight designs mainly reduce the number of parameters and floating-point operations (FLOPs, the number of multiply-adds) using depthwise separable convolution (DWConv). However, during computation, DWConv separates the channel information of the input image. This defect makes the feature extraction and fusion capability of DWConv much lower than that of vanilla convolution. As shown in Figure 4, GSConv is a mixed convolution that combines the advantages of vanilla convolution, DWConv, and a channel shuffle. Specifically, through the shuffle, the information generated by the vanilla convolution permeates every part of the information generated by DWConv. This operation evenly exchanges local feature information across different channels, allowing information from the vanilla convolution to be fully mixed into the output of DWConv. The design of GSConv aims to make the output of the convolutional computation as close as possible to that of vanilla convolution while reducing computational cost. Based on GSConv, the GS bottleneck is introduced, and the one-shot aggregation method is then used to design the cross-stage partial network (GSCSP) module, named VoVGSCSP.
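The following PyTorch sketch illustrates the GSConv computation described above. It is an illustrative simplification rather than the original slim-neck implementation; the kernel sizes, activation, and the even split of output channels are assumptions.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Simplified GSConv: a vanilla conv produces half of the output channels, a depthwise
    conv produces the other half, and a channel shuffle mixes the two halves."""
    def __init__(self, in_ch, out_ch, k=1, s=1):
        super().__init__()
        half = out_ch // 2
        self.dense = nn.Sequential(nn.Conv2d(in_ch, half, k, s, k // 2, bias=False),
                                   nn.BatchNorm2d(half), nn.SiLU())
        self.dw = nn.Sequential(nn.Conv2d(half, half, 5, 1, 2, groups=half, bias=False),
                                nn.BatchNorm2d(half), nn.SiLU())

    def forward(self, x):
        x1 = self.dense(x)               # vanilla convolution branch
        x2 = self.dw(x1)                 # depthwise convolution branch
        y = torch.cat((x1, x2), dim=1)   # (b, out_ch, h, w)
        # Channel shuffle: interleave the two halves so dense information permeates the DW output.
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```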
These improvements not only reduce the computational complexity and inference time of the detector but also maintain detection accuracy.

3.3. Adan Optimizer

Based on the above model improvements, a deep model optimizer called Adan is introduced for training the YOLOv6-ESG model. It was jointly proposed by the research teams of the Singapore Sea AI Lab (SAIL) and the Peking University ZERO Lab in 2022. Under the same computing resources, Adan can effectively improve the detection accuracy of the model and converges faster than the previously used SGD optimizer.
By combining the adapted Nesterov [38] momentum and adaptive optimization algorithms and introducing decoupled weight decay, the Adan optimizer is obtained. By using extrapolation points, Adan can anticipate the surrounding gradient information in advance, which efficiently helps to avoid sharp local minimum regions and increase the model’s generalization. The calculation method is shown in Formula (1):
$$
\begin{aligned}
m_k &= (1-\beta_1)\, m_{k-1} + \beta_1 g_k \\
v_k &= (1-\beta_2)\, v_{k-1} + \beta_2 (g_k - g_{k-1}) \\
n_k &= (1-\beta_3)\, n_{k-1} + \beta_3 \left[ g_k + (1-\beta_2)(g_k - g_{k-1}) \right]^2 \\
\alpha_k &= \alpha \Big/ \left( \sqrt{n_k} + \varepsilon \right) \\
\theta_{k+1} &= (1 + \lambda_k \alpha)^{-1} \left[ \theta_k - \alpha_k \circ \left( m_k + (1-\beta_2)\, v_k \right) \right]
\end{aligned} \tag{1}
$$
In each equation, $k$ represents the number of update steps. $m_k$ represents the first moment of the gradient $g_k$, $v_k$ represents the second moment of the gradient $g_k$, and $n_k$ represents the third moment of the gradient $g_k$, where $g_k$ is the gradient obtained by taking the derivative of the loss function $f(\theta)$ with respect to $\theta$. $\alpha$ is the learning rate used to control the step size, and $\varepsilon$ is a small constant added to the denominator for numerical stability. $\theta$ represents the parameter to be updated. $\beta_1$, $\beta_2$, and $\beta_3$ represent the first, second, and third moment decay coefficients, respectively. $\lambda_k$ is the weight decay coefficient.
First, $\theta_0$ and the learning rate $\alpha$ are initialized. The other parameters are usually set as follows: momentum $(\beta_1, \beta_2, \beta_3) \in [0,1]^3$, stability parameter $\varepsilon > 0$, weight decay coefficient $\lambda_k > 0$, and initial values $m_0 = g_0$, $v_0 = 0$, $v_1 = g_1 - g_0$, and $n_0 = g_0^2$. Then Formula (1) is used to update the parameter $\theta$.
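For illustration, a minimal single-tensor version of the update in Formula (1) could look like the sketch below. It is not the authors' released Adan implementation: per-parameter-group handling and bias correction are omitted, and the default beta values are illustrative and follow the convention of Formula (1).

```python
import torch

def adan_step(theta, grad, state, lr=1e-3, betas=(0.02, 0.08, 0.01), eps=1e-8, weight_decay=0.0):
    """One Adan update for a single tensor, following Formula (1). Minimal sketch only."""
    b1, b2, b3 = betas
    g = grad
    if "m" not in state:
        # Initialisation as described above: m0 = g0, v0 = 0, n0 = g0^2.
        state["m"], state["v"] = g.clone(), torch.zeros_like(g)
        state["n"], state["g_prev"] = g * g, g.clone()
    m, v, n, g_prev = state["m"], state["v"], state["n"], state["g_prev"]
    diff = g - g_prev
    m.mul_(1 - b1).add_(g, alpha=b1)                    # first moment
    v.mul_(1 - b2).add_(diff, alpha=b2)                 # moment of the gradient difference
    u = g + (1 - b2) * diff
    n.mul_(1 - b3).addcmul_(u, u, value=b3)             # squared combined gradient term
    step = lr / (n.sqrt() + eps)
    theta.sub_(step * (m + (1 - b2) * v))
    theta.div_(1 + weight_decay * lr)                   # decoupled weight decay
    state["g_prev"] = g.clone()
    return theta
```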

4. Experiment Settings

4.1. Evaluation Indicators

The performance of the model is comprehensively assessed in terms of three metrics: detection accuracy, detection speed, and model complexity.

4.1.1. mAP

The detection accuracy is evaluated using the mAP (mean Average Precision), which is a commonly used evaluation metric in object detection. It provides a comprehensive evaluation of the detection performance, considering both Precision (P) and Recall (R). The formulas for calculating Precision and Recall are shown in (2) and (3):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$
where TP represents the number of actual positive samples predicted as the positive class, FP represents the number of actual negative samples predicted as the positive class, and FN represents the number of actual positive samples predicted as the negative class. Generally, improving precision may lead to a decrease in recall. This relationship can be represented by the Precision-Recall (P-R) curve, where the area under the curve (AUC) represents the average precision (AP) for a category. For each category, the AP can be calculated using Formula (4).
$$AP = \int_0^1 p(r)\,\mathrm{d}r \tag{4}$$
The symbol $p(r)$ represents the maximum precision value when the recall is greater than or equal to $r$ (where $r$ ranges from 0 to 1). In general, mAP is the average of the AP values over all detected categories, and its calculation is shown in Formula (5).
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{5}$$
where $N$ represents the number of object categories contained in the dataset, and $AP_i$ represents the average precision of the $i$-th category.
A higher mAP value indicates more accurate object detection and better performance of the detection model.
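As a concrete illustration of Formulas (4) and (5), the sketch below computes an interpolated AP from per-class precision/recall arrays and averages the per-class values into mAP. It is a generic VOC-style computation written for this article, not the evaluation code used in the experiments.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the P-R curve with p(r) taken as the maximum precision at recall >= r."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing (interpolated precision).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangular areas where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class AP values (Formula (5))."""
    return sum(ap_per_class) / len(ap_per_class)
```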

4.1.2. FPS

The detection speed is represented by Frames Per Second (FPS), a measure of the number of frames processed per second in the field of image processing. In object detection, FPS is often used to evaluate the real-time performance of detection models. The faster the detection speed, the better the system can detect objects in real time and determine their instantaneous positions. FPS is calculated with Formula (6):
$$FPS = \frac{1000}{T_{pre} + T_{infer} + T_{nms}} \tag{6}$$
where $T_{pre}$ refers to the image preprocessing time, $T_{infer}$ refers to the model inference time, and $T_{nms}$ refers to the postprocessing (non-maximum suppression) time, all measured in milliseconds (ms).
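As an illustration of how such a measurement can be taken, the sketch below times the per-image latency of a model on a CUDA device and converts it to FPS following Formula (6). The helper is hypothetical and simplified: preprocessing and NMS are only stood in for by the tensor transfer, and a real measurement would time the three stages separately.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, device="cuda"):
    """Rough FPS estimate: 1000 / average per-image time in ms (cf. Formula (6))."""
    model.eval().to(device)
    # Warm-up pass so lazy CUDA initialisation does not skew the timing.
    model(images[0].unsqueeze(0).to(device))
    torch.cuda.synchronize()
    start = time.time()
    for img in images:
        x = img.unsqueeze(0).to(device)   # stand-in for T_pre (preprocessing)
        model(x)                          # T_infer; T_nms would follow here
    torch.cuda.synchronize()
    ms_per_image = (time.time() - start) * 1000 / len(images)
    return 1000.0 / ms_per_image
```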

4.1.3. Params and FLOPs

Model complexity can be evaluated from two aspects: the number of model parameters (Params) and the number of floating-point operations (FLOPs). Generally, the larger the number of parameters and floating-point operations, the more complex the model, and the accuracy may also improve. However, this requires more computing resources during training and places higher demands on the device, making such models difficult to deploy on small devices such as underwater robots. Therefore, provided that the loss in accuracy is small, reducing the number of parameters and floating-point operations as much as possible indicates better model performance.
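For reference, the parameter count used in this kind of comparison can be read directly from the model, as in the sketch below; FLOPs are usually estimated with an external profiler (the thop call in the comment is an assumption about tooling, not part of this work's pipeline).

```python
import torch

def count_parameters_m(model: torch.nn.Module) -> float:
    """Number of trainable parameters, reported in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# FLOPs are typically profiled on a dummy 640x640 input, e.g. (assuming thop is installed):
#   from thop import profile
#   flops, params = profile(model, inputs=(torch.zeros(1, 3, 640, 640),))
```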

4.2. Experimental Environment

The experiments in this study were all conducted on the same computing platform, as presented in Table 2. The deep learning framework was PyTorch 1.8.0 + cu111, with an NVIDIA GeForce RTX 3090 GPU and Windows 10. The images were resized to 640 × 640, the batch size during training was set to 16, and the number of training epochs was set to 300. The IDE was PyCharm with a Python 3.7 programming environment.

4.3. URPC2022 Dataset

The studied dataset was provided by the 2022 China Underwater Robot Professional Contest (URPC 2022) which consists of 9000 images captured in real marine environments. There is no inter-frame continuity between the images. The dataset includes images in various scales from different geographic environments and lighting conditions. Some sample data are shown in Figure 5.
The dataset consists of four types of seafood: holothurian, echinus, starfish, and scallop. However, there is also a small number of seaweed samples, which may cause interference to the experimental results. Therefore, the dataset is initially cleaned to ensure data quality and accuracy. The cleaned images are randomly divided into 7102 training samples, 887 validation samples, and 887 test samples in an 8:1:1 ratio for subsequent experiments.
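A minimal sketch of such a random 8:1:1 split is shown below; the directory layout and file extension are assumptions for illustration and do not reflect the authors' actual data pipeline.

```python
import random
from pathlib import Path

def split_dataset(image_dir, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Randomly split the cleaned image list into train/val/test at an 8:1:1 ratio."""
    files = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(files)
    n = len(files)
    n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
    return files[:n_train], files[n_train:n_train + n_val], files[n_train + n_val:]
```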
The training set consists of samples of the four types of seafood. The number of each object is depicted in Figure 6. It is evident that there is an imbalance in the quantities among the various categories, with the holothurian category having the lowest number of samples. This category imbalance poses challenges during training, as it may lead to underfitting issues and affect the network’s learning ability.

5. Experimental Results and Discussion

In this study, the presented model was verified by quantitative comparisons with other recent mainstream detection models. Additionally, ablation experiments were conducted to verify the effectiveness of each improvement on the model.

5.1. Comparative Results with Other Models

Table 3 presents the experimental results of the currently popular two-stage models, namely, Faster RCNN (ResNet50) and Faster RCNN (VGG16), as well as the one-stage ones like RetinaNet, YOLOv5l, YOLOv6s, YOLOv6l, and YOLOv6-ESG (ours) for object detection. In these experiments, all models were evaluated using input images of size 640 × 640, trained for 300 iterations, and under consistent experimental conditions. Table 3 provides comparison results of evaluation metrics, including mAP for the four types of seafood, Params, FLOPs, and Speed, across different models on the URPC2022 dataset.
Upon analyzing the experimental results given in Table 3, it is evident that the proposed model outperformed the other six models in terms of the evaluation metrics Params and FLOPs. The proposed model had 14.36 M parameters and 29.28 GFLOPs, significantly lower than the other models, indicating that it is feasible for fast underwater object detection and deployment on real-time underwater equipment. Taking FPS as the evaluation index of detection speed, when the batch size was set to 1, the detection speed of the YOLOv6 models was generally lower than that of YOLOv5l, but the detection speed of the proposed YOLOv6-ESG reached 50.66 FPS, which satisfied the real-time requirements of underwater object detection. On the other hand, when the batch size was set to 4, the proposed YOLOv6-ESG demonstrated better performance, reaching a detection speed of 140.06 FPS. This was 32.53 FPS, 2.51 FPS, and 38.02 FPS higher than YOLOv5l, YOLOv6s, and YOLOv6l, respectively. Based on the experimental results, the proposed YOLOv6-ESG exhibited superior detection speed capabilities. In terms of mAP@0.5, the improved YOLOv6-ESG model achieved a detection accuracy of 86.6%, an increase of 13.7%, 11.2%, 28.8%, 1.3%, and 1.1% compared to the first five models, respectively.
Although it slightly lagged behind the YOLOv6l model in terms of detection accuracy, the improved YOLOv6-ESG model offered the advantage of having lower parameters and computational costs. Moreover, it achieved faster detection speed, effectively striking a better balance between detection accuracy and speed. This combination of factors made it well-suited for real-time underwater object detection, meeting the demands of practical applications.
To provide a more comprehensive analysis of the model’s performance, Table 4 provides the results of different models in terms of accuracy in each seafood category, including $AP_{ho}$ (accuracy for holothurian), $AP_{ec}$ (accuracy for echinus), $AP_{st}$ (accuracy for starfish), and $AP_{sc}$ (accuracy for scallops). This enables a detailed evaluation of how well each model performs in accurately detecting and classifying different seafood.
Table 4 illustrates the detection accuracy of each seafood at IoU (Intersection over Union) thresholds of 0.5. It can be concluded that YOLOv6l exhibited significant advantages in terms of detection accuracy. This was the reason why this paper selected YOLOv6l as the base model for research. Comparing the results vertically, it is evident that the proposed YOLOv6-ESG model achieved optimal performance in detecting echinus, with an accuracy of 89.6%. This outperformed the other models by margins of 10.9%, 8%, 14.9%, 2.1%, 1.1%, and 0.6%, respectively. For other seafood detection, the proposed model is slightly inferior to YOLOv6l, with only a marginal decrease in accuracy. In summary, the proposed YOLOv6-ESG model demonstrated superior performance in detecting echinus with significantly higher accuracy than the other models. While it slightly lagged behind YOLOv6l in other seafood detection, the model still maintained a commendable level of accuracy.
From a comprehensive (horizontal) perspective, the detection accuracy of echinus and starfish was significantly higher than that of the other two seafood types, and the detection accuracy of holothurian was generally the lowest. Two aspects can be considered in the analysis: (1) From the perspective of the dataset, the dataset exhibited an imbalance in the number of samples for each class. Echinus had the highest number of labeled samples, followed by starfish, while holothurian had the fewest. As a result, the model could learn more detailed features of echinus and starfish, leading to higher recognition accuracy. In contrast, the model may have had limited exposure to the distinctive features of holothurian and not effectively learned their unique characteristics, which could lead to a decrease in recognition accuracy. (2) From the perspective of class characteristics, echinus has distinctive spines and a predominantly black appearance, while starfish exhibit a clear “five-pointed star” shape and are typically blue in color. These characteristics make them relatively easy to identify. On the other hand, holothurians exhibit colors and patterns that resemble the background, such as seagrass, and their variable shape and less distinct features make them more difficult to distinguish accurately from their surroundings. Scallops are typically white, but their surfaces can be covered with algae and other impurities, which makes their features less prominent and leads to lower recognition accuracy.
Taking all aspects into consideration, the proposed YOLOv6-ESG model exhibited slightly lower accuracy compared to the YOLOv6l model by 0.2%. However, it significantly outperformed YOLOv6l in terms of the number of parameters and floating-point operations, while also achieving higher FPS. This made it more suitable for deployment on underwater robots and other devices that require real-time performance. Although the proposed model had a slower detection speed compared to the YOLOv5l model (batch size = 1), it achieved higher detection accuracy. Moreover, the reduced parameters and computational demands of the proposed model enhance its feasibility for underwater object detection and recognition tasks.

5.2. Qualitative Analysis of Prediction Results

This section mainly focuses on the prediction results of images under various models. Figure 7 shows some prediction results of different models in various scenarios under different water and lighting conditions, such as color distortion, blurriness, small objects, and aggregation states. The experimental environment remained the same, except for the different colors of a few detection boxes.
Based on the analysis of Figure 7, it can be observed that the underwater dataset contained images of varying degrees of low quality. When comparing the detected objects with the ground truth, most models performed well on most objects, but there were still some missed detections and false positives. Image (a) featured a background with abundant underwater vegetation and a significant color shift; despite this challenge, the objects remained relatively clear visually. The Faster RCNN (ResNet50) and Faster RCNN (VGG16) models could detect all existing objects, but they also exhibited some false detections, incorrectly identifying several background elements as objects, which affected the overall detection efficiency. The RetinaNet model showed some missed detections, where a few objects were not accurately detected. The other models, including the proposed model, detected the object positions without missed or false detections. In image (b), a “fog” effect was present on the surface and the image appeared blurry, making it difficult to discern the objects. A comparison with the ground truth revealed that the Faster RCNN (ResNet50) model could detect all objects in the image, but it exhibited a few false detections. The Faster RCNN (VGG16), RetinaNet, YOLOv5l, YOLOv6s, and YOLOv6l models showed varying degrees of missed detections; some holothurians, which were relatively concealed, were not fully detected by these models. The proposed model, however, achieved results that were entirely consistent with the ground truth, successfully detecting all objects without any missed or false detections. These findings indicated that the proposed YOLOv6-ESG model delivers superior detection performance for underwater images with slight blurriness and similar issues. In image (c), the background was murky and cluttered, with significant blurriness. The objects in the image were small and resembled the background, making their features extremely indistinct. As a result, all the experimental models exhibited varying degrees of missed and false detections. This outcome indicated that, for datasets featuring turbid backgrounds and indistinct objects, it is challenging for all the studied models to accurately detect every object based only on learned object features, and the detection performance in such cases was subpar. In image (d), where there was no noticeable color deviation and the background was relatively clear, the objects were clustered together and the experimental models demonstrated excellent detection ability. Interestingly, even for an object that the ground truth did not label (in the lower right corner of the image), each model was still capable of detecting the object category and position based on the learned features. This observation suggested that in scenarios where multiple objects are clustered together, human annotators may also overlook some objects; in such cases, combining the model’s detection with manual labeling can yield more accurate and reliable results.
In conclusion, the proposed model YOLOv6-ESG demonstrated better performance in detecting object categories and positions, with minimal instances of false detections. It could even detect unlabeled objects in clustered scenarios. The model performed well in scenarios with slight blurriness and small objects. However, its detection performance was compromised in images with heavily turbid backgrounds or significant blurriness. Therefore, to ensure detection accuracy, it is advisable to avoid conducting underwater fishing operations during unfavorable weather conditions or when the marine environment is heavily disturbed.

5.3. Ablation Experiments

To assess the effectiveness of the proposed optimization based on the YOLOv6l model, ablation experiments were conducted. In order to ensure comparability, the environmental configuration for all experiments remained consistent. The training iterations were set to 300 and batch size was set to 1 or 4 for different experiments. The ablation experiments focused on making improvements in the backbone layer, neck layer, and optimizer.
In the backbone layer, the proposed model utilized the improved OD-E2 as the backbone network. In the neck layer, it incorporated the lightweight GSConv module (abbreviated as GS in Table 5) and the lightweight VoVGSCSP module (abbreviated as VoV in Table 5). To further tackle the difficulties specific to underwater images, the SPD-Conv module (abbreviated as SPD in Table 5) was introduced in both the backbone and neck layers. These modules enhance the model’s adaptability and detection performance for underwater images. As displayed in Table 5, the ablation experiments were independently numbered to evaluate the impact of different improvements on the model’s performance.
From Table 5, it can be observed that this study attempted different combinations of modules to improve the YOLOv6l model. When the backbone network was replaced with the lightweight EfficientNetv2, there was a reduction in both model parameters and floating-point operations. This validated the effectiveness of the lightweight modification. However, this improvement came at the cost of a significant decrease in model accuracy. In Exp. 3, when it was replaced with the proposed OD-E2 network, there was a decrease in model accuracy compared to Exp. 2. However, there was a further reduction in the number of parameters and floating-point operations. In order to enhance the overall detection accuracy of the model, the study utilized the more efficient Adan optimizer. The experimental results revealed that this optimizer only introduced a slight increase in computational complexity but resulted in a significant improvement of 4.1% in detection accuracy. This finding underscored the compatibility of the Adan optimizer with the proposed model, highlighting its superior performance compared to the original optimizer. Based on Exp. 4, Exp. 5 was conducted to focus on the SPD-Conv module. Despite a slight increase in both the number of parameters and floating-point operations, there was a notable improvement in detection accuracy. This outcome provided strong evidence for the practicality and effectiveness of the SPD-Conv module. Exp. 6 introduced the improved model proposed in this study. Building on the findings of Exp. 5, the neck layer of the model incorporated the GSConv and VoVGSCSP modules to replace the original modules. This modification resulted in the lowest number of parameters and floating-point operations. As a result, the model demonstrated an equal detection accuracy of 86.6% compared to basic YOLOv6l while achieving faster detection speed.
In summary, the improvements to the backbone and neck layers significantly reduced the number of the model’s parameters and floating-point operations. Although there was a slight decrease in model accuracy, this was the trade-off for faster detection speed, which ensured that the model met the real-time requirements of object detection.
Table 6 presents the detection results of different models for four categories of seafood. All experiments were conducted under consistent environmental settings. The detailed experimental results are given in Table 6.
Based on the experimental results from Table 6, it can be observed that the model designed in this study achieved favorable performance. Compared to the results of the YOLOv6l model, the proposed model demonstrated better performance on the echinus dataset. Although it might not achieve the best results for the other three categories, the performance was close to that of the YOLOv6l model. In terms of the detection results for each category within the same model, the proposed model YOLOv6-ESG in this study performed best in detecting starfish, followed by echinus. However, the detection performance for the other two categories of seafood was relatively poor. This may also result from the fact that the echinus and starfish datasets had a larger number of samples and possessed distinct color or shape features, allowing the model to learn these features effectively. Conversely, the other two categories had fewer training samples and inconspicuous features. They might have a similar background color or be partially covered by seaweed, making it challenging for the model to learn the subtle details. As a result, the detection performance for these categories was not as satisfactory.
The results of the ablation experiments demonstrated that with the addition of each method, the model’s detection accuracy improved or the number of parameters and floating-point operations decreased. This indicated the effectiveness of the introduced modules in achieving lightweight improvements. Furthermore, it validated the effectiveness of the model’s improvement methods. In order to ensure the robustness of underwater object detection and better adaptability to various marine environments, the dataset used in this study did not undergo any image preprocessing or similar operations. The proposed model demonstrated favorable detection performance for all four categories of seafood. Additionally, the model achieved a significant reduction in the number of parameters and floating-point operations with a faster detection speed. These experimental results highlight the effectiveness of the proposed model for underwater object detection, indicating its viability for devices in underwater environments for detection and recognition tasks.

6. Conclusions

This study provides a feasible lightweight method, YOLOv6-ESG, based on YOLOv6 for real-time detection and identification in fishing operations. Previously proposed methods in the field of marine organism recognition integrate various techniques to enhance accuracy, which may increase energy consumption. To address this concern, this work pays particular attention to energy efficiency in the method design, and a series of measures are taken to mitigate the potential impact of high energy consumption. First and foremost, the techniques introduced in this study focus on lightweight structures. The proposed method integrates a more lightweight backbone network, OD-E2, along with an optimized neck layer. The SPD-Conv module is also incorporated to enhance the model structure with only a small increase in computational cost, and the original SimConv and BepC3 modules in YOLOv6l are replaced with the lightweight GSConv and VoVGSCSP modules to further reduce the model’s parameters and floating-point operations. Additionally, the Adan optimizer is utilized in model training to enhance the training process and speed up model convergence. These enhancements reduce computational cost and memory consumption during training and validation, while maintaining prediction performance comparable to the original YOLOv6l model.
Furthermore, the proposed model was extensively evaluated on a real-world underwater image dataset. The experimental results show that, compared with other significant underwater object detection models, the YOLOv6-ESG model with the proposed improvements achieves an accuracy of 86.6% while having the lowest number of parameters and floating-point operations. The model not only ensures detection accuracy but also accelerates training and detection under the same computing resources, making it more suitable for underwater equipment in object detection tasks. The underwater video detection results of the YOLOv6-ESG model can be found in the Supplementary Materials. In the future, the method will be further optimized for implementation on autonomous underwater fishing equipment for further validation.

Supplementary Materials

The following supporting information can be downloaded at https://zenodo.org/record/8108546 (accessed on 5 July 2023), Video S1: Detection results of YOLOv6-ESG underwater video.

Author Contributions

Writing—original draft preparation, J.W. and Q.L.; writing—review and editing, Z.F., X.Z., Z.T. and Y.H.; investigation, J.W., Q.L. and Y.H.; funding acquisition, J.W., Y.H. and Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers: 42176175, 42101443, and 61806123), the National Key R&D Program of China (grant number: 2019YFD0900805), and the Shanghai Sailing Program (grant number: 16YF1415700).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Enquiries regarding the experimental data should be made by contacting the first author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mana, S.C.; Sasipraba, T. An intelligent deep learning enabled marine fish species detection and classification model. Int. J. Artif. Intell. Tools 2022, 31, 2250017. [Google Scholar] [CrossRef]
  2. Czub, M.; Kotwicki, L.; Lang, T.; Sanderson, H.; Klusek, Z.; Grabowski, M.; Szubska, M.; Jakacki, J.; Andrzejewski, J.; Rak, D. Deep sea habitats in the chemical warfare dumping areas of the Baltic Sea. Sci. Total Environ. 2018, 616, 1485–1497. [Google Scholar] [CrossRef]
  3. Fengqiang, X.; Peng, D.; Huibing, W.; Xianping, F. Intelligent detection and autonomous capture system of seafood based on underwater robot. J. Beijing Univ. Aeronaut. Astronaut. 2019, 45, 2393–2402. [Google Scholar]
  4. Shen, Z.; Liu, Z.; Li, J.; Jiang, Y.-G.; Chen, Y.; Xue, X. Dsod: Learning deeply supervised object detectors from scratch. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1919–1927. [Google Scholar]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  6. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  7. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  8. Yuhao, L. Research on Detection and Recognition Technology of Underwater Small Target Based on Faster R-CNN. Master Dalian Univ. Technol. 2021, 100, 104190. [Google Scholar]
  9. Yu, L.; Shuiyuan, H. Improved Cascade RCNN for underwater object detection. Electron. World 2022, 01, 105–108. [Google Scholar]
  10. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
Figure 1. YOLOv6-ESG network structure diagram. The improved modules in the backbone and neck layers are introduced in Section 3.1 and Section 3.2.
Figure 2. The structure of the OD-FusedMBConv (ODConv + FusedMBConv) module in the backbone network: (a) the structure diagram of the FusedMBConv module, (b) the structure diagram of the lightweight ODConv module, and (c) the structure diagram of the OD-FusedMBConv module designed in this work, which integrates the FusedMBConv and ODConv modules.
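For readers who prefer code to block diagrams, the following PyTorch fragment is a minimal, illustrative sketch of an OD-FusedMBConv-style block. It keeps the FusedMBConv layout (a fused 3 × 3 expansion convolution, a 1 × 1 projection, and an identity shortcut) and replaces the expansion convolution with a dynamic convolution. For brevity, the dynamic convolution here attends only over the number of parallel kernels; the full ODConv additionally attends over the spatial, input-channel, and output-channel dimensions, and the SE block of FusedMBConv is omitted. All layer sizes are illustrative assumptions, not the configuration used in this work.

```python
# Illustrative sketch only: a FusedMBConv-style block whose expansion convolution is a
# simplified dynamic convolution (kernel-number attention only, one of ODConv's four
# attention dimensions). Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Kernel-wise dynamic convolution: a softmax-weighted mix of K candidate kernels."""
    def __init__(self, c_in, c_out, k=3, num_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_kernels, c_out, c_in, k, k) * 0.02)
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(c_in, num_kernels))
        self.k = k

    def forward(self, x):
        b = x.size(0)
        alpha = F.softmax(self.attn(x), dim=1)                    # (B, K) per-sample kernel weights
        w = torch.einsum("bk,koihw->boihw", alpha, self.weight)   # mix the K kernels per sample
        # Grouped-convolution trick: fold the batch into channels so each sample
        # is convolved with its own mixed kernel in a single conv2d call.
        x = x.reshape(1, -1, *x.shape[2:])
        w = w.reshape(-1, *w.shape[2:])
        out = F.conv2d(x, w, padding=self.k // 2, groups=b)
        return out.reshape(b, -1, *out.shape[2:])

class ODFusedMBConv(nn.Module):
    """FusedMBConv layout: fused 3x3 expansion + 1x1 projection + identity shortcut."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Sequential(DynamicConv2d(channels, hidden, k=3),
                                    nn.BatchNorm2d(hidden), nn.SiLU())
        self.project = nn.Sequential(nn.Conv2d(hidden, channels, 1, bias=False),
                                     nn.BatchNorm2d(channels))

    def forward(self, x):
        # Residual connection is valid here because stride is 1 and channels match.
        return x + self.project(self.expand(x))

print(ODFusedMBConv(24)(torch.randn(2, 24, 64, 64)).shape)  # torch.Size([2, 24, 64, 64])
```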
Figure 3. The overall framework of the SPD-Conv module.
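As a concrete illustration of the idea in Figure 3, the following PyTorch sketch implements a generic SPD-Conv block: a space-to-depth rearrangement that halves the spatial resolution without discarding pixels, followed by a non-strided convolution. The channel sizes and the BatchNorm/SiLU composition are assumptions for illustration rather than the exact layers used in this work.

```python
# A minimal sketch of SPD-Conv: space-to-depth followed by a stride-1 convolution.
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # After space-to-depth with scale 2, the channel count is multiplied by 4.
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Space-to-depth: rearrange every 2x2 spatial block into the channel dimension,
        # halving H and W without the information loss of strided convolution or pooling.
        x = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.conv(x)

# Example: a 64-channel feature map is downsampled to half resolution.
feat = torch.randn(1, 64, 80, 80)
print(SPDConv(64, 128)(feat).shape)  # torch.Size([1, 128, 40, 40])
```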
Figure 4. The structure of the GSConv module. The “Conv” box consists of three layers: a convolutional-2D layer, a batch normalization-2D layer, and an activation layer.
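The GSConv structure in Figure 4 can likewise be sketched in a few lines of PyTorch: a standard convolution produces half of the output channels, a cheap depthwise convolution produces the other half, and the concatenated result is channel-shuffled. The “Conv” boxes follow the Conv2d + BatchNorm2d + activation composition described in the caption; the depthwise kernel size and the SiLU activation are assumptions for illustration.

```python
# A minimal sketch of GSConv: dense conv half + depthwise half, concatenated and shuffled.
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1, groups=1):
    # The three-layer "Conv" box from Figure 4: Conv2d + BatchNorm2d + activation.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class GSConv(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.dense = conv_bn_act(c_in, c_half, k, s)                       # standard convolution branch
        self.depthwise = conv_bn_act(c_half, c_half, 5, 1, groups=c_half)  # cheap depthwise branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.dense(x)
        x2 = self.depthwise(x1)
        y = torch.cat((x1, x2), dim=1)
        # Channel shuffle: interleave the dense and depthwise halves so that information
        # from the standard convolution reaches every output channel.
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

print(GSConv(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```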
Figure 5. Sample images from URPC2022 in different geographic environments and lighting conditions, which contain four types of seafood: holothurian, echinus, starfish, and scallop.
Figure 6. The number of instances of each of the four seafood categories in the cleaned URPC2022 training dataset.
Figure 7. Example prediction results of different models. (a–d) Randomly selected images showing color shift, blur, small objects, and aggregated objects, respectively. The Ground Truth row shows the objects present in the original images (labels only), while the other rows show the detection results (labels and probabilities) of the different models.
Table 1. OD-E2 network architecture.

| Stage | Operator | Stride | #Channels | #Layers |
|---|---|---|---|---|
| 0 | Conv3 × 3 | 2 | 24 | 1 |
| 1 | OD-FusedMBConv1, k3 × 3 | 1 | 24 | 2 |
| 2 | SPD-Conv | - | - | 1 |
| 3 | OD-FusedMBConv4, k3 × 3 | 1 | 48 | 4 |
| 4 | SPD-Conv | - | - | 1 |
| 5 | OD-FusedMBConv4, k3 × 3 | 1 | 64 | 4 |
| 6 | SPD-Conv | - | - | 1 |
| 7 | MBConv4, k3 × 3 | 1 | 128 | 6 |
| 8 | SPD-Conv | - | - | 1 |
| 9 | MBConv6, k3 × 3 | 1 | 160 | 9 |
| 10 | SPPF, k5 × 5 | - | 160 | 1 |
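For reference, the stages of Table 1 can be written out as a plain configuration list. The snippet below is only a transcription of the table plus a small resolution check; the 640 × 640 input size is an assumption for illustration and is not stated in the table.

```python
# Transcription of Table 1: (stage, operator, stride, output channels, number of layers).
OD_E2_STAGES = [
    (0,  "Conv3x3",               2,    24,   1),
    (1,  "OD-FusedMBConv1, k3x3", 1,    24,   2),
    (2,  "SPD-Conv",              None, None, 1),
    (3,  "OD-FusedMBConv4, k3x3", 1,    48,   4),
    (4,  "SPD-Conv",              None, None, 1),
    (5,  "OD-FusedMBConv4, k3x3", 1,    64,   4),
    (6,  "SPD-Conv",              None, None, 1),
    (7,  "MBConv4, k3x3",         1,    128,  6),
    (8,  "SPD-Conv",              None, None, 1),
    (9,  "MBConv6, k3x3",         1,    160,  9),
    (10, "SPPF, k5x5",            None, 160,  1),
]

# Only the stem is strided; every SPD-Conv stage halves the spatial resolution.
downsample = 2 * 2 ** sum(1 for s in OD_E2_STAGES if s[1] == "SPD-Conv")
print(downsample)         # 32
print(640 // downsample)  # 20 -- deepest feature-map size for an assumed 640x640 input
```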
Table 2. Experimental environment configuration parameters.

| Environment | Version or Model Number |
|---|---|
| OS | Windows 10 |
| CPU | 12th Gen Intel(R) Core(TM) i7-12700KF |
| GPU | NVIDIA GeForce RTX 3090 |
| CUDA | V 11.1 |
| PyTorch | V 1.8.0 |
| Python | V 3.7 |
Table 3. Experimental results of different underwater object detection models.

| Model | mAP@0.5 (%) | Params (M) | FLOPs (G) | Speed b1 (FPS) | Speed b4 (FPS) |
|---|---|---|---|---|---|
| Faster RCNN (ResNet50) | 72.9 | 28.31 | 474.08 | 26.77 | - |
| Faster RCNN (VGG16) | 75.4 | 136.75 | 200.88 | 35.38 | - |
| RetinaNet | 57.8 | 36.39 | 164.55 | 53.33 | - |
| YOLOv5l | 85.3 | 46.12 | 107.7 | 66.23 | 107.53 |
| YOLOv6s | 85.5 | 17.19 | 44.07 | 46.62 | 137.55 |
| YOLOv6l | 86.8 | 58.47 | 143.8 | 44.13 | 102.04 |
| YOLOv6-ESG (Ours) | 86.6 | 14.36 | 29.28 | 50.66 | 140.06 |
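The efficiency gap in Table 3 can be quantified directly from the listed values. The short calculation below, using the Params and FLOPs columns, gives the relative reductions of YOLOv6-ESG with respect to the YOLOv6l baseline.

```python
# Relative parameter and FLOP reductions computed from Table 3.
params = {"YOLOv6l": 58.47, "YOLOv6-ESG": 14.36}   # millions of parameters
flops  = {"YOLOv6l": 143.8, "YOLOv6-ESG": 29.28}   # GFLOPs

def reduction(baseline: float, ours: float) -> float:
    return 100.0 * (baseline - ours) / baseline

print(f"Params reduced by {reduction(params['YOLOv6l'], params['YOLOv6-ESG']):.2f}%")  # ~75.44%
print(f"FLOPs  reduced by {reduction(flops['YOLOv6l'],  flops['YOLOv6-ESG']):.2f}%")   # ~79.64%
```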
Table 4. Per-class detection results (AP at IoU = 0.5) of each underwater object detection model. Subscripts: ho = holothurian, ec = echinus, st = starfish, sc = scallop.

| Model | AP_ho (%) | AP_ec (%) | AP_st (%) | AP_sc (%) |
|---|---|---|---|---|
| Faster RCNN (ResNet50) | 71.9 | 78.7 | 77.8 | 63.2 |
| Faster RCNN (VGG16) | 72.8 | 81.6 | 80.1 | 66.9 |
| RetinaNet | 50.4 | 74.7 | 66.8 | 39.6 |
| YOLOv5l | 80.8 | 87.5 | 88.8 | 83.9 |
| YOLOv6s | 81.6 | 88.5 | 89.0 | 82.8 |
| YOLOv6l | 83.4 | 89.0 | 89.9 | 84.7 |
| YOLOv6-ESG (Ours) | 82.8 | 89.6 | 89.8 | 84.0 |
Table 5. Ablation experiment results of different models.

| Exp. | Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FLOPs (G) | Speed b1 (FPS) | Speed b4 (FPS) |
|---|---|---|---|---|---|---|---|
| 1 | YOLOv6l | 86.8 | 53.1 | 58.47 | 143.80 | 44.13 | 102.04 |
| 2 | YOLOv6l + EfficientNetV2 | 84.1 | 49.8 | 24.18 | 70.12 | 44.29 | 129.20 |
| 3 | YOLOv6l + OD-E2 | 80.7 | 46.5 | 23.42 | 47.50 | 41.32 | 102.67 |
| 4 | YOLOv6l + OD-E2 + Adan | 84.8 | 50.3 | 23.91 | 47.90 | 42.81 | 103.73 |
| 5 | YOLOv6l + OD-E2 + Adan + SPD | 86.2 | 51.6 | 26.71 | 54.87 | 41.36 | 98.81 |
| 6 | YOLOv6l + OD-E2 + Adan + SPD + GS&VoV (ours) | 86.6 | 51.9 | 14.36 | 29.28 | 50.66 | 140.06 |
Table 6. Ablation experiment results for the four categories of seafood (ho = holothurian, ec = echinus, st = starfish, sc = scallop; all values in %).

| Model | AP_ho@0.5 | AP_ec@0.5 | AP_st@0.5 | AP_sc@0.5 | AP_ho@0.5:0.95 | AP_ec@0.5:0.95 | AP_st@0.5:0.95 | AP_sc@0.5:0.95 |
|---|---|---|---|---|---|---|---|---|
| YOLOv6l | 83.4 | 89.0 | 89.9 | 84.7 | 50.3 | 50.9 | 56.6 | 54.5 |
| YOLOv6l + E2 | 79.5 | 87.8 | 88.4 | 80.5 | 46.0 | 49.9 | 53.5 | 49.6 |
| YOLOv6l + OD-E2 | 71.9 | 87.2 | 87.2 | 76.5 | 40.5 | 48.1 | 51.3 | 45.8 |
| YOLOv6l + OD-E2 + Adan | 80.2 | 88.1 | 88.9 | 81.8 | 45.6 | 50.6 | 54.3 | 50.9 |
| YOLOv6l + OD-E2 + Adan + SPD | 82.2 | 89.5 | 89.4 | 83.7 | 47.7 | 51.1 | 55.1 | 52.6 |
| YOLOv6l + OD-E2 + Adan + SPD + GS&VoV (ours) | 82.8 | 89.6 | 89.8 | 84.0 | 48.7 | 50.7 | 55.2 | 53.0 |
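As a consistency check between Tables 5 and 6, the snippet below averages the four per-class APs of the final model at IoU = 0.5:0.95, assuming (as is standard) that mAP is the unweighted mean of the per-class APs.

```python
# Per-class AP values at IoU = 0.5:0.95 copied from the final YOLOv6-ESG row of Table 6.
ap_050_095 = {"holothurian": 48.7, "echinus": 50.7, "starfish": 55.2, "scallop": 53.0}
mAP = sum(ap_050_095.values()) / len(ap_050_095)
print(f"mAP@0.5:0.95 = {mAP:.1f}%")  # 51.9%, matching experiment 6 in Table 5
```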
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
