1. Introduction
With the development of intelligent automated aquaculture and fishing technology and the improvement of living standards, the demand for aquatic products is steadily increasing. One such product is the sea cucumber, valued for its high nutritional value and its memory-enhancing and anti-tumor effects. Sea cucumber aquaculture has become a major industry in certain coastal areas, raising fishermen's incomes and promoting the development of secondary and tertiary industries such as processing and transportation. However, several problems make harvesting sea cucumbers difficult. Sea cucumbers have two dormant periods each year; thus, fishing operations can be performed only in spring and autumn. Additionally, because sea cucumbers live among abundant reefs, fishing nets are unsuitable for their capture. In most marine ranches, sea cucumbers are harvested by professional divers, who must put on oxygen masks and dive to the seabed. This traditional manual fishing method requires a high level of skill and carries the risk of various occupational diseases due to the low seawater temperatures in spring and autumn, the frequent changes in water pressure during diving and surfacing, and the complex seabed working environment. Therefore, replacing manual work with intelligent underwater robots that can capture sea cucumbers automatically has become the development trend [1].
Perceiving the underwater environment is an integral part of intelligent underwater fishing robot systems, such as Remotely Operated Vehicles (ROVs) and Autonomous Underwater Vehicles (AUVs) [2]. The system perceives information about the subsea environment through acoustic or optical sensors and then takes corresponding actions based on its surroundings. Therefore, the autonomous detection of underwater sea cucumbers is a necessary step for subsea robots to localize and capture sea cucumbers automatically. High resolution and rich information make underwater optical imaging the most intuitive and commonly used method for data acquisition. Nevertheless, the turbidity and poor light transmittance of water can cause a severe decline in the clarity of underwater images, presenting difficulties for vision-based underwater target detection [3]. Turbidity is often encountered in the complex living environment of sea cucumbers. The direction of light transmission is affected by scattering and absorption by the water and by various suspended organic and inorganic matter, such as fish and sediment, which results in image degradation, including blurred target features, color deviation, and severe distortion. In addition, low-light conditions limit the amount of effective target light information that can be retrieved. Consequently, current research on underwater vision mainly focuses on scenarios with good water conditions. Moreover, marine animals like sea cucumbers are usually small, making them difficult to detect and recognize. Therefore, research on feature detection and target recognition in complex and changeable underwater areas is challenging but essential.
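The degradations described above are commonly summarized by the simplified underwater image formation model, a standard first-order approximation rather than a result of this work:

```latex
I_c(x) = J_c(x)\, t_c(x) + A_c \bigl(1 - t_c(x)\bigr), \qquad
t_c(x) = e^{-\beta_c d(x)}, \qquad c \in \{R, G, B\},
```

where I_c is the observed intensity in channel c, J_c the scene radiance to be recovered, A_c the background (veiling) light, t_c the transmission, beta_c a channel-dependent attenuation coefficient, and d(x) the object distance. Most restoration methods can be read as different strategies for estimating t_c and A_c.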
Target detection in underwater areas also needs to consider image restoration and enhancement. As in atmospheric environments, two families of approaches are mainly considered: traditional detection methods and deep-learning-based methods. In the classic methods, regions of interest are first selected through sliding windows, and features within them are extracted with conventional algorithms such as the Scale-Invariant Feature Transform (SIFT) [4] and Histograms of Oriented Gradients (HOG) [5]. Then, machine-learning algorithms such as the Support Vector Machine (SVM) [6] are applied to classify the extracted features and determine whether a region contains targets [7]. Deep-learning-based approaches train neural networks on image sets to establish logical relationships that enhance image clarity and extract target features for intelligent recognition [8]. Other methods for underwater target detection have also been studied, including sonar imaging [9], laser imaging, and polarization imaging [10,11].
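As a concrete illustration of this classic pipeline, the sketch below combines simplified HOG-style features with a sliding window and a linear classifier. All specifics (cell size, window size, stride) are illustrative assumptions; in practice the weight vector and bias would come from an SVM trained on labeled target and background patches rather than being supplied by hand:

```python
import numpy as np

def hog_features(patch, cell=8, bins=9):
    """Simplified HOG: per-cell histograms of gradient orientations,
    magnitude-weighted and L2-normalized over the whole window."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180          # unsigned orientation
    h, w = patch.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist)
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-9)

def sliding_window_detect(image, w_vec, bias, win=16, stride=8, thresh=0.0):
    """Score every window with a linear classifier; return hits (x, y, score)."""
    hits = []
    H, W = image.shape
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            f = hog_features(image[y:y + win, x:x + win])
            s = float(f @ w_vec + bias)
            if s > thresh:
                hits.append((x, y, s))
    return hits
```

With win=16 and cell=8, each window yields a 2 × 2 grid of 9-bin histograms, i.e. a 36-dimensional feature vector.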
The main contributions of the present work are summarized as follows:
The state-of-the-art underwater sea cucumber detection methods are summarized, including traditional methods, one-stage deep-learning methods such as the You Only Look Once (YOLO) series and the Single Shot MultiBox Detector (SSD), two-stage deep-learning methods such as the R-CNN series, anchor-free approaches such as the DEtection TRansformer (DETR), and other methods.
For the detection of sea cucumbers, the fundamentals of YOLOv5 and DETR are first introduced. Then, the training process, the test results of YOLOv5 and DETR, and a performance comparison of the two approaches are presented, demonstrating the strong performance of both in detecting adjacent and overlapping underwater sea cucumbers.
Relevant research methods and the latest achievements in underwater target detection are systematically collated, and experiments based on YOLO and DETR are carried out on the derived sea cucumber datasets.
The rest of the manuscript is organized as follows: Section 2 briefly describes research related to underwater target detection and its recent developments. The fundamentals of YOLOv5 and DETR are described in Section 3. Section 4 presents a detection performance comparison of YOLO and DETR on the sea cucumber dataset. Finally, in Section 5, conclusions are drawn, and current problems are discussed to provide a reference for future work.
2. Related Works
With the development of underwater image-processing and target detection techniques, many conventional algorithms and frameworks have been developed. Conventional target detection approaches extract the features of target regions manually, which is time-consuming and lacks robustness. In most conventional methods, candidate regions are first selected using sliding windows of different sizes. Then, features within these windows are extracted. Finally, machine-learning algorithms are applied for recognition. Classic algorithms such as HOG [5] and the Deformable Part Model (DPM) [12] have some limitations: the region selection strategy is not targeted, which leads to high computational cost and window redundancy, and artificially designed features are not as robust as required given the diversity of features [13].
With the emergence of deep convolutional neural networks, great breakthroughs have been achieved in object detection algorithms. In general, deep-learning-based approaches outperform traditional approaches, most of which demand manual intervention; deep-learning methods are therefore more suitable for deployment on underwater robots. Existing target detection algorithms are mainly divided into two categories: region-proposal-based ones, also known as two-stage algorithms, and regression-based ones, also referred to as one-stage algorithms [14]. Two-stage algorithms first extract proposed regions from the images and then classify and regress them to obtain the detection result; they mainly include Region Convolutional Neural Networks (R-CNNs) [15], Fast R-CNN [16], and Faster R-CNN [17]. Other two-stage networks have been built upon these algorithms, such as Region-based Fully Convolutional Networks (R-FCNs) [18], Mask R-CNN [19], and Cascade R-CNN [20]. Two-stage algorithms obtain more accurate detection results, but the processing time increases accordingly. Single-stage algorithms improve detection speed by detecting and localizing targets directly from the whole image. The main representatives of single-stage algorithms are the SSD [21] algorithm and the YOLO family (YOLO [22], YOLOv2 [23], YOLOv3 [24], YOLOv4 [25], YOLOv5 [26]). With continuous upgrades and innovations, current single-stage target detection algorithms can maintain detection precision while guaranteeing fast processing.
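Regardless of the family, both one-stage and two-stage detectors emit many overlapping candidate boxes that are pruned with non-maximum suppression (NMS). A minimal NumPy sketch follows; the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are conventional choices, not specific to any detector cited here:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it too heavily, then repeat on the survivors."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = int(order[0])
        keep.append(i)
        rest = order[1:]
        if rest.size == 0:
            break
        ious = iou(boxes[i], boxes[rest])
        order = rest[ious <= iou_thresh]
    return keep
```

For two heavily overlapping candidates, only the higher-scoring one survives, while spatially separate detections are kept untouched.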
Deep-learning-based approaches demonstrate good performance, but they also have limitations: detection accuracy is influenced by image quality, and deep-learning approaches generalize only to waters similar to those in the training set images. Therefore, it is necessary to combine good image restoration methods with deep-learning algorithms to make underwater target detection more effective. Thomas et al. [27] created a fully connected convolutional neural network for underwater image defogging. By using the depth frame of the encoder-decoder to integrate low- and high-level features, the network was able to effectively restore blurred images. An image recovery method was proposed by Martin et al. [28], combining image enhancement, image recovery, and a convolutional neural network. To address the dominance of green pixels in underwater images, they proposed the Underwater Dark Channel Prior (UDCP)-based Energy Transmission Restoration (UD-ETR) method to process green-channel images and obtain the recovered images. The Sample-Weighted Hyper Network (SWIPENet) was proposed by Chen et al. in 2020 [29] to cope with the blurring of underwater images under severe noise interference; its architecture comprises several semantically rich, high-resolution hyper feature maps inspired by the Deconvolutional Single Shot Detector (DSSD) [30]. In [31], Dana et al. introduced an innovative approach for enhancing the colors of single underwater images. Distinguishing themselves from previous research, they employed a physical image formation model, estimating transmission values through a haze-lines model for various Jerlov water types. Finally, color corrections were applied using the same physical image formation model, and the optimal outcome was selected from the diverse water types considered.
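The dark-channel idea underlying UDCP-style restoration is compact enough to sketch. The following is a generic DCP transmission estimate, not the exact UD-ETR algorithm; the patch size and the omega haze-retention factor are the conventional values from the dark-channel literature, and UDCP differs mainly in computing the dark channel over the green and blue channels only:

```python
import numpy as np

def dark_channel(img, patch=15):
    """Per-pixel minimum over color channels, then a minimum filter over a
    local square patch. img: (H, W, C) array with values in [0, 1]."""
    min_c = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(min_c, pad, mode='edge')
    H, W = min_c.shape
    out = np.empty_like(min_c)
    for y in range(H):
        for x in range(W):
            out[y, x] = padded[y:y + patch, x:x + patch].min()
    return out

def estimate_transmission(img, A, omega=0.95, patch=15):
    """DCP-style transmission estimate: t = 1 - omega * dark(I / A),
    where A is the per-channel background (veiling) light."""
    return 1.0 - omega * dark_channel(img / A, patch)
```

The recovered radiance then follows from inverting the image formation model, J = (I - A) / max(t, t_min) + A, with t_min a small floor to avoid amplifying noise.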
Weibiao Qiao et al. [32] introduced a novel method for the real-time and precise classification of underwater targets in 2021. They employed Local Wavelet Acoustic Patterns (LWAP) in conjunction with multi-layer perceptron (MLP) neural networks to tackle the heterogeneity and difficulty of underwater passive target classification. A lightweight deep neural network was introduced in [33], aiming to jointly learn color conversion and target detection from subsea images. To mitigate color distortion, an image color conversion module first transforms color images into grayscale ones; object detection is then carried out on the converted grayscale images, with the joint learning optimizing a combined loss function. Xuelong Hu et al. incorporated the Pyramid Attention Network (PAN) into Feature Pyramid Networks (FPN) [34] in [35], augmenting it to produce a diverse multi-scale feature architecture. This enhanced feature structure was then employed in an uneaten-feed-pellet detection model tailored for aquaculture applications; experiments demonstrated a substantial increase in mean average precision (mAP) of 27.21% compared with the baseline YOLOv4 method [25]. To address the challenge of a constrained dataset, Lingcai Zeng et al. [36] incorporated an Adversarial Occlusion Network (AON) into the conventional Faster R-CNN detection algorithm; this proved effective in augmenting the training data and enhancing the detection capability of the network. Taking inspiration from the shortcut connections in residual neural networks [37], Fang Peng et al. introduced the Shortcut Feature Pyramid Network (S-FPN) in [38], whose primary aim is to enhance an existing multi-scale feature fusion strategy, particularly for holothurian detection.
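FPN, PAN, and S-FPN all build on a top-down pathway that upsamples coarse semantic features and merges them with finer lateral ones. A minimal NumPy sketch of that pathway follows, assuming feature maps as (C, H, W) arrays that already share a channel count; in a real network, 1 × 1 lateral convolutions would align the channels, and S-FPN adds its shortcut connections on top of this basic scheme:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(features):
    """FPN-style top-down pathway: start from the coarsest map, then
    repeatedly upsample and add the next finer lateral feature.
    `features` is ordered fine -> coarse, all with equal channel counts."""
    fused = [features[-1]]
    for lateral in reversed(features[:-1]):
        fused.append(lateral + upsample2x(fused[-1]))
    return list(reversed(fused))  # fine -> coarse, matching the input order
```

Each output level thus carries both its own spatial detail and the semantics accumulated from all coarser levels, which is what makes the pyramid useful for small-object detection.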
Through the exploration of enhancement strategies for simulated overlapping, occluded, and blurred objects, Weihong Lin et al. devised a practical generalized model in [39] aimed at addressing the overlapping, occlusion, and blurring of underwater targets. The Super-Resolution Convolutional Neural Network (SRCNN) is a super-resolution technique that relies on purely convolutional layers [40]. In underwater imaging under low-light conditions, SRCNN has been used to enhance the quality of captured images [41]; to derive the low-resolution components, the raw data underwent iterative processing involving total variation regularization [42]. An approach based on YOLO was introduced in [43], aiming to safeguard rare and endangered species or eradicate invasive exotic species. It was designed to classify objects and count their quantities in consecutive underwater video frames by aggregating object classification outcomes from preceding frames into the current frame. In [44], Minghua Zhang et al. introduced a multi-scale Attentional Feature Fusion Module (AFFM) designed to blend semantically and scale-inconsistent features across various convolution layers. In [45,46], an innovative method merging multi-scale features across different channels was introduced, achieved through various kernel sizes or intricate connections within a single block. This approach enhances the network's capacity for representational learning by emphasizing multi-scale feature fusion within a single block, as opposed to fusing features through multiple stages of the backbone network.
Li et al. [47] introduced the GBH-YOLOv5 model for detecting defects in photovoltaic panels. This model incorporates the BottleneckCSP module, integrates a small-target prediction head to improve the detection of smaller features, and utilizes Ghost convolution to optimize inference speed. For the detection of unique fish, Li et al. proposed the CME-YOLOv5 network model in [48], achieving 92.3% mAP. Wen et al. introduced YOLOv5s-CA, with 80.9% mAP, in [49], a model that improves the initial C3 module by integrating a larger number of Bottleneck modules; additionally, it sequentially integrates attention-based CA and SE modules to further enhance the YOLOv5s model. In [50], Yu et al. introduced the Multi-Attention Path Aggregation Network (APAN), with 79.6% mAP, which incorporates a multi-attention mechanism to enhance the precision of detecting multiple underwater objects. Despite the various improved algorithms proposed by researchers, additional research and refinement are needed to evaluate their suitability for intricate and dynamic underwater environments.
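The SE module reused in models such as YOLOv5s-CA is a small computation: squeeze each channel to a scalar by global average pooling, excite that vector through two fully connected layers, and rescale the channels by the result. A NumPy sketch follows, with the weight matrices as stand-ins for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation over a (C, H, W) feature map.
    w1: (C//r, C) and w2: (C, C//r) for reduction ratio r; the squeeze
    vector passes through ReLU then sigmoid before rescaling channels."""
    squeeze = x.mean(axis=(1, 2))                           # (C,) channel stats
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))    # (C,) gates in (0, 1)
    return x * excite[:, None, None]                        # channel-wise rescale
```

Because the gates depend on the whole feature map, the block lets the network emphasize informative channels per image, at the cost of only two small matrix products.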
Other methods for underwater target detection are also being explored, such as sonar imaging, laser imaging, polarization imaging, and electronic communications. In forward-looking sonar imaging, the geometric, grayscale, and statistical features of some interferences are similar to those of the targets, which makes target detection difficult. Liu [51] therefore proposed an underwater linear target detection technique combining the Hough transform and threshold segmentation, which can effectively extract linear objects. The research in [52] introduces a tracking filter designed to combine Ultra-Short BaseLine (USBL) measurements and acoustic image measurements; this approach achieves dependable underwater target tracking.
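The Hough transform used in such linear target detection votes every foreground pixel into a (rho, theta) accumulator, so that collinear pixels pile up in a single bin. A compact NumPy sketch follows; the angular resolution and peak threshold are illustrative choices, and a real pipeline would apply threshold segmentation to the sonar image first to obtain the binary input:

```python
import numpy as np

def hough_lines(binary, n_theta=180, peak_frac=0.5):
    """Vote each foreground pixel of a binary image into a (rho, theta)
    accumulator; return (rho, theta_degrees) pairs whose vote count
    reaches peak_frac of the strongest bin."""
    H, W = binary.shape
    thetas = np.deg2rad(np.arange(n_theta))
    diag = int(np.ceil(np.hypot(H, W)))            # bound on |rho|
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    ys, xs = np.nonzero(binary)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1  # one vote per theta bin
    peaks = np.argwhere(acc >= peak_frac * acc.max())
    return [(int(r - diag), float(np.rad2deg(thetas[t]))) for r, t in peaks]
```

On a synthetic image containing a vertical line at x = 5, the strongest accumulator bin corresponds to rho = 5 at theta = 0 degrees.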
5. Conclusions and Future Work
Unlike conventional terrestrial imagery, underwater imagery is influenced by a diversity of factors, including water turbidity, floating objects in the water, and light refraction and absorption, which may result in image distortion or reduced visibility of the target. Underwater applications also require smaller and lighter devices, so large models may not be applicable; it is therefore a challenge to find a balance between model size and performance. In summary, this work reviews the development of underwater target detection approaches, including conventional object detection approaches and methods based on deep learning. Then, building on the analysis of state-of-the-art underwater sea cucumber detection approaches and aiming to provide a reference for practical underwater identification, adjacent and overlapping sea cucumber detection based on YOLOv5 and DETR, representative one-stage and anchor-free deep learning methods, respectively, is investigated and compared thoroughly. Detection experiments with the two approaches were deployed on the measured dataset; they demonstrate the outstanding performance of YOLOv5 in accuracy and speed. The results show that YOLOv5 outperforms DETR in terms of low computing consumption and high precision, particularly in detecting small and dense targets. However, it is worth noting that DETR is developing rapidly and holds promising prospects in underwater object detection applications due to its relatively simple architecture and innovative attention mechanism.
The similarity among all samples is high, so there is no significant difference between the validation set and the training set, which leads to a higher mAP value on the validation set. To further deepen the current research, the principal future work is as follows:
Improving detection accuracy and processing time and optimizing the architecture and hyperparameters of both the YOLOv5 and DETR models;
Exploring and evaluating the performance of YOLOv5 and DETR in detecting other marine species;
Developing new data augmentation techniques to expand the diversity and quantity of training data for underwater target detection;
Developing real-time object detection systems using YOLOv5 and DETR and evaluating their performance in practical scenarios;
Introducing appropriate image preprocessing techniques during training and detection to further enhance the performance of the models.