An Improved Underwater Recognition Algorithm for Subsea X-Tree Key Components Based on Deep Transfer Learning

Zhao, Wangyuan; Han, Fenglei; Su, Zhihao; Qiu, Xinjie; Zhang, Jiawei; Zhao, Yiming

doi:10.3390/jmse10101562

College of Shipbuilding Engineering, Harbin Engineering University, Harbin 150001, China

J. Mar. Sci. Eng. 2022, 10(10), 1562; https://doi.org/10.3390/jmse10101562

Submission received: 21 September 2022 / Revised: 13 October 2022 / Accepted: 17 October 2022 / Published: 21 October 2022

(This article belongs to the Section Ocean Engineering)

Abstract

Using a remotely operated vehicle (ROV) to inspect or maintain subsea X-trees is a promising approach. In this article, an efficient recognition model for subsea X-tree components is proposed to assist the autonomous operation of unmanned underwater maintenance vehicles. An efficient network, SX(subsea X-tree)-DCANet, is designed by replacing the CSPBlock of YOLOv4-tiny with ResBlock-D and combining it with the ECANet attention module. In addition, two-stage transfer learning is used to address the shortage of underwater target recognition samples and the resulting overfitting of the recognition model, thereby providing an effective learning strategy for subsea target recognition. A mosaic data augmentation algorithm and a cosine annealing algorithm are also used to improve training accuracy. The results of ablation studies show that the mean Average Precision (mAP) and speed of the improved algorithm are increased by 1.58% and 10.62%, respectively. Multiple field experiments in the laboratory, an experimental pool, and a hydro-electric station show that the recognition algorithm and training strategy presented in this article can be applied well to subsea X-tree component recognition and can effectively promote the development of intelligent subsea oil extraction projects.

Keywords:

subsea X-tree; one-shot object detectors; efficient channel attention module; two-stage deep transfer learning; field experiment

1. Introduction

Under the trend of economic globalization, light and heavy industries are developing rapidly, and energy demand is increasing in all industries. As onshore oil and gas exploration matures, the growth of global oil and gas reserves and the size of newly found reservoirs are decreasing, while the difficulty of exploration is increasing. To meet the demand for oil, marine oil and gas development has attracted attention at a time when large oil fields are becoming depleted [1]. Therefore, safe, environmentally friendly, and efficient offshore oil development has become a research focus for addressing the problem of decreasing resources at this stage [2]. As the functional portal of the underwater operation system, the subsea X-tree is the key equipment for offshore oil and gas production and downhole operation [3]. Studies on fatigue relief [4] and resistance [5] of subsea X-tree drilling continue to be updated, but most of the experimental focus is on strength and mechanical properties, while intelligent applications in this field are relatively scarce. The identification of subsea X-tree key components is therefore a key step for intelligent research in this field. In this study, we develop a target recognition network for the key components of the subsea X-tree in their daily working environment, designed to be lightweight while improving accuracy.

The underwater remotely operated vehicle (ROV) plays an important role in X-tree installation and daily maintenance because of its deep-water operation ability and high safety [6]. In addition to carrying the cameras that track the installation and movement of the X-tree in real time, ROVs need to perform tasks such as switching temporary well-head covers, locking and unlocking the tree, and subsequent testing. However, remote-control operation with a low level of intelligence cannot meet the needs of such refined tasks. Therefore, intelligent perception-assisted operation based on optical vision has become a hot research topic at this stage. Underwater robot operating systems based on optical visual servoing have been applied in an increasing number of underwater operation scenarios [6,7,8]. Fatan [9] used a multilayer perceptron to identify underwater cables from an AUV in order to guide underwater fiber optic cable overhaul operations. Han [10] conducted a classification and recognition study of benthic organisms on the seafloor to detect and locate primary targets while underwater robots collect seafood. Nevertheless, at this stage, there is still a lack of related research on intelligent recognition algorithms for subsea X-trees.

In addition, although data enhancement is necessary, it is worth considering whether to use digital image preprocessing to compensate for the low quality of seafloor images in the actual recognition process. Because the water is turbid and contains many suspended substances, the obtained underwater images have low contrast, serious color attenuation, and optical scattering [11]. Underwater image restoration for low illumination, light scattering, and other physical factors has been a hot research topic, and many underwater target recognition applications rely on image preprocessing to reduce the misrecognition caused by environmental distortion [12,13]. However, given the limited computing power of the X-tree maintenance robot and the complexity of the underwater environment, this paper instead adopts a two-stage transfer learning strategy: a model pre-trained on ImageNet is first transferred to a dataset captured on land, and the result is then transferred to the actual underwater experimental dataset. The X-tree component model used in the recognition experiments in this paper is shown in Figure 1:

The overall blurring and cyan tones of optical images taken in the underwater environment make it difficult to extract location and feature information. Therefore, feature extraction ability, recognition accuracy, and recognition speed all need to be taken into account when designing the subsea X-tree component identification network, so that it can be applied to subsea X-tree maintenance tasks in a real environment. The main contributions of this paper are as follows:

(1): A target detection algorithm for subsea X-tree components is proposed, which can be applied to VR-assisted operation positioning and to real-time robot positioning and mapping. The algorithm performs well in comparative experiments and fills a gap in intelligent detection and recognition for subsea X-trees.
(2): Based on the YOLOv4-tiny recognition framework, the backbone replaces the original CSPNet block with ResNet-D to speed up detection and avoid the loss of information caused by 1 × 1 convolution and down-sampling.
(3): A more efficient attention mechanism is applied to the two feature layers extracted by the backbone feature extraction module, that is, the main feature extraction results are optimized before a larger receptive field is obtained, to ensure more accurate feature information.
(4): A two-stage transfer learning training strategy is presented. A model pre-trained on ImageNet is transferred to the model dataset captured on land; then, the training results on that dataset are transferred to the underwater recognition training task. This training strategy compensates for the small size and single scene of the underwater dataset and effectively improves recognition accuracy.
(5): This paper establishes a subsea X-tree part identification dataset with multiple backgrounds. Because effective image and video information of a real subsea X-tree cannot be obtained directly, we build models of subsea X-tree parts using 3D printing technology and build part datasets under different backgrounds, including underwater backgrounds. Mosaic data enhancement is used to augment the data during training.

2. Related Work

As a hotspot of artificial neural networks and machine learning, deep learning has shown great advantages in speech recognition [14,15], text recognition [16,17], pedestrian detection [18,19], etc. Deep learning networks achieve target recognition by applying nonlinear transformations to feature information to obtain higher-order image features. Detection methods based on deep learning can be divided into two categories, one-stage and two-stage. One-stage methods directly predict the location and category of different targets through a convolutional neural network, with representative algorithms such as YOLO [20,21] and SSD [22]; two-stage methods first generate candidate boxes to determine the target location and then perform classification and regression on the candidate boxes, with representative algorithms R-CNN [23], Fast R-CNN [24], and Faster R-CNN [25]. Both one-stage and two-stage methods are box-based image recognition algorithms, which are the most widely used in the field of image recognition. Boxes are divided into validation boxes and ground truth boxes, and validation boxes can be further divided into bounding boxes and anchor (prior) boxes. The bounding box outputs the specific position (the coordinates of the center point of the box and its width and height), the confidence, and the category, while the anchor box provides only the width and height of the box. Different algorithms use different boxes to detect the target, so the speed and accuracy of recognition vary.

YOLOv4-tiny is a lightweight version of YOLOv4 with about 6 million parameters, roughly one-tenth of those of YOLOv4. The network has 38 layers in total, including three residual units. The channels of the feature extraction network are split following CSPNet, that is, the feature layer output by a 3 × 3 convolution is divided along the channel dimension into two parts. Two effective feature layers are used for target classification and regression, and a feature pyramid network (FPN) is used to merge the effective feature layers. The activation function is LeakyReLU. Its structure is shown in Figure 2:

YOLOv4-tiny is characterized by multitasking, end-to-end, and multiscale detection. The network accomplishes target classification and regression at the same time, sharing parameters and avoiding over-fitting. In addition, the network fuses down-sampled and up-sampled features with each other, which allows targets to be detected at multiple scales. As a result, YOLOv4-tiny achieves 40.2% AP50 at 371 FPS on the COCO dataset, which is significantly better than other lightweight models.

3. Methodology

3.1. Efficient Channel Attention Module

The ECA module is a variant of the SE (squeeze-and-excitation) module. The basic idea is that the features u are aggregated across the spatial dimensions of the input feature map using global average pooling, which helps to improve computational efficiency. Figure 3 shows the structure of the efficient channel attention module. The statistic z is generated by shrinking u through its spatial dimensions H × W, such that the c-th element of z is computed as:

z_{c} = F_{sq}(u_{c}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_{c}(i, j)

(1)

s = F_{ex}(z, W) = σ(g(z, W)) = σ(W_{1} δ(W_{0} z))

(2)

δ is the ReLU function and σ is the sigmoid function.

To demonstrate the effect of channel dimensionality reduction and the benefit of learning cross-channel interaction, the ECA authors compare three variants, SE-Var1, SE-Var2, and SE-Var3, none of which performs dimensionality reduction. Because SE-Var3 has too many parameters, they propose an efficient method for capturing local cross-channel interaction to reduce the complexity:

ω_{i} = σ\left(\sum_{j=1}^{k} w_{i}^{j} y_{i}^{j}\right), \quad y_{i}^{j} \in Ω_{i}^{k}

(3)

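Here Ω_{i}^{k} denotes the set of k channels adjacent to y_{i}. This formulation involves k × C weight parameters and avoids treating the channels as completely independent of one another. The kernel size k determines the coverage of the local cross-channel interaction, and the mapping between k and the channel dimension C is: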

C = ϕ(k) = 2^{(r \cdot k - b)}, \quad k = ψ(C) = \left| \frac{\log_{2}(C)}{r} + \frac{b}{r} \right|_{odd}

(4)
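The kernel-size mapping in Equation (4) and the local cross-channel weighting in Equation (3) can be illustrated with a minimal, framework-free sketch. The defaults r = 2 and b = 1, the rounding rule (truncate, then move even values up to the next odd number), the zero padding at the channel borders, and the use of one kernel shared by all channels are assumptions made for illustration; the function names are hypothetical and are not taken from the paper.

```python
import math


def eca_kernel_size(channels, r=2, b=1):
    """Adaptive kernel size k = |log2(C)/r + b/r|_odd from Equation (4).

    Assumed rounding: truncate to an integer, then bump even values to
    the next odd number.
    """
    t = int(abs(math.log2(channels) / r + b / r))
    return t if t % 2 == 1 else t + 1


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def eca_weights(z, kernel):
    """Channel attention weights from pooled descriptors z.

    z      : per-channel values after global average pooling (Equation (1))
    kernel : k weights (k odd) shared by all channels; each output channel
             only sees its k neighbouring channels, with zero padding
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(z) + [0.0] * pad
    return [
        sigmoid(sum(kernel[j] * padded[i + j] for j in range(k)))
        for i in range(len(z))
    ]


if __name__ == "__main__":
    for c in (64, 256, 512):
        print(c, eca_kernel_size(c))
    print(eca_weights([0.2, 0.8, 0.1, 0.5], [0.3, 0.4, 0.3]))
```

Each weight returned by `eca_weights` would then multiply the corresponding channel of the feature map u.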

3.2. Replace CSPBlock with ResBlock-D

Because duplicate gradient information arises during network optimization, the Cross-Stage Partial Network (CSPNet) was designed, from the perspective of network architecture, to alleviate the heavy inference computation required by previous work. CSPNet preserves the variability of gradients by integrating the feature maps from the beginning and the end of a network stage. As the residual module of YOLOv4-tiny, CSPNet reduces computational effort by 20% while maintaining excellent accuracy. Nevertheless, field tests show that the recognition speed still needs to be improved because of the limited computing power of the underwater vehicle. Therefore, the ResBlock-D module is used to replace the original CSPBlock module of YOLOv4-tiny.

ResBlock-D uses a two-path network to process the input feature map. Path 1 consists of two 1 × 1 convolution layers with a 3 × 3 convolution layer between them, and Path 2 consists of an average pooling layer and a 1 × 1 convolution layer. ResBlock-D replaces a 3 × 3 convolution layer of the original CSPBlock module with a 1 × 1 convolution layer, so although two layers are added in Path 2, the computation saved is still greater than the computation added. The structures of the CSPBlock and ResBlock-D modules are compared in Figure 4. The number of floating-point operations (FLOPs) quantifies how much lighter the module is:

FLOPs = \sum_{l=1}^{D} M_{l}^{2} \times K_{l}^{2} \times C_{l-1} \times C_{l}

(5)

where D is the total number of convolution layers, M_{l}^{2} is the size of the output feature map of the l-th convolution layer, K_{l}^{2} is the size of its convolution kernel, and C_{l-1} and C_{l} are the numbers of input and output channels, respectively. From this, the complexity of the CSPBlock and ResBlock-D modules can be calculated. Assuming the size of the input is N × N and the number of channels is C, the FLOPs of CSPBlock and ResBlock-D are, respectively:

{FLOPs}_{CSPBlock} = N^{2} \times 3^{2} \times C^{2} + N^{2} \times 3^{2} \times C \times \frac{C}{2} + N^{2} \times 3^{2} \times \frac{C^{2}}{4} + N^{2} \times 1^{2} \times C^{2} = \frac{67}{4} N^{2} C^{2}

(6)

\begin{aligned} {FLOPs}_{ResBlock-D} &= N^{2} \times 1^{2} \times C \times \frac{C}{2} + \frac{N^{2}}{4} \times 3^{2} \times \frac{C^{2}}{4} + \frac{N^{2}}{4} \times 1^{2} \times C \times \frac{C}{2} + \frac{N^{2}}{4} \times 1^{2} \times C^{2} + C \times 2^{2} \times \frac{N^{2}}{4} \\ &= \left(\frac{23}{16} + \frac{1}{C}\right) N^{2} C^{2} \end{aligned}

(7)

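It can be seen from these formulas that the ratio of the two complexities is \frac{67}{4} / (\frac{23}{16} + \frac{1}{C}), which approaches 268/23 ≈ 11.65 as C grows and exceeds 10 for any integer channel count C ≥ 5. CSPBlock is therefore more than ten times as computationally expensive as ResBlock-D for practical channel counts.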
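As a quick check of Equations (6) and (7), the sketch below evaluates both expressions term by term with exact fractions. The function names and the example values N = 104 and C = 64 are chosen here only for illustration.

```python
from fractions import Fraction


def cspblock_flops(n, c):
    """Term-by-term evaluation of Equation (6) for an N x N input with C channels."""
    return (
        Fraction(n**2 * 9 * c * c)        # 3x3 conv, C -> C
        + Fraction(n**2 * 9 * c * c, 2)   # 3x3 conv, C -> C/2
        + Fraction(n**2 * 9 * c * c, 4)   # 3x3 conv, C/2 -> C/2
        + Fraction(n**2 * c * c)          # 1x1 conv, C -> C
    )


def resblock_d_flops(n, c):
    """Term-by-term evaluation of Equation (7)."""
    q = Fraction(n**2, 4)  # feature map area after 2x down-sampling
    return (
        Fraction(n**2 * c * c, 2)         # 1x1 conv, C -> C/2, full resolution
        + q * 9 * Fraction(c * c, 4)      # 3x3 conv, C/2 -> C/2
        + q * Fraction(c * c, 2)          # 1x1 conv, C/2 -> C
        + q * c * c                       # 1x1 conv, C -> C (pooling path)
        + c * 4 * q                       # 2x2 average pooling
    )


def flops_ratio(n, c):
    """How many times more FLOPs CSPBlock needs than ResBlock-D."""
    return cspblock_flops(n, c) / resblock_d_flops(n, c)


if __name__ == "__main__":
    n, c = 104, 64
    print(float(flops_ratio(n, c)))
```

Because N² factors out of every term, the ratio depends only on C.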

3.3. Network Architecture

Because of the special working conditions, underwater target detection suffers from problems such as low image quality and small detection objects, and common detection methods perform only moderately well. Our network structure is shown in Figure 5. The size of the network input is 416 × 416 × 3, chosen so that the output feature maps of the network have an odd width and height and each feature map therefore has a single central cell when divided into cells. The image is first passed through two convolution layer modules to obtain an output of 104 × 104 × 64. Each convolution layer module contains a batch normalization layer and LeakyReLU as the activation function after the convolution layer. This keeps the network fast and prevents overfitting, which can lead to training anomalies. Usually, the input is passed through the convolution layer module and the residual module, and deep semantic information is obtained through the convolution operations; meanwhile, the up-sampled features of the middle and back layers of the network are concatenated, achieving multi-scale feature fusion without changing the tensor dimension.

In this paper, the backbone feature extraction network reduces the computational complexity by replacing the CSPBlock in the last layer with ResBlock-D. Using the backbone feature extraction network and the attention module, two effective feature layers with scales of 26 × 26 × 256 and 13 × 13 × 512 are obtained and passed to the enhanced feature extraction network to construct the Feature Pyramid Network (FPN). The FPN feeds the effective feature layer of the latter scale into a convolution module, up-samples it, and then stacks and convolves it with the effective feature layer of the former scale. Fusing the high and low layers improves the target detection performance.

In the feature utilization part, the network extracts two feature layers with scales of 26 × 26 × 255 and 13 × 13 × 255 for target detection, outputs the positions of the prediction boxes, and then applies score screening and non-maximum suppression to the results to obtain the final results.
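The output shapes quoted above follow from simple arithmetic, sketched below. The sketch assumes the usual YOLO head layout, in which each grid cell predicts `num_anchors` boxes with 4 coordinates, 1 confidence score, and `num_classes` class scores, and it assumes down-sampling strides of 16 and 32. Under these assumptions, 255 channels correspond to 3 anchors and 80 classes. The function name and defaults are hypothetical.

```python
def head_shapes(input_size=416, strides=(16, 32), num_anchors=3, num_classes=80):
    """Return the (height, width, channels) of each detection head output."""
    channels = num_anchors * (5 + num_classes)  # 4 box terms + 1 confidence + classes
    return [(input_size // s, input_size // s, channels) for s in strides]


if __name__ == "__main__":
    print(head_shapes())               # shapes for the default settings
    print(head_shapes(num_classes=6))  # the same heads with six part classes
```

The class count is left as a parameter because the channel depth changes with the number of target categories.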

3.4. K-Means Clustering Anchor Box and Predicted Results Decoded

In SSD and Faster R-CNN, manually designed anchor boxes can degrade detection performance when they deviate significantly from the target size. In this paper, the K-means clustering method is applied to the labeled bounding boxes, replacing the manually designed anchor boxes with anchors obtained by unsupervised learning. The method divides the dataset into k clusters, uses the Intersection over Union (IOU) to assign each sample to its closest cluster center, and iteratively updates the centers until the anchors no longer change, yielding the optimal box widths and heights. The IOU metric is defined as follows.

IOU(box, anchor) = \frac{intersection}{union - intersection}

(8)

where intersection and union are given below; note that union here denotes the sum of the two box areas, so that union − intersection in Equation (8) is the area of the true union.

intersection = \min(Anchor_{w}, box_{w}) \times \min(Anchor_{h}, box_{h})

(9)

union = Anchor_{w} \times Anchor_{h} + box_{w} \times box_{h}

(10)

The more similar the boxes, the larger their IOU. The IOU takes values between 0 and 1, and when the box and the anchor completely overlap, i.e., IOU = 1, the distance between them should be 0, so the final distance metric is:

d(box, anchor) = 1 - IOU(box, anchor)

(11)
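The clustering procedure described by Equations (8) to (11) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It reads "union" in Equation (10) as the sum of the anchor and box areas, updates each cluster center with the mean width and height (some implementations use the median), and uses hypothetical function names and toy box sizes.

```python
import random


def iou_wh(box, anchor):
    """IOU of two boxes given as (width, height), aligned at a common corner."""
    bw, bh = box
    aw, ah = anchor
    intersection = min(aw, bw) * min(ah, bh)
    area_sum = aw * ah + bw * bh          # "union" in Equation (10)
    return intersection / (area_sum - intersection)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs with the distance d = 1 - IOU."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)        # initial centers drawn from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = min(range(k), key=lambda j: 1.0 - iou_wh(box, anchors[j]))
            clusters[nearest].append(box)
        new_anchors = []
        for j, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_anchors.append((w, h))
            else:
                new_anchors.append(anchors[j])  # keep an empty cluster's center
        if new_anchors == anchors:        # stop when the anchors no longer change
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


if __name__ == "__main__":
    toy_boxes = [(10, 10), (11, 11), (50, 60), (52, 58)]
    print(kmeans_anchors(toy_boxes, k=2))
```

Only widths and heights are clustered, since the anchor shape is independent of where the box sits in the image.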

Detection is achieved by taking the outputs of multiple feature layers and decoding them to obtain the prediction boxes. Each feature layer is divided into a grid of a different size for detection; the grids are 19 × 19 and 38 × 38, giving prediction results of shape (N, 19, 19, 3, 85) and (N, 38, 38, 3, 85). For each grid cell, a bounding box (bbox) is predicted, consisting of the parameters (x, y) for the position of the center of the prediction box, (H, W) for the height and width of the box, the confidence, which is based on the IOU explained in the previous section, and the final classification results. The center of the prediction box is obtained from the grid point and the offsets (x, y), and the prior box and the parameters (H, W) are then used to calculate the actual height and width of the prediction box, which gives the location of the detection box. The decoding process of the prediction results is shown in Figure 6.
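The decoding step can be sketched for a single grid cell as follows. The text does not give the exact formulas, so this sketch assumes the widely used YOLOv3-style parameterisation: a sigmoid offset inside the cell for the center, and exponential scaling of the prior box for the size. The function and argument names are hypothetical, and all outputs are normalized to the range [0, 1] of the image.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_cell(tx, ty, tw, th, col, row, grid_w, grid_h, anchor_w, anchor_h):
    """Turn raw network outputs for one cell and one prior box into a box.

    (tx, ty)   raw center offsets predicted for the cell
    (tw, th)   raw size terms relative to the prior (anchor) box
    (col, row) index of the grid cell
    anchor_*   prior box size, normalized by the image size
    Returns (center_x, center_y, width, height), all normalized.
    """
    center_x = (col + sigmoid(tx)) / grid_w   # offset stays inside the cell
    center_y = (row + sigmoid(ty)) / grid_h
    width = anchor_w * math.exp(tw)           # prior box scaled by the prediction
    height = anchor_h * math.exp(th)
    return center_x, center_y, width, height


if __name__ == "__main__":
    # Central cell of a 19 x 19 grid with zero raw outputs.
    print(decode_cell(0.0, 0.0, 0.0, 0.0, 9, 9, 19, 19, 0.2, 0.3))
```

With zero raw outputs the box sits at the middle of its cell and has exactly the size of the prior box, which is why well-chosen priors make the regression easier.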

3.5. Stochastic Gradient Descent with Restart Algorithm

To improve the convergence speed of the model and avoid adding new hyperparameters, stochastic gradient descent (SGD), rather than batch gradient descent or mini-batch gradient descent, is chosen as the optimizer for network training. However, during training on underwater targets, gradient descent can fall into a local minimum, and saddle points encountered as the gradient converges degrade the performance of the model. Therefore, this paper adopts the warm restart method, which escapes local optima by periodically increasing the learning rate. The Stochastic Gradient Descent with Restart (SGDR) algorithm uses cosine annealing to decrease the learning rate between restarts [26].

A warm restart is performed after every T_{i} epochs. It is simulated by increasing the learning rate while the current weights are used as the initial solution, and the neural network weights are then further optimized by gradient descent. The principle of cosine annealing is as follows.

η_{t} = η_{min}^{i} + \frac{1}{2} (η_{max}^{i} - η_{min}^{i}) \left(1 + \cos\left(\frac{T_{cur}}{T_{i}} π\right)\right)

(12)

where i is the index of the run (the period between two restarts); η_{max}^{i} and η_{min}^{i} denote the maximum and minimum values of the learning rate, respectively; T_{cur} is the number of epochs performed since the last restart; and T_{i} denotes the total number of epochs in the i-th run.
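A minimal sketch of this schedule is given below, using the standard SGDR form in which half of the learning-rate range is multiplied by (1 + cos(π T_cur / T_i)). The function names, the fixed learning-rate bounds for every run, and the optional period multiplier `t_mult` are assumptions for illustration; the values actually used for training are those listed in the paper's hyperparameter table.

```python
import math


def sgdr_learning_rate(t_cur, t_i, eta_min, eta_max):
    """Cosine-annealed learning rate t_cur epochs into a run of length t_i."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


def sgdr_schedule(num_epochs, t_0, eta_min, eta_max, t_mult=1):
    """Per-epoch learning rates with a warm restart at the end of every run."""
    rates = []
    t_i, t_cur = t_0, 0
    for _ in range(num_epochs):
        rates.append(sgdr_learning_rate(t_cur, t_i, eta_min, eta_max))
        t_cur += 1
        if t_cur >= t_i:      # run finished: restart from the maximum rate
            t_cur = 0
            t_i *= t_mult     # optionally lengthen the next run
    return rates


if __name__ == "__main__":
    for epoch, lr in enumerate(sgdr_schedule(20, t_0=10, eta_min=1e-5, eta_max=1e-3)):
        print(epoch, round(lr, 6))
```

Within a run the rate decays smoothly from the maximum towards the minimum, and the jump back to the maximum at each restart is what lets the optimizer leave a poor local optimum.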

3.6. Transfer Learning and Data Enhancement

3.6.1. Two-Stage Transfer Learning

Transfer learning is a machine learning method, whose main idea is to apply the knowledge of related fields to the target field. As most of the data are related to the task, transfer learning is often used to assist the learning of new models. Pan [27] defined the domains and tasks in transfer learning. With the development of deep learning, the idea of transfer learning has been applied to deep learning tasks. Tan [28] proposed the definition of deep transfer learning:

\{D_{s}, T_{s}, D_{t}, T_{t}, f(\cdot)\}

(13)

where f(\cdot) is a nonlinear function represented by a deep neural network, so that the task is a deep transfer learning task; T_{t} is the learning task in the target domain D_{t}, and T_{s} is the learning task in the source domain D_{s}. In most cases, the scale of D_{s} is much larger than that of D_{t}. Given a domain D, a task T is represented by its label space Y and a predictive function f(\cdot):

T = \{Y, f(\cdot)\}

(14)

Generally, according to the relationship between the source domain and the target domain, transfer learning methods can be divided into four categories [29]: instance-based, feature-based, parameter/model-based, and relation-based. At present, model-based deep transfer learning, that is, taking models pre-trained on large benchmark datasets in the source domain and fine-tuning them, has become the mainstream approach. Using ImageNet [30] for pre-training can bring significant improvement when the data for the initial task are insufficient, avoid some optimization problems on the target data, and shorten the research period. In contrast, when training from scratch the initial weights are too random, and the feature extraction ability is seriously reduced. Therefore, this paper proposes two-stage transfer learning for the identification of subsea X-tree parts:

(1): The source domain identification task is carried out on the non-underwater dataset, with the models of subsea X-tree key parts photographed in ordinary scenes used as the source domain training samples. The network parameters of a CNN pre-trained on ImageNet are used as the initialization parameters for training this model.
(2): A target domain model that is similar to the source domain model is built.
(3): Identifying the key parts of the subsea X-tree in the real underwater test scene is taken as the target domain task, and the parameters of the model trained in step (1) are used as the initialization parameters for training the target domain model. The flow of the two-stage transfer model is shown in Figure 7 below.

3.6.2. Data Enhancement

The construction and pre-processing of the land-based dataset are key to underwater target recognition. We mirror, rotate, and shear the images in the dataset, and we simulate imaging features of underwater filming, such as uneven exposure and partial darkness, through the placement of the parts, thus weakening the scene degradation seen in underwater recognition. The hues in a real underwater environment are cold and dominated by blue-green tones, so we simulate them by active image color compensation. These effects are built into the training data, rather than corrected by image preprocessing at recognition time, because the underwater robot has limited computational power during operation and needs to perform a series of further operations, such as positioning, after the recognition prediction.

Conventional data enhancement methods such as flipping and color gamut changes can effectively enhance the data. In the scenario considered in this paper, however, the orientation and chromaticity of the targets are fixed, so a data enhancement method is needed that enriches the background of the target objects and prevents the network from generalizing poorly because the backgrounds in the training set are too similar. Mosaic data enhancement is chosen to compensate for the small size of the underwater dataset. The mosaic method stitches four labeled images together to obtain a new image. The flow of the mosaic data enhancement method is shown in Figure 8.

First, the width and height of the input image (W, H) are used as boundary values to scale the image, and the scaling factors t_{X} and t_{Y} are calculated as follows:

t_{X} = f_{random}(t_{W}, t_{W} + Δt_{W})

(15)

t_{Y} = f_{random}(t_{H}, t_{H} + Δt_{H})

(16)

where f_{random} represents the random value function; t_{W} and t_{H} are the minimum width and height scaling values of the image, respectively; and Δt_{W} and Δt_{H} are the lengths of the random ranges of the width and height scaling values, respectively.

The coordinates of the upper-left and lower-right corners of each image in the group after scaling are [(a_{i}, b_{i}), (c_{i}, d_{i})], which are obtained from the following equations:

a_{i} = \begin{cases} 0, & i = 1, 2 \\ W \times r_{1}, & i = 3, 4 \end{cases}

(17)

b_{i} = \begin{cases} 0, & i = 1, 2 \\ H \times r_{2}, & i = 3, 4 \end{cases}

(18)

c_{i} = a_{i} + W \times t_{W}

(19)

d_{i} = b_{i} + H \times t_{H}

(20)

where r₁ and r₂ are the ratios, along the X and Y axes respectively, of the distance between the origin O and the upper-left corner of the images that do not start at O to the total width and height of the stitched image. From this, the coordinates of the segmentation lines can be obtained:

x = W \times r_{1}, \quad y = H \times r_{2}

(21)

If an image is partially or completely truncated during scaling and stitching, the ground-truth boxes of its targets must also be cropped accordingly.
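The cropping of ground-truth boxes can be sketched as follows. The sketch assumes a 2 × 2 mosaic whose segmentation lines lie at x = W × r₁ and y = H × r₂, boxes already expressed in the coordinates of the stitched canvas as (x1, y1, x2, y2), and a hypothetical minimum size below which a clipped box is discarded. The function names are illustrative only.

```python
def mosaic_tiles(width, height, r1, r2):
    """Four tile regions (x1, y1, x2, y2) of a 2 x 2 mosaic split at (W*r1, H*r2)."""
    split_x, split_y = width * r1, height * r2
    return [
        (0, 0, split_x, split_y),            # top-left tile
        (split_x, 0, width, split_y),        # top-right tile
        (0, split_y, split_x, height),       # bottom-left tile
        (split_x, split_y, width, height),   # bottom-right tile
    ]


def clip_box_to_tile(box, tile, min_size=1.0):
    """Clip a ground-truth box to its tile; return None if too little remains."""
    x1, y1, x2, y2 = box
    tx1, ty1, tx2, ty2 = tile
    cx1, cy1 = max(x1, tx1), max(y1, ty1)
    cx2, cy2 = min(x2, tx2), min(y2, ty2)
    if cx2 - cx1 < min_size or cy2 - cy1 < min_size:
        return None                          # box is truncated away completely
    return (cx1, cy1, cx2, cy2)


if __name__ == "__main__":
    tiles = mosaic_tiles(416, 416, 0.5, 0.5)
    print(clip_box_to_tile((150, 150, 300, 260), tiles[0]))  # partially truncated
    print(clip_box_to_tile((300, 300, 400, 400), tiles[0]))  # outside the tile
```

Boxes that lose almost all of their area are dropped rather than kept as thin slivers, since such labels no longer describe a recognizable part.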

4. Results

4.1. Parts Selection, Labeling Strategy, and Test Introduction

In this paper, the experimental dataset for verifying the recognition algorithm for subsea X-tree components contains 3654 (3000 + 654) images. The dataset is produced from models of real subsea X-tree parts made by 3D printing (the digital model files of the parts are input into the printer, which prints them layer by layer using adhesive materials such as powdered metal or plastic). The ratio of the model size to the size of the real subsea X-tree parts is 5:1, and the models are painted in the same colors as the original subsea X-tree. The selected parts are the umbilical cable terminal connector, electric connector, terminal connector, operating handle, set down position bend limiter, and ROV control panel knob.

During the labeling process, different scenes are selected as the background of the dataset, using a mixture of backlight, low brightness, green, brown, blue, and other artificially constructed background environments. This improves the scene generalization of the training dataset and the robustness of model recognition in complex underwater environments, thus reducing problems such as model distortion due to scene inconsistency.

For the experiments, the recognition task runs on the robot's host computer; the hardware configuration of this platform is described in the next section. The subsea X-tree parts are lowered into the water using a crane, and the images for recognition are acquired through the optical vision system carried on the ROV. The specific parameters of the robot are shown in Table 1 and Figure 9.

All the tests carried out in this paper are completed in the pool. As subsequent work will develop underwater VR simulation technology, we also develop a simulation environment covering the pool, the robot, and target recognition. In addition to guiding the daily operation and maintenance of subsea X-tree parts, the robot can also position itself by identifying targets and input the resulting location information into the system. Figure 10 illustrates an experiment in one of the experimental pools in which the target recognition results were fed back to the VR system for auxiliary positioning.

4.2. Training Configuration and Parameters

The hardware configuration is shown in Table 2:

The network training hyperparameters are shown in Table 3.

4.3. Test Records and Comparison of Training Results

This paper describes four experiments in total. The first is simply to identify the 3D-printed models in the laboratory on land. This work has two purposes: to test the reliability of the basic recognition algorithm, and to establish a land dataset that provides the weights used for transfer learning. Two representative underwater experiments are then conducted in different pools and at different depths to test the recognition ability under different degrees of environmental distortion. Next, recognition in turbid waters is carried out to verify the generalization ability of the proposed algorithm in underwater workpiece recognition. Ablation tests are shown at the end. The result of applying the land pre-training weights directly to the underwater dataset is not outstanding; the comparison of different evaluation functions at different epochs is shown in Figure 11, which represents the initial state of transfer learning near step 0. This is consistent with what is observed in the experiments, where environmental distortion leads to recognition errors.

The construction and recognition of the land dataset are relatively simple. Rotating and translating the components allows us to capture more angles and thus greatly improve the recognition accuracy, which depends on the quality of the dataset. We build a dataset of about 12,000 samples, so the recognition rate is basically 100% in normal scenarios. Figure 12 shows some of the tests in the laboratory, including normal conditions, motion blur, shadowing, and complex angles. It can be seen that the average recognition rate is still over 97% even with motion blur and changes in distance and angle.

Two pool tests are performed because the imaging conditions described by the underwater scattering model differ between waters and water depths. First, training is carried out using the previous pre-trained model and the labeled underwater targets, and the result is applied to target recognition in the two pools. Different pools and depths produce different distortion effects: background changes, tint changes, limited visual range, etc. Therefore, the weights that were already stable in the previous part cannot identify the targets in this environment. At the same time, because of the test environment, we cannot obtain a large underwater dataset, and once the position or angle of the parts changes, recognition may fail. Nevertheless, the strategy proposed in this paper performs well in underwater identification. It is worth noting that transfer learning is not ablated in this paper, because the non-underwater dataset cannot be used directly underwater: the pre-trained model obtained in the first stage fails in the initial recognition experiment, with a recognition rate of 0 when the observation distance of the underwater robot is 0.5–1.5 m. The test results therefore show that the two-stage transfer learning used in this paper has an obvious effect.

Because the underwater robot must perform identification in real time, recognition speed is also a focus. The FPS values of the different models are listed in Table 4, which indicates that the algorithm proposed in this paper is lighter.

Figure 13 shows the results of the two pool tests. The recognition algorithm presented in this paper performs well under different degrees of imaging blur and color degradation.

The ablation test mainly compares the proposed network with other recognition networks such as SSD and the Fast R-CNN series. It also compares our method with the YOLOv4-tiny network and with YOLOv4-tiny plus the attention module CBAM. The experimental results show that the proposed target recognition network outperforms both the traditional target recognition frameworks and the improved target recognition network on all four metrics in Table 5. Finally, the algorithm is compared with the classical method YOLOv4. Leaving speed aside, the overall performance of SX-DCANet is not as good as that of YOLOv4, although SX-DCANet still performs better than YOLOv4 in some part recognition tasks. SX-DCANet is faster than traditional algorithms, with an average improvement of 1.31% in precision and 1.58% in mAP. Figure 14 shows how the evaluation function values vary with the scoring threshold for the parts studied in this paper. Table 5 lists the average values of the evaluation functions for the different target recognition models over all recognition objects.

5. Discussion

5.1. Field Experiment/Generalization Validation

The field test at a hydro-electric station was carried out the following year at the Gongzui reservoir, Sichuan Province. A photo of the field test is shown in Figure 15. Because of the large number of sand mining ships upstream, the reservoir water is very turbid, which makes it a representative setting for verifying the generalization of the algorithm proposed in this paper. Owing to site constraints, we replaced the workpiece with yellow-painted pillars, which in this test serve as base points for calibrating the combined navigation coordinates for ROV operations. Using the proposed scheme, our algorithm effectively identifies industrial objects in turbid water, which verifies its generalization experimentally.

5.2. Future Research Directions

The underwater target recognition algorithm proposed in this paper has been verified to be effective in the experimental pools and at the dam; that is, the lightweight and efficient recognition network can meet the recognition needs of underwater robots. Two-stage transfer learning can effectively compensate for the shortage of labeled samples caused by the difficulty of acquiring underwater datasets. However, the approach can still be improved in certain directions. In the target recognition stage, deep learning relies on a large amount of data for continuous training to achieve higher recognition accuracy; if the prior dataset for transfer learning cannot be obtained, weakly supervised learning will become the focus of subsequent research. Although many field tests have been carried out, their conditions still differ from the deep-water environment. In practical applications, the physical imaging model in deep water will seriously affect the quality of the collected images, making the actual target detection results unsatisfactory. These deficiencies will be addressed in future work.

6. Conclusions

Thus far, no recognition algorithm for subsea X-tree parts has been proposed, and this work is the first underwater recognition of these targets. Through multiple experiments and multi-stage transfer learning, this paper addresses the difficulty of acquiring underwater datasets and the low generalization of underwater recognition algorithms caused by environmental distortion. A method for identifying the key components of the subsea X-tree based on a deep CNN is proposed. First, ResBlock-D replaces CSPBlock as the middle layer of the feature extraction module to make the network more lightweight. To compensate for the loss in model performance caused by the reduced computation, an ECA attention module is added to the two feature extraction outputs to improve the quality of deep and shallow feature extraction. With these changes, the speed of the proposed algorithm increases by 4.34 fps over the unimproved network, and the maximum mAP increases by 4.2%. Second, K-means is used to cluster the bounding boxes of the training set, with IoU as the distance measure, to automatically generate a set of anchors better suited to the dataset, which effectively improves the detection performance of the network. More importantly, this paper addresses the degradation of feature extraction ability caused by random weight initialization through two-stage deep transfer learning: when training a network to recognize underwater targets for which little data is available, the weights trained on the non-underwater dataset are used as the pre-training model. Combined with the mosaic data augmentation used in pre-processing, this improves training efficiency and the feature extraction ability of the network in such recognition tasks. Finally, the validity and capability of the proposed method are verified by non-pool and pool tests, which show that the method is superior to the existing target recognition models in accuracy and speed.
This model will be used to assist ROVs in detecting underwater targets, promoting intelligent and independent operation and maintenance of subsea X-trees, and providing an accurate relative coordinate system for the VR system in operation.
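The anchor generation step summarized above can be sketched as follows. This is a minimal illustration, assuming boxes are reduced to (width, height) pairs aligned at a common corner, that 1 − IoU serves as the distance, and that each cluster center is updated to the mean of its members (the median is another common choice). The function names, the toy boxes, and the value k = 2 are illustrative and not taken from the paper.

```python
import random
from typing import List, Tuple

Box = Tuple[float, float]  # (width, height)


def iou_wh(a: Box, b: Box) -> float:
    """IoU of two boxes that share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes: List[Box], k: int,
                   iters: int = 100, seed: int = 0) -> List[Box]:
    """Cluster box sizes into k anchors using 1 - IoU as the distance."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        # Assign each box to the center with the highest IoU.
        groups: List[List[Box]] = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, centers[i]))
            groups[best].append(box)
        # Move each center to the mean width and height of its group.
        updated = []
        for i, group in enumerate(groups):
            if group:
                updated.append((sum(w for w, _ in group) / len(group),
                                sum(h for _, h in group) / len(group)))
            else:
                updated.append(centers[i])  # keep the center of an empty cluster
        if updated == centers:
            break
        centers = updated
    # Sort by area so that small anchors come first.
    return sorted(centers, key=lambda c: c[0] * c[1])


# Toy box sizes in pixels: three small parts and three large parts.
toy_boxes = [(10, 12), (11, 13), (12, 11), (100, 90), (110, 95), (105, 100)]
anchors = kmeans_anchors(toy_boxes, k=2)
print(anchors)
```

For a two-scale detector such as YOLOv4-tiny, the same routine would typically be called with a larger k so that each output scale receives its own subset of anchors.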

Author Contributions

Conceptualization, W.Z. and F.H.; methodology, W.Z.; software, Z.S. and X.Q.; validation, W.Z., F.H. and J.Z.; formal analysis, Z.S.; investigation, Z.S.; resources, F.H.; data curation, F.H.; writing—original draft preparation, W.Z.; writing—review and editing, Z.S.; visualization, Y.Z.; supervision, F.H.; project administration, F.H.; funding acquisition, F.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Heilongjiang Province of China, grant number LH2021E047.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. VR simulation of X-tree operation and introduction of key parts. (a) The main body of the X-tree, which is the control channel and monitoring equipment for the production fluid, contains a large number of connectors or buttons. (b) Subsea flexible pipeline component.

Figure 2. Structure diagram of YOLOv4-tiny.

Figure 3. Structure diagram of Efficient Channel Attention Module.

Figure 4. Structure diagram of CSPBlock and ResBlock-D.

Figure 5. Schematic diagram of algorithm network structure proposed in this paper.

Figure 6. Flowchart of prediction result decoding.

Figure 7. Model diagram of two-stage deep transfer learning.

Figure 8. Flowchart of mosaic image data enhancement algorithm. (Four randomly selected images 1, 2, 3 and 4 were cropped and then spliced into a training sample with set side length).

Figure 9. Introduction to the robot structure used in the pool test.

Figure 10. Recognition task application background: used to assist localization in virtual simulation of underwater VR systems.

Figure 11. Values of the different evaluation functions as the training epoch increases.

Figure 12. Pre-training model dataset recognition results: simple identification and testing in the office.

Figure 13. Pool test carried out at different depths in different pools to test the reliability of the algorithms presented in this paper. (a) Depth within 0–1 m, color has changed dramatically and some images have deteriorated. (b) Depth within 1.5–4 m, severe color and pixel attenuation makes it difficult to distinguish the type of part.

Figure 14. Columns 1–3: Different scoring thresholds correspond to different evaluation function values. Column 4: Recall corresponds to Precision. Identification target for each row (from top to bottom): bend limiter, electric connector, ROV control panel knob, set down position, terminal connector, umbilical cable terminal connector, and valve operating handle.

Figure 15. Field test verifying the feasibility and generalization of the proposed scheme in the turbid water of a hydropower station.

Table 1. Description of experimental ROV parameters.

ROV	Parameters
Size	450 mm × 340 mm × 280 mm
Weight	8 kg
Thrusters	DC brushless motor × 8
Thruster thrust	Forward push 30 N, reverse 20 N
Maximum speed	2 kn
Camera angle	140° Front view; 70° Right view
Camera/Minimum illumination	720p binocular × 1, 1080p monocular × 3/0.01 lux
Operating voltage/rated power	24 V/2000 W
Communication method	Zero buoyancy cable, 100 m
Lighting method	High-brightness LED × 4

Table 2. Hardware configuration.

Software and Hardware Name	Specific Model
CPU	Intel(R) Core(TM) i9-8950HK @ 2.90 GHz (12 CPUs)
RAM	32.0 GB RAM
Graphics Card	NVIDIA GeForce GTX 1080
System	Windows 10
Framework	PyTorch (GPU)
CUDA Version	9.0
Python version	3.6.5

Table 3. Training configuration and hyperparameters.

Training Parameter	Value
Input size	[512, 512]
Number of classes	7
Anchors mask	[[3, 4, 5], [1, 2, 3]]
	Freeze	Unfreeze
Init epoch	0	0
Interval epoch	50	100
Learning rate	0.0001	0.00001
Batch size	2	2
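The two-phase schedule in Table 3 can be sketched as a small helper. It assumes that the backbone stays frozen until epoch 50 and is trained without freezing until epoch 100, and that cosine annealing with warm restarts is applied within each phase. The restart period of 5 epochs and the minimum learning rate ratio of 0.1 are assumptions made for illustration; the paper does not give them here.

```python
import math

FREEZE_EPOCHS = 50    # Table 3: frozen phase ends at epoch 50
TOTAL_EPOCHS = 100    # Table 3: unfrozen phase ends at epoch 100
FREEZE_LR = 1e-4      # Table 3: learning rate in the frozen phase
UNFREEZE_LR = 1e-5    # Table 3: learning rate in the unfrozen phase


def cosine_lr(base_lr: float, step: int, period: int, min_lr: float) -> float:
    """Cosine annealing from base_lr towards min_lr over `period` steps."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * step / period))


def schedule(epoch: int, period: int = 5, min_ratio: float = 0.1):
    """Return (backbone_frozen, learning_rate) for the given epoch.

    `period` and `min_ratio` are assumed values, not taken from the paper.
    """
    frozen = epoch < FREEZE_EPOCHS
    base = FREEZE_LR if frozen else UNFREEZE_LR
    start = 0 if frozen else FREEZE_EPOCHS
    step = (epoch - start) % period  # warm restart every `period` epochs
    return frozen, cosine_lr(base, step, period, base * min_ratio)


for epoch in (0, 4, 5, 49, 50, 99):
    frozen, lr = schedule(epoch)
    print(f"epoch {epoch:3d}: frozen={frozen}, lr={lr:.2e}")
```

Freezing the backbone first lets the detection head adapt to the new data at a higher learning rate before the whole network is fine-tuned at a lower one.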

Table 4. Comparison of the speed of different models.

Algorithm Name	FPS
YOLOv4	22.03
YOLOv4-tiny	40.53
SX-DCANet	44.87
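To relate the frame rates in Table 4 to the real-time requirement, the following short sketch converts each FPS value into a per-frame processing time and computes the gain over YOLOv4-tiny. The FPS values are copied from Table 4; the helper names are illustrative.

```python
# FPS values copied from Table 4.
FPS = {"YOLOv4": 22.03, "YOLOv4-tiny": 40.53, "SX-DCANet": 44.87}


def latency_ms(fps: float) -> float:
    """Average processing time per frame in milliseconds."""
    return 1000.0 / fps


def fps_gain(model: str, baseline: str) -> float:
    """Absolute frame-rate difference between two models."""
    return FPS[model] - FPS[baseline]


for name, fps in FPS.items():
    print(f"{name}: {fps:.2f} fps, {latency_ms(fps):.1f} ms per frame")

print(f"Gain over YOLOv4-tiny: {fps_gain('SX-DCANet', 'YOLOv4-tiny'):.2f} fps")
```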

Table 5. Comparison of different evaluation functions under different models.

	SSD	Fast R-CNN	YOLOv4-tiny	YOLOv4-tiny + CBAM	YOLOv4	SX-DCANet
mAP (%)	91.681	93.966	94.953	95.291	96.177	95.893
Recall (%)	87.569	89.877	90.697	90.733	92.449	92.296
Precision (%)	91.279	95.296	96.371	96.587	97.194	96.597
F1-score (%)	89.386	92.507	93.448	93.569	94.762	94.398
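The F1-score row of Table 5 can be reproduced from the precision and recall rows, assuming the standard definition of F1 as the harmonic mean of the two. The sketch below recomputes it for every model; the numbers are copied from Table 5 and the variable names are illustrative.

```python
# (precision %, recall %, reported F1 %) copied from Table 5.
TABLE5 = {
    "SSD": (91.279, 87.569, 89.386),
    "Fast R-CNN": (95.296, 89.877, 92.507),
    "YOLOv4-tiny": (96.371, 90.697, 93.448),
    "YOLOv4-tiny + CBAM": (96.587, 90.733, 93.569),
    "YOLOv4": (97.194, 92.449, 94.762),
    "SX-DCANet": (96.597, 92.296, 94.398),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_score(p, r) for name, (p, r, _) in TABLE5.items()}

for name, (p, r, reported) in TABLE5.items():
    print(f"{name}: recomputed F1 = {recomputed[name]:.3f}, reported = {reported:.3f}")
```

The recomputed values agree with the reported row to within rounding, which suggests the table uses this definition.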

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

