Article

GC-YOLOv5s: A Lightweight Detector for UAV Road Crack Detection

School of Automation and Electrical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(19), 11030; https://doi.org/10.3390/app131911030
Submission received: 25 July 2023 / Revised: 20 September 2023 / Accepted: 26 September 2023 / Published: 7 October 2023
(This article belongs to the Special Issue Deep Learning in Drone Detection)

Abstract

This study proposes GC-YOLOv5s, a UAV road-crack-detection network designed to overcome several shortcomings of classic crack-detection methods on complicated traffic routes: low efficiency, low detection accuracy caused by shadows, occlusions, and low contrast, and interference from road noise. A Focal-GIOU loss function with a focal term is introduced to address the imbalance between difficult and easy samples in crack images, and the original CIOU localization loss is replaced by a GIOU loss better suited to irregular targets such as cracks. To improve the model's feature representation, a Transposed Convolution layer replaces the original model's upsampling layer. Exploiting the computational efficiency of the Ghost module, the C3Ghost module reduces the number of network parameters while maintaining adequate feature representation. Additionally, a lightweight module, CSPCM, is designed around the Conmix module and the Ghost concept, which further reduces the model parameters and shrinks the model volume while retaining sufficient detection accuracy, satisfying the UAV requirements for small and fast models. To validate the model's performance, this study builds a new UAV road-crack-detection dataset (named UMSC) and conducts extensive trials. In summary, GC-YOLOv5s improves Precision, mAP_0.5, and mAP_0.5:0.95 by 8.2%, 2.8%, and 3.1%, respectively, and reduces the model parameters by 16.2% in comparison to YOLOv5s. Furthermore, it outperforms previous YOLO comparison models in Precision, Recall, mAP_0.5, mAP_0.5:0.95, and Params.

1. Introduction

As one of the most serious problems on roadways, road cracks have a major influence on traffic safety and road reliability. Under long-term rainwater infiltration, road cracks spread, and U-shaped pavement bands form in the passing-lane region [1]. These defects pose a hazard to vehicles and significantly shorten a road's service life: highways designed to serve for 15 years must be repaired once cracks appear after only a few years of use, causing great economic losses. Traditional road-crack detection relies on manual examination, which is time-consuming, labor-intensive, and inefficient [2]. Furthermore, when dealing with large-area road-network detection and the influence of complex road elements, traditional methods usually fail to satisfy the demands for real-time operation and comprehensive coverage. Consequently, it is critical to look for a novel approach that can improve detection efficiency, accuracy, and coverage.
Nowadays, UAVs (Unmanned Aerial Vehicles) have become more and more popular in road detection and monitoring thanks to the rapid development of drone technology [3]. Compared with traditional manual detection and mapping-vehicle detection, UAVs have unique advantages. UAVs can fly above the road to collect wide-field-of-view, multi-angle image data, supplying more comprehensive and detailed information for the detection of road cracks. Furthermore, drones can be deployed rapidly and operated flexibly, allowing detection under various time and weather conditions and thereby enhancing the real-time capability and flexibility of inspections. These advances in UAV platforms have been accompanied by continuous innovation in road-detection algorithms. Gupta et al. [4] initiated a study on UAV road monitoring to address traffic accidents and congestion caused by a surge in traffic flow; their YOLOv4 approach automatically detects traffic components from the perspective of a UAV and overcomes the sample-imbalance problem by using a multimodal UAV dataset. For the challenge of small-target recognition in UAV-recorded images, Chen et al. [5] designed DW-YOLO (deeper and wider YOLO), a YOLOv5-based model that optimizes the residual blocks and increases the number of convolutional kernels so that the network can learn more complex high-dimensional features; experiments were carried out on the open UAV dataset HDrone (Hills Drone). Sun et al. [6] proposed R4Det (refined single-stage detector) to handle multi-scale rotating objects in high-altitude aerial views with extreme aspect ratios, dense distributions, and badly unbalanced classes; the model builds on a single-stage detector with feature recursion and refinement and an instance-balanced focal loss, and it achieved the highest accuracy on the publicly available datasets DOTA (Dataset for Object Detection in Aerial Images) and HRSC2016 (High-Resolution Ships in Complex Scenes). Ma et al. [7] introduced O2DETR, a Transformer-based oriented-object-detection model that uses depthwise separable convolution to minimize model parameters and achieves performance equivalent to the SOTA (State Of The Art) on the DOTA dataset. Ding et al. [8] proposed a lightweight RoI Transformer to resolve the inconsistency between target-classification confidence and localization accuracy in highly complex overhead views, and implemented it in an RCNN (Region-based Convolutional Neural Network) to demonstrate the module's effectiveness on the public datasets DOTA and HRSC2016. HemaMalini et al. [9] used a UAV to photograph road potholes, assessed the images for the presence of water, and created a pothole dataset; they then trained a YOLOv3 model on these data and attained an accuracy of 85%. Wang et al. [10] proposed Mask OBB, a semantic-attention-based mask-oriented bounding-box representation for multi-category target detection in overhead images, which applies a semantic attention network (SAN) to handle the huge scale variations of objects in high-altitude imagery and efficiently distinguishes objects of interest from cluttered backgrounds. This method achieved the best results on two public datasets.
Although UAV detection algorithms are developing rapidly, existing crack-detection models are too simple to account for complex environmental factors. In addition, the experimental scenarios studied so far are limited, and widely distributed, scenario-rich datasets are lacking. Consequently, this article focuses on detecting cracks in asphalt traffic pavement: DJI UAVs are employed to capture crack images under a variety of traffic roadway conditions, and the UMSC (Unmanned Aerial Vehicle Multiple Scene Cracks) UAV crack-detection dataset is produced. A GC-YOLOv5s (Ghost ConMix-YOLOv5s) model is then proposed, which uses the Ghost module to replace the original convolution block and reduce model parameters. At the same time, the proposed model maintains high accuracy and improves feature representation by replacing the original upsampling operation with transposed convolution. The CSPCM (Cross Stage Partial ConMix) module is proposed to improve context awareness, and the Focal-GIOU (Focal-Generalized Intersection over Union) loss function is used to address the imbalance between hard and easy samples in crack images. The model's lightness and validity are validated through extensive experiments on our self-collected dataset.

2. Related Work

2.1. Basic YOLOv5 Model

Glenn Jocher's YOLOv5 model [11] is a regression-based target-detection model released in 2020, built on earlier target-detection models such as YOLOv3 [12] and YOLOv4 [13]. Compared with prior models, YOLOv5 enhances detection accuracy while preserving detection speed. YOLOv5 comes in four variants: YOLOv5s [14], YOLOv5m [15], YOLOv5l [16], and YOLOv5x [17], which differ mainly in the depth and width of the network. Detection accuracy grows as the model parameters increase; however, detection speed decreases noticeably. Because UAV hardware is limited and speed requirements are high, constraints on model size matter in real applications such as traffic and road detection. Consequently, given the original YOLOv5 model's strong detection performance, this work selects the YOLOv5s model as the experimental object.
As shown in Figure 1, the YOLOv5s model primarily consists of three components: the backbone network, the neck network, and the head detection network. The backbone network is a convolutional neural network used primarily for feature extraction from image data, generating feature maps at different scales. It includes modules such as the Focus module, Conv module, C3 module, and Spatial Pyramid Pooling (SPP) module [18]; the structure of each module is depicted in Figure 1. The neck network adopts the structures of the Feature Pyramid Network (FPN) [19] and Path Aggregation Network (PAN) [20]; these structures enable the model to better capture object features at different scales, enhancing the accuracy and performance of object detection. Finally, the generated feature maps are fed into the head detection network, which applies techniques such as anchor boxes to the input feature maps and produces the object-detection results, including the type, location, and confidence of detected objects.

2.2. Focus Module

The Focus module is incorporated in YOLOv5 to preprocess the input image and extract richer features for later use in the backbone network. The module slices the input image, sampling every other pixel in a manner similar to neighborhood downsampling. Doing so improves the diversity and expressiveness of the features, yielding a more accurate and discriminative feature representation for the subsequent detection task.
The slicing operation, as shown in Figure 2, produces four complementary, equally sized images while retaining all the information of the original image. Its goal is to condense width (W) and height (H) information into the channel dimension: the original three-channel RGB image is expanded into a new 12-channel image by stitching the four slices together. A convolution applied to this new image then yields a two-fold downsampled feature map with no information loss.
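A minimal PyTorch sketch of this slicing operation is given below. The module and argument names are illustrative rather than the exact YOLOv5 source, but the slicing itself follows the description above: four pixel-offset sub-images are stacked along the channel axis and fused with a convolution.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four offset sub-images, concatenate them on the
    channel axis, and fuse with a convolution (names are illustrative)."""
    def __init__(self, c_in=3, c_out=32, k=3):
        super().__init__()
        # After slicing, the convolution sees 4 * c_in channels (3 -> 12).
        self.conv = nn.Conv2d(4 * c_in, c_out, k, stride=1, padding=k // 2)

    def forward(self, x):
        # Every second pixel with offsets (0,0), (1,0), (0,1), (1,1):
        # four complementary half-resolution images, no pixels discarded.
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(sliced)

x = torch.randn(1, 3, 640, 640)
print(Focus()(x).shape)  # torch.Size([1, 32, 320, 320]): 2x downsampled
```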
The Focus operation changes the model's parameter count and computational volume. The model volume (Flops) and the number of parameters (Params) are defined as follows:
$Params = K_h \times K_w \times C_{in} \times C_{out}$ (1)
$Flops = Params \times W \times H$ (2)
where $Params$ is the number of parameters; $K_w$ and $K_h$ are the width and height of the convolution kernel, respectively; $C_{in}$ and $C_{out}$ are the number of channels after slicing and the number of channels output by the Focus module, respectively; $Flops$ is the model volume; and $W$ and $H$ are the width and height of the output feature map, respectively. As Equations (1) and (2) show, the slicing procedure of the Focus module quadruples the number of parameters compared with the original convolutional module, and the model volume also increases significantly. Because the GC-YOLOv5s model aims for quick calculation and precise detection, it employs a plain convolution module rather than the Focus module.
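The quadrupling can be checked directly from Equations (1) and (2); the short helper below is a sketch, with layer sizes chosen for illustration only.

```python
def conv_params(k_h, k_w, c_in, c_out):
    # Equation (1): Params = K_h x K_w x C_in x C_out (bias ignored)
    return k_h * k_w * c_in * c_out

def conv_flops(params, w, h):
    # Equation (2): Flops = Params x W x H
    return params * w * h

# A 3x3 conv on the raw 3-channel image vs. the same conv after Focus
# slicing, where the input has 12 channels: the parameter count quadruples.
p_plain = conv_params(3, 3, 3, 32)    # 864
p_sliced = conv_params(3, 3, 12, 32)  # 3456 = 4 * 864
print(p_plain, p_sliced, conv_flops(p_sliced, 320, 320))
```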

2.3. SPP Module

The SPP module (Spatial Pyramid Pooling) and the SPPF module (Spatial Pyramid Pooling-Fast) are both modules for extracting multi-scale features and are commonly used in target detection and image classification tasks.
The structure of the SPP module is shown in Figure 3a. It gathers information from distinct receptive fields by pooling the input feature map at several scales, employing pooling kernels of different sizes and generating a fixed-size feature map at each scale. The resulting fixed-size feature maps are concatenated into a feature representation with multi-scale information, and the concatenated maps are then passed through convolution and activation operations to produce the final output features. The structure of the SPPF module is shown in Figure 3b; like the SPP module, it gathers information at multiple scales via pooling. The sole difference is that the SPPF module uses pooling kernels of a single size, with the output of each pooling operation becoming the input of the next. This decreases the amount of computation and increases the model's running speed. As a result, the faster SPPF module replaces the original SPP module in the GC-YOLOv5s model.
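The following PyTorch sketch shows the SPPF design in code form; channel sizes and the 5 × 5 kernel are common choices assumed here, not values quoted from the paper.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Sequential pooling: three chained 5x5 max-pools reproduce the receptive
    fields of SPP's parallel 5/9/13 pools at a lower computational cost."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hid, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(4 * c_hid, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)   # one 5x5 pool
        y2 = self.pool(y1)  # equivalent receptive field of a 9x9 pool
        y3 = self.pool(y2)  # equivalent receptive field of a 13x13 pool
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

print(SPPF(256, 256)(torch.randn(1, 256, 20, 20)).shape)  # (1, 256, 20, 20)
```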

2.4. Loss Function

The YOLOv5 loss is made up of three major components: classification loss, confidence loss, and localization loss. Classification loss and confidence loss both use BCE (Binary Cross-Entropy) losses, while the localization loss uses the CIoU (Complete Intersection over Union) loss [21]. The overall loss is given in Equation (3):
$Loss = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3 L_{loc}$ (3)
where $Loss$ is the total loss function; $L_{cls}$, $L_{obj}$, and $L_{loc}$ are the classification, confidence, and localization losses, respectively; and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the balancing coefficients of the three losses.
Equation (4) gives the formula for the localization loss:
$L_{loc} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$ (4)
where $b$ and $b^{gt}$ denote the centroids of the prediction box and the ground-truth box, respectively; $\rho$ denotes the Euclidean distance between the two centroids; and $c$ denotes the diagonal length of the minimum enclosing region of the prediction box and the ground-truth box. $gt$ abbreviates Ground Truth (the bounding boxes of detected objects annotated manually or by experts), and $IOU$ denotes the intersection-over-union ratio between the prediction box and the ground-truth box. $\alpha$ is a weight parameter whose expression is shown in Equation (5), and $v$ measures the consistency of the aspect ratio, as shown in Equation (6):
$\alpha = \frac{v}{(1 - IOU) + v}$ (5)
$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ (6)
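For reference, Equations (4)–(6) can be transcribed into a small PyTorch function as below; this is a sketch of the CIoU term only (boxes in corner format), not the full YOLOv5 loss code.

```python
import math
import torch

def ciou_loss(b_pred, b_gt, eps=1e-9):
    """Transcription of Equations (4)-(6) for two boxes in (x1, y1, x2, y2)
    corner format; a sketch of the CIoU localization term only."""
    # IOU: intersection over union
    iw = (torch.min(b_pred[2], b_gt[2]) - torch.max(b_pred[0], b_gt[0])).clamp(min=0)
    ih = (torch.min(b_pred[3], b_gt[3]) - torch.max(b_pred[1], b_gt[1])).clamp(min=0)
    inter = iw * ih
    area_p = (b_pred[2] - b_pred[0]) * (b_pred[3] - b_pred[1])
    area_g = (b_gt[2] - b_gt[0]) * (b_gt[3] - b_gt[1])
    iou = inter / (area_p + area_g - inter + eps)
    # rho^2: squared distance between the two box centroids
    rho2 = ((b_pred[0] + b_pred[2]) - (b_gt[0] + b_gt[2])) ** 2 / 4 + \
           ((b_pred[1] + b_pred[3]) - (b_gt[1] + b_gt[3])) ** 2 / 4
    # c^2: squared diagonal of the minimum box enclosing both boxes
    cw = torch.max(b_pred[2], b_gt[2]) - torch.min(b_pred[0], b_gt[0])
    ch = torch.max(b_pred[3], b_gt[3]) - torch.min(b_pred[1], b_gt[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # v (Equation (6)): aspect-ratio consistency; alpha (Equation (5)): its weight
    w_p, h_p = b_pred[2] - b_pred[0], b_pred[3] - b_pred[1]
    w_g, h_g = b_gt[2] - b_gt[0], b_gt[3] - b_gt[1]
    v = (4 / math.pi ** 2) * (torch.atan(w_g / h_g) - torch.atan(w_p / h_p)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v  # Equation (4)

print(ciou_loss(torch.tensor([1., 1., 3., 3.]), torch.tensor([0., 0., 2., 2.])))
```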

3. Model Modification

3.1. GC-YOLOv5s

In the actual UAV crack-detection process, crack data suffer from poor crack continuity, low contrast, low target pixel values, and an extreme imbalance between foreground and background pixels. Furthermore, the detection model must compute quickly and deploy easily under UAV conditions, so we use the YOLOv5s model as the base model. To reduce computation and improve inference performance, this study replaces the Focus module with the Conv module. The Ghost module, which replaces the Conv module of the backbone network, retains strong feature expression with fewer parameters and enhances crack-detection accuracy. Replacing the upsampling layer with a transposed-convolution module increases the resolution and detail information of the feature maps and enlarges the receptive field, which helps to precisely locate and detect cracks. Finally, the C3 module in the final layer of the Backbone and the Head is replaced by the CSPCM module, which effectively integrates crack features at different scales, provides richer semantic information, and further reduces the model's parameters and size. Figure 4 shows the network structure of the GC-YOLOv5s model.
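The upsampling swap mentioned above can be illustrated in a few lines; the kernel and stride below form one common 2× transposed-convolution configuration, assumed here for illustration rather than taken from the paper's exact layer settings.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 20, 20)

# Original neck: parameter-free nearest-neighbour upsampling.
up = nn.Upsample(scale_factor=2, mode="nearest")

# Replacement: a learnable transposed convolution that doubles the spatial
# resolution while letting the network learn how to fill in detail.
deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)

print(up(x).shape, deconv(x).shape)  # both torch.Size([1, 256, 40, 40])
```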
The model consists of three main components: the backbone network, the neck network, and the head detection network. It is worth noting that GC-YOLOv5s differs from the original YOLOv5s in certain detail modules of the backbone and neck. For instance, in GC-YOLOv5s the Focus module has been removed to reduce parameter computation, and the SPPF module replaces the SPP module to raise the model's operational speed. Most importantly, the Ghost module and the CSPCM module are proposed to replace the original convolution blocks, reducing both the model's parameter count and its overall size. The specific structures of the Ghost module and the CSPCM module are shown in Figure 5a,b.

3.2. Ghost and CSPCM

Ghost Module: Inspired by GhostNet, we make structural improvements to the Backbone of the YOLOv5s model: the Conv module in the original YOLOv5s is replaced with a Ghost module (named GhostCov), and the Ghost module is introduced as the Bottleneck of the original C3 module, with the improved C3 module named C3Ghost. The module structure is shown in Figure 5a.
C3Ghost is built on the concept of GhostBottleneck and improves the feature-extraction capability of the original C3 module. By stacking GhostCov modules several times, it can capture more detailed information in the image and improve the accuracy of crack identification. This is especially important when dealing with poor crack continuity, low contrast, and cracks of multiple directions and morphologies.
In terms of computational resources, the Ghost module provides structural benefits. The ghost feature maps are generated by a cheap linear operation rather than the original convolution, and the intrinsic maps, produced with fewer convolution operations, are combined with the ghost maps through a residual-like structure to form the final output. This architecture makes better use of the model's limited computational resources, increasing its speed and efficiency, which is especially important in UAV road-detection scenarios with limited processing resources and high-performance requirements.
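A compact sketch of the Ghost idea follows; it mirrors the GhostNet formulation (a primary convolution plus a cheap depthwise operation), while the exact GhostCov layer used in GC-YOLOv5s may differ in details such as kernel sizes and activations.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Generate half the output channels with an ordinary convolution (the
    'intrinsic' maps) and the other half with a cheap depthwise operation
    (the 'ghost' maps), then concatenate."""
    def __init__(self, c_in, c_out, k=1, cheap_k=5):
        super().__init__()
        c_hid = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_hid, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_hid), nn.SiLU())
        # Depthwise conv = cheap linear operation applied per channel.
        self.cheap = nn.Sequential(
            nn.Conv2d(c_hid, c_hid, cheap_k, padding=cheap_k // 2,
                      groups=c_hid, bias=False),
            nn.BatchNorm2d(c_hid), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

print(GhostConv(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # (1, 128, 40, 40)
```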
CSPCM Module: Inspired by C3Ghost's success in lowering the number of parameters while retaining detection accuracy, we design another lightweight module that incorporates the concept of residual linkage. The Conmix module is introduced as its major component, and the resulting module is named CSPCM. By combining the characteristics of the Conmix module with the benefits of the residual connection, CSPCM successfully decreases the number of parameters while maintaining detection accuracy.
As shown in Figure 5b, the CSPCM module realizes feature cascading and information transmission by cascading the branch and backbone paths. Specifically, the Conmix module inside CSPCM uses grouped convolution, a residual connection, and 1 × 1 convolutions for nonlinear enhancement and feature-channel tuning. This design extracts rich features while reducing feature dimensionality, cutting the number of parameters while preserving detection accuracy.
The CSPCM module thus combines the idea of C3Ghost with the benefits of the Conmix module, simultaneously lowering the number of parameters while maintaining detection accuracy in the target-detection task. The benefit of this design is that it conserves the model's storage and computational resources while retaining sufficient expressive capability, resulting in greater efficiency and performance in practical applications.
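Since Figure 5b is not reproduced in code form in the paper, the sketch below is a hypothetical reading of the description above: a ConMix block built from grouped convolution, a residual connection, and 1 × 1 channel tuning, wrapped in a CSP-style two-branch structure. All layer sizes and the exact layout are assumptions.

```python
import torch
import torch.nn as nn

class ConMix(nn.Module):
    """Hypothetical ConMix block: grouped 3x3 conv with a residual connection,
    followed by a 1x1 conv for nonlinear enhancement and channel tuning."""
    def __init__(self, c, groups=4):
        super().__init__()
        self.grouped = nn.Conv2d(c, c, 3, padding=1, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(c)
        self.pw = nn.Conv2d(c, c, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        y = self.act(self.bn(self.grouped(x)) + x)  # residual connection
        return self.act(self.pw(y))                 # 1x1 channel tuning

class CSPCM(nn.Module):
    """CSP-style wrapper: split channels into a ConMix branch and a shortcut
    branch, then fuse them with a 1x1 convolution (cascade operation)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hid, 1)
        self.cv2 = nn.Conv2d(c_in, c_hid, 1)
        self.mix = ConMix(c_hid)
        self.fuse = nn.Conv2d(2 * c_hid, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.mix(self.cv1(x)), self.cv2(x)], dim=1))

print(CSPCM(128, 128)(torch.randn(1, 128, 40, 40)).shape)  # (1, 128, 40, 40)
```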

3.3. Focal-GIOU

Compared with the classic IOU (Intersection over Union) computation used in the original YOLOv5s, GIOU incorporates the minimum enclosing box to tackle the problem of the loss equaling zero when the detection box and the ground-truth box do not overlap. As a result, the GIOU loss function is more stable during training and more robust to incomplete target detection, lowering sensitivity to background noise and false detections. This is especially crucial for UAV crack-detection tasks, where background noise and interference abound. Additionally, the GIOU loss function provides more accurate detection results for irregularly shaped targets (e.g., cracks): GIOU takes the geometry of the targets into account and measures their similarity through the difference between the intersection and union of their bounding boxes. Using the GIOU loss function therefore adapts better to changes in target shape and scale and improves the accuracy of crack detection.
Focal Loss efficiently deals with severe category imbalance in the dataset; in the crack-identification task, samples from normal regions typically far outnumber samples from cracked regions. By introducing Focal Loss, the model can be made to pay more attention to the difficult-to-classify crack regions and to reduce overfitting to normal regions, improving its crack-detection performance.
Equations (7)–(10) show the Focal-GIOU formulas:
$IOU = \frac{|A \cap B|}{|A \cup B|}$ (7)
$GIOU = IOU - \frac{|C \setminus (A \cup B)|}{|C|}$ (8)
$L_{GIOU} = 1 - GIOU$ (9)
$L_{Focal\text{-}GIOU} = IOU^{\gamma} L_{GIOU}$ (10)
where $A$ is the ground-truth box, $B$ is the prediction box, and $C$ is the smallest enclosing box containing $A$ and $B$. $IOU$ is the Intersection over Union between the ground truth and the predicted bounding box, $GIOU$ is the corresponding metric for the ground truth and the prediction box, $L_{GIOU}$ is the loss value, $L_{Focal\text{-}GIOU}$ is the loss value with the Focal Loss introduced, and $\gamma$ is the hyper-parameter controlling the curvature of the weighting curve.
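Equations (7)–(10) translate almost line-for-line into code; the sketch below assumes corner-format boxes and an illustrative value of γ.

```python
import torch

def focal_giou_loss(b_gt, b_pred, gamma=0.5, eps=1e-9):
    """Direct transcription of Equations (7)-(10) for boxes A (ground truth)
    and B (prediction) in (x1, y1, x2, y2) corner format. The value of gamma
    is illustrative; the paper treats it as a tunable hyper-parameter."""
    iw = (torch.min(b_gt[2], b_pred[2]) - torch.max(b_gt[0], b_pred[0])).clamp(min=0)
    ih = (torch.min(b_gt[3], b_pred[3]) - torch.max(b_gt[1], b_pred[1])).clamp(min=0)
    inter = iw * ih
    area_a = (b_gt[2] - b_gt[0]) * (b_gt[3] - b_gt[1])
    area_b = (b_pred[2] - b_pred[0]) * (b_pred[3] - b_pred[1])
    union = area_a + area_b - inter
    iou = inter / (union + eps)                     # Equation (7)
    # C: the smallest box enclosing both A and B
    cw = torch.max(b_gt[2], b_pred[2]) - torch.min(b_gt[0], b_pred[0])
    ch = torch.max(b_gt[3], b_pred[3]) - torch.min(b_gt[1], b_pred[1])
    area_c = cw * ch
    giou = iou - (area_c - union) / (area_c + eps)  # Equation (8)
    l_giou = 1 - giou                               # Equation (9)
    return iou ** gamma * l_giou                    # Equation (10)

print(focal_giou_loss(torch.tensor([0., 0., 2., 2.]),
                      torch.tensor([1., 1., 3., 3.])))
```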

4. Experiments

4.1. Algorithm Structure

As shown in Figure 6, the GC-YOLOv5s workflow consists of three main stages: image preprocessing, image feature extraction, and image post-processing. Figure 6a presents the preprocessing stage, which includes image resizing, data augmentation, and image normalization. After preprocessing, the model performs feature extraction and region prediction, as depicted in Figure 6b. Finally, the predictions pass through post-processing steps, including non-maximum suppression and threshold filtering, as illustrated in Figure 6c; the results are then rendered onto the images to produce the final detection output.
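As a concrete illustration of the post-processing stage in Figure 6c, the sketch below applies threshold filtering followed by non-maximum suppression using torchvision; the threshold values are illustrative defaults, not the settings used in the experiments.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, conf_thres=0.25, iou_thres=0.45):
    """boxes: (N, 4) corner-format tensor; scores: (N,) confidence tensor."""
    keep = scores > conf_thres           # threshold filtering
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thres)  # non-maximum suppression
    return boxes[idx], scores[idx]
```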

4.2. UMSC

The dataset used in this paper is derived from real-world photographs of road cracks. The images were captured with a DJI Mavic 3 UAV equipped with a 4/3 CMOS Hasselblad camera at a resolution of 20 megapixels. A low flying-height range (15–30 m) was used, and a total of 203 road-crack photos with a resolution of 5280 × 3956 were captured. A variety of real-world road scenes were chosen as shooting conditions, including urban roads, national arterial highways, and roads inside a university campus.
Shadows, occlusions, low contrast, and heavy road noise are all present as challenges and confounding factors in the collected road photos. Furthermore, the photos cover a wide range of crack morphologies, including horizontal cracks, vertical cracks, and block cracks. However, as this paper is devoted to the detection of road cracks rather than the classification of crack types, all cracks are placed into a single category: cracks.
Because of the large size of the captured images, using them directly as model input would slow training and consume considerable processing resources. To match the application scenario, in which the UAV recognition model should be small and fast, the 203 road-crack photos were cropped into 512 × 512 images, yielding a total of 12,569 tiles. From these, 2056 images containing cracks were selected for detailed labeling, and the labeled dataset was named UMSC.
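A sketch of this tiling step is shown below. How the boundary remainders of the 5280 × 3956 frames were handled is not stated in the paper, so the edge handling here (dropping partial tiles) is an assumption.

```python
from pathlib import Path
from PIL import Image

def tile_image(path, out_dir, tile=512):
    """Crop a large UAV photograph into tile x tile patches; partial tiles
    at the right and bottom edges are dropped (an assumed policy)."""
    img = Image.open(path)
    w, h = img.size
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            patch = img.crop((x, y, x + tile, y + tile))
            patch.save(out_dir / f"{Path(path).stem}_{x}_{y}.png")
```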
These data sources and collection methods ensure that the research is carried out in a realistic and diversified road-crack-detection setting and provide a valuable data foundation for subsequent model training and performance evaluation. The dataset allows a clearer understanding of the challenges and requirements of UAV crack detection and lays a foundation for the creation of more efficient and accurate detection algorithms.

4.3. Experimental Environment

Table 1 lists the training environment used for the GC-YOLOv5s model in this experiment.

4.4. Evaluation Metrics

In this paper, six evaluation metrics are used to judge the algorithm's effectiveness: Precision, Recall, mAP_0.5, mAP_0.5:0.95, number of parameters (Params), and model volume (Flops). The first four indexes judge the detection quality of the model, while the number of parameters and the model volume judge its size and operational efficiency. The formulas for these indexes are shown in Equations (11)–(14).
$P = \frac{TP}{TP + FP}$ (11)
$R = \frac{TP}{TP + FN}$ (12)
$AP = \int_0^1 P(R)\,dR$ (13)
$mAP = \frac{\sum_{i=1}^{M} AP_i}{M}$ (14)
where $P$ is Precision, $R$ is Recall, $TP$ is the number of positive samples correctly predicted as positive, $FP$ is the number of negative samples incorrectly predicted as positive, and $FN$ is the number of positive samples incorrectly predicted as negative. $AP$ is the Average Precision, $mAP$ is the average of $AP$ across all categories, and $M$ is the number of categories.
Equation (7) gives the formula for $IOU$; the metric is referred to as mAP_0.5 when the $IOU$ threshold is set to 0.5, and as mAP_0.5:0.95 when it is averaged over $IOU$ thresholds from 0.5 to 0.95. The model parameters ($Params$) and model volume ($Flops$) are calculated as stated in Equations (1) and (2).
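Equation (13) is typically evaluated numerically as the area under the precision–recall curve; the sketch below uses the common all-point interpolation, which is an implementation choice not specified in the paper.

```python
import numpy as np

def average_precision(recall, precision):
    """Approximate Equation (13): area under P(R) with all-point
    interpolation, as used in VOC/COCO-style evaluation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically decreasing, then integrate over recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

r = np.array([0.1, 0.4, 0.8])
p = np.array([1.0, 0.75, 0.6])
print(average_precision(r, p))  # area under the interpolated P(R) curve
```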

4.5. Experimental Results

Ablation tests were conducted to verify the validity of each module in GC-YOLOv5s, and the results are reported in Table 2. Here, GC-YOLOv5s-A replaces CIOU with Focal-GIOU on top of the original YOLOv5s; GC-YOLOv5s-B introduces transposed convolution on top of GC-YOLOv5s-A; GC-YOLOv5s-C introduces the Ghost and C3Ghost modules on top of GC-YOLOv5s-B; and GC-YOLOv5s adds the CSPCM module to GC-YOLOv5s-C.
The results show that the Precision, Recall, mAP_0.5, and mAP_0.5:0.95 of GC-YOLOv5s-A are superior to those of YOLOv5s, which suggests that the Focal-GIOU loss function successfully reduces the sample-imbalance problem in crack images. Although Recall and mAP_0.5:0.95 decreased slightly in GC-YOLOv5s-B compared with GC-YOLOv5s-A, Precision and mAP_0.5 increased significantly, indicating that the introduction of transposed convolution improves the model's ability to express features, at the cost of more model parameters and a larger model volume. After the Ghost module and the C3Ghost module replace the original convolutional layer and C3 module, the Precision, mAP_0.5, and mAP_0.5:0.95 metrics improve greatly compared with GC-YOLOv5s-B, while the number of parameters and the model volume are reduced. This proves that, despite the Ghost module's use of low-cost convolutional operations, shedding redundant features yields a superior feature representation.
The proposed model (GC-YOLOv5s) outperformed GC-YOLOv5s-C in Precision and mAP_0.5:0.95, while Recall and mAP_0.5 decreased slightly, which can be attributed to the variable responsiveness of CSPCM to specific cracks in different environments. Nevertheless, the addition of CSPCM cuts the parameter count by a further 8.2% and the model volume by 0.6, making the model more compact and easier to deploy.
Last but not least, the ablation tests show that inserting the different improvement modules considerably increases the performance of the UAV crack-detection model: Precision, mAP_0.5, and mAP_0.5:0.95 all improved significantly in the final experiment. These findings strongly support the study presented in this work and offer useful insights for future UAV crack-detection development and deployment.
To further evaluate the algorithm's performance, this research conducts a comparative experiment on the UMSC dataset using the same training approach, equipment, and conditions. Six metrics (Precision, Recall, mAP_0.5, mAP_0.5:0.95, Params, and Flops) are used to assess six YOLO models of different sizes and types: GC-YOLOv5s, YOLOv3, YOLOv3-tiny, YOLOX-s, YOLOv7, and YOLOv7-tiny. Table 3 summarizes the evaluation outcomes.
The experimental results show that, compared with YOLOv3, GC-YOLOv5s improves the Precision and Recall metrics by 2.1% and 0.9%, respectively, and the mAP_0.5 and mAP_0.5:0.95 metrics by 4.8% and 5%, while requiring far fewer parameters and a far smaller model volume. GC-YOLOv5s outperforms YOLOv3-tiny by 24.1%, 6%, 19.5%, 21.7%, and 32.2% in Precision, Recall, mAP_0.5, mAP_0.5:0.95, and Params, respectively, at the cost of a 3.8-unit increase in model volume. GC-YOLOv5s outperforms YOLOX-s by 3.5%, 5.4%, and 2.7% in Precision, mAP_0.5, and mAP_0.5:0.95, and by a larger margin in Params, cutting that model's parameters by 27% while lowering model volume by 4.8 units. GC-YOLOv5s outperforms YOLOv7 in all six metrics, with smaller gains in Recall, mAP_0.5, and mAP_0.5:0.95, a larger 7.6% improvement in Precision, and a much lower number of parameters and model volume. Compared with YOLOv7-tiny, GC-YOLOv5s shows improvements of 6.2%, 1.2%, 4.2%, 5.5%, and 2.5% in Precision, Recall, mAP_0.5, mAP_0.5:0.95, and Params; however, the model volume grows by 3.6 units.
In summary, compared with the larger YOLOv3 and YOLOv7 models, GC-YOLOv5s achieves better results on all metrics and has significant advantages in parameter count and model volume. GC-YOLOv5s also outperforms YOLOv3-tiny, YOLOX-s, YOLOv7-tiny, and other models of comparable volume on all detection metrics while using fewer parameters. These findings demonstrate the method's promise in the field of UAV crack detection and provide strong evidence supporting its practical implementation.
To test the detection performance of the proposed algorithm in real scenes, several complex and representative images from the UMSC dataset were chosen for detection; Figure 7 shows the results of this paper's algorithm alongside YOLOv3, YOLOv3-tiny, YOLOX-s, YOLOv7, and YOLOv7-tiny. The first column is the Ground Truth image, and each subsequent column shows the detection results of one method.
As shown in Figure 7a, YOLOv3, YOLOv3-tiny, and YOLOX-s produce false detections due to light interference from objects in the scene, whereas this paper's algorithm, YOLOv7, and YOLOv7-tiny do not; moreover, this paper's algorithm detects the cracks more accurately than YOLOv7 and YOLOv7-tiny. The image in Figure 7b contains several cracks with distinct extension directions as well as substantial noise: YOLOv3 and YOLOv7-tiny miss detections, and YOLOv3-tiny and YOLOX-s produce false detections; only the model in this paper and YOLOv7 detect accurately, and as Figure 7c,d show, this paper's algorithm again performs better than YOLOv7. Figure 7c has the least environmental interference, and none of the algorithms miss or falsely detect cracks. However, for YOLOv3-tiny, YOLOv7, and YOLOv7-tiny, the light reflection in Figure 7d leads to repeated detection bounding boxes. The algorithm in this study outperforms the other YOLO-type techniques in crack-detection accuracy. In conclusion, the GC-YOLOv5s algorithm better overcomes the influence of complex environments and achieves accurate identification of multi-scale cracks.

5. Conclusions

In this work, a deep learning model is applied to the UAV road-crack-detection scenario. The GC-YOLOv5s model is offered as a small and easily deployable UAV crack detector. Based on the original YOLOv5s, the model improves the loss function and backbone network and proposes two new modules, C3Ghost and CSPCM, to replace the original C3 module, improving the network's detection performance while reducing the number of parameters and the model volume, making it more suitable for UAV detection scenarios. The UMSC dataset of UAV-photographed road cracks is then developed; it covers three types of widely distributed roads and includes the UAV detection challenges of shadows, occlusions, low contrast, and heavy road noise. Finally, validation is carried out on this dataset against various algorithms. The results show that, compared with the unimproved model, GC-YOLOv5s improves Precision by 8.2%, mAP_0.5 by 2.8%, and mAP_0.5:0.95 by 3.1% while decreasing the number of parameters by 16.2%. GC-YOLOv5s outperforms the other models in Precision, Recall, mAP_0.5, and mAP_0.5:0.95, and its parameter count is 2.4% to 32% lower than that of other models of similar volume.
The GC-YOLOv5s crack-detection network presented in this study is novel in its improved loss function, optimized network topology, and lightweight module design, yielding more satisfactory results. This crack-detection technology offers significant value to the field of road maintenance and safety. It can be used in automated road maintenance and monitoring systems, greatly reducing the workload of maintenance personnel. Automatic detection and localization of road cracks contributes to the early identification and mitigation of potential road hazards, thereby reducing the risk of traffic accidents. Furthermore, it can be employed in planning road maintenance activities, extending the lifespan of road infrastructure, and saving maintenance costs. From a practical perspective, this technology offers substantial assistance to traffic-management authorities, urban planners, and infrastructure-maintenance companies, enabling them to respond more swiftly to road issues and take timely measures, thereby improving road availability and safety. However, as UAV detection grows broader and more diverse, smaller and more generalizable models must still be researched. To establish a comprehensive UAV road-defect-detection system, we plan to continue researching multi-category defect detection, covering various road issues including cracks. We will also work to reduce the model's size to accommodate the expanding detection categories while enhancing computational efficiency. Additionally, we recognize that real-world applications may be influenced by varying road conditions and environmental factors, which could impact detection performance; therefore, we will further optimize and refine our model for diverse scenarios.

Author Contributions

Conceptualization, Y.Z. and Y.D.; methodology, X.X.; software, H.H.; validation, H.H., S.W. and Y.D.; formal analysis, X.X.; investigation, H.H.; resources, X.X.; data curation, H.H.; writing—original draft preparation, H.H.; writing—review and editing, S.W.; visualization, H.H.; supervision, Y.Z.; project administration, Y.D.; funding acquisition, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Open Foundation of the Key Laboratory of Intelligent Robot for Operation and Maintenance of Zhejiang Province (SZKF-2022-R04), Zhejiang University of Science and Technology 2022 postgraduate research innovation fund projects (2022yjskc06), Zhejiang Provincial Natural Science Foundation (LY19F030004), Zhejiang Provincial Department of Transportation Science and Technology Plan Project (202206).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

We would like to express our gratitude to Cao Guangke for his contributions to our team, including initiating the project and providing financial support. Throughout the project, he offered guidance and supplied the data platform we needed. As a driving force behind the project, he played a crucial role, and we sincerely thank him for his dedication.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. De Long, Z. A brief discussion on the hazards of road cracks and prevention measures. Transp. Sci. Technol. Econ. 2006, 5, 68–70. [Google Scholar]
  2. Li, Y.; Ma, J.; Zhao, Z.; Shi, G. A Novel Approach for UAV Image Crack Detection. Sensors 2022, 9, 3305. [Google Scholar] [CrossRef] [PubMed]
  3. Alkaabi, K.; El Fawair, A.R. Application of A Drone camera in detecting road surface cracks: A UAE testing case study. Arab. World Geogr. 2021, 24, 221–239. [Google Scholar]
  4. Gupta, H.; Verma, O.P. Monitoring and surveillance of urban road traffic using low altitude drone images: A deep learning approach. Multimed. Tools Appl. 2022, 81, 19683–19703. [Google Scholar] [CrossRef]
  5. Chen, Y.; Zheng, W.; Zhao, Y.; Song, T.H.; Shin, H. Dw-yolo: An efficient object detector for drones and self-driving vehicles. Arab. J. Sci. Eng. 2023, 48, 1427–1436. [Google Scholar] [CrossRef]
  6. Sun, P.; Zheng, Y.; Zhou, Z.; Xu, W.; Ren, Q. R4 Det: Refined single-stage detector with feature recursion and refinement for rotating object detection in aerial images. Image Vis. Comput. 2020, 103, 104036. [Google Scholar] [CrossRef]
  7. Ma, T.; Mao, M.; Zheng, H.; Gao, P.; Wang, X.; Han, S.; Doermann, D. Oriented Object Detection with Transformer. arXiv 2021, arXiv:2106.03146. [Google Scholar]
  8. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. Comput. Vis. Pattern Recognit. (CVPR) 2019, 2844–2853. [Google Scholar] [CrossRef]
  9. HemaMalini, B.H.; Padesur, A.; Kumar, M.; Shet, A. Detection of Potholes on Roads using a Drone. EAI Endorsed Trans. Energy Web 2022, 9, e4. [Google Scholar]
  10. Wang, J.; Ding, J.; Guo, H.; Cheng, W.; Pan, T.; Yang, W. Mask OBB: A semantic attention-based mask oriented bounding box representation for multi-category object detection in aerial images. Remote Sens. 2019, 11, 2930. [Google Scholar] [CrossRef]
  11. Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection. IEEE Access 2021, 9, 141861–141875. [Google Scholar] [CrossRef]
  12. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  13. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  14. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. arXiv 2021, arXiv:2108.11539. [Google Scholar]
  15. Huang, M.; Zhang, Y.; Chen, Y. Small Target Detection Model in Aerial Images Based on TCA-YOLOv5m. IEEE Access 2022, 11, 3352–3366. [Google Scholar] [CrossRef]
  16. Nan, Y.; Zhang, H.; Zeng, Y.; Zheng, J.; Ge, Y. Faster and accurate green pepper detection using NSGA-II-based pruned YOLOv5l in the field environment. Comput. Electron. Agric. 2023, 205, 107563. [Google Scholar] [CrossRef]
  17. Zhang, J.L.; Su, W.H.; Zhang, H.Y.; Peng, Y. SE-YOLOv5x: An optimized model based on transfer learning and visual attention mechanism for identifying and localizing weeds and vegetables. Agronomy 2022, 12, 2061. [Google Scholar] [CrossRef]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  19. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. arXiv 2016, arXiv:1612.03144. [Google Scholar]
  20. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  21. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Structure of the YOLOv5s model.
Figure 2. Diagram of Focus slicing operations.
Figure 3. Comparison of the SPP Module and the SPPF Module.
Figure 4. Diagram of the network structure of the GC-YOLOv5s model.
Figure 5. Diagram of the structure of the Ghost module and CSPCM module.
Figure 6. Workflow diagram of the GC-YOLOv5s algorithm.
Figure 7. Diagram of the effectiveness of GC-YOLOv5s and other algorithms.
Table 1. Experimental environment configuration table.

Configuration Name          Configuration Parameters
Operating System            Ubuntu 18.04.6 LTS
CPU                         Intel Xeon(R) Silver 4210 CPU @ 2.20 GHz × 40
GPU                         NVIDIA GeForce RTX 2080 Ti/PCIe/SSE2
Memory                      125.6 GiB
Software                    Anaconda, PyCharm 2021
Deep Learning Framework     PyTorch 1.13.1
GPU Acceleration Library    CUDA 11.6
Table 2. GC-YOLOv5s effect under each module.

UAV Detection Model   Precision   Recall   mAP_0.5   mAP_0.5:0.95   Params      Flops
YOLOv5s               0.687       0.707    0.715     0.415          7,012,822   15.9
GC-YOLOv5s-A          0.708       0.732    0.724     0.419          7,012,822   15.9
GC-YOLOv5s-B          0.734       0.725    0.741     0.41           8,323,926   22.5
GC-YOLOv5s-C          0.764       0.712    0.752     0.436          6,399,182   17.4
GC-YOLOv5s            0.769       0.698    0.743     0.446          5,873,918   16.8
Table 3. Comparative experimental results of various models.

UAV Detection Model   Precision   Recall   mAP_0.5   mAP_0.5:0.95   Params       Flops
YOLOv3                0.748       0.689    0.695     0.396          61,497,430   154.5
YOLOv3-tiny           0.528       0.638    0.548     0.229          8,669,876    13.0
YOLOX-s               0.734       0.698    0.689     0.419          8,040,626    21.6
YOLOv7                0.693       0.694    0.718     0.433          36,479,926   103.2
YOLOv7-tiny           0.707       0.686    0.701     0.391          6,014,988    13.2
GC-YOLOv5s            0.769       0.698    0.743     0.446          5,873,918    16.8