Article

Object Detection Algorithm for Wheeled Mobile Robot Based on an Improved YOLOv4

School of Computer Science and Engineering, Changchun University of Technology, 3000 North Yuanda Avenue Gaoxin North District, Changchun 130000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(9), 4769; https://doi.org/10.3390/app12094769
Submission received: 7 April 2022 / Revised: 5 May 2022 / Accepted: 6 May 2022 / Published: 9 May 2022
(This article belongs to the Special Issue Computer Vision in Mechatronics Technology)

Abstract

In practical applications, increasing the intelligence of wheeled mobile robots is the trend of future development. Object detection for wheeled mobile robots requires not only the recognition of complex surroundings but also the deployment of algorithms on resource-limited devices, and the current state of basic vision technology is insufficient to meet this demand. Motivated by this practical problem, and in order to balance detection accuracy against detection efficiency, we propose an object detection algorithm based on a combination of an improved YOLOv4 and an improved GhostNet. First, the backbone feature extraction network of the original YOLOv4 is replaced with a trimmed GhostNet. Second, in the enhanced feature extraction network of YOLOv4, ordinary convolution is replaced with a combination of depthwise separable and ordinary convolution. Finally, hyperparameter optimization is carried out. The experimental results show that the improved YOLOv4 network proposed in this paper has better object detection performance. Specifically, its precision, recall, F1, mAP (0.5), and mAP (0.75) values are 88.89%, 87.12%, 88.00%, 86.84%, and 50.91%, respectively. Although its mAP (0.5) value is 2.23% lower than that of the original YOLOv4, it exceeds those of YOLOv4_tiny, EfficientDet-d0, YOLOv5n, and YOLOv5s by 29.34%, 28.99%, 20.36%, and 18.64%, respectively. In addition, the proposed network outperforms YOLOv4 in terms of mAP (0.75) value and precision, and its model size is only 42.5 MB, an 82.58% reduction compared with YOLOv4.

1. Introduction

Object detection can be regarded as the combination of target localization and image classification. Target localization usually involves only one or a fixed number of targets, whereas the type and number of objects in an object detection image are not fixed.
Before 2013, most object detection relied on manually extracted features: detection accuracy was gradually improved by building complicated models and multi-model ensembles on top of low-level feature representations. Researchers [1,2,3,4,5,6,7] noticed that CNNs could learn robust and expressive feature representations, as demonstrated in the 2012 ILSVRC image classification task. Girshick et al. therefore proposed the regions with CNN features (R-CNN) model [8] in 2014, and from that point on, object detection research accelerated at an unparalleled rate. The significant contribution of R-CNN is the introduction of deep learning into object detection, boosting the mAP on the Pascal VOC 2007 dataset from 35.1% to 66.0%. Because the input and output sizes of the fully connected layers after the convolutional layers are fixed, the CNN that R-CNN feeds each candidate region into likewise requires a fixed input size, which prevents the input image size from being adjusted arbitrarily. Furthermore, since candidate regions frequently overlap, feeding each candidate region into the CNN separately results in a large number of repeated computations. In 2014, Kaiming He et al. proposed SPP-Net [9] (spatial pyramid pooling network) as a solution to these two issues. Girshick et al. proposed fast R-CNN [10], an enhanced version of R-CNN, in 2015. Similar to the SPP layer, fast R-CNN pools each region of interest with a RoI (region of interest) pooling layer, and it folds the classification and bounding box regression stages into one deep network for joint training. While fast R-CNN improves speed and accuracy, it still requires an external algorithm to extract target candidate boxes. Soon after fast R-CNN was proposed, Ren Shaoqing et al. proposed the faster R-CNN model [11], which integrates the extraction of object candidate boxes into the deep network. Faster R-CNN is the first truly end-to-end deep learning object detection method, as well as the first quasi-real-time deep learning object detection algorithm; its significant innovation is the design of a region proposal network (RPN). In 2016, Dai Jifeng et al. introduced the R-FCN (region-based fully convolutional network) model [12], in which position-sensitive RoI pooling was added to improve object detection accuracy. From R-CNN to R-FCN, these are all detection methods based on candidate regions, and they are usually implemented in two steps: the two-stage approach first extracts deep features from an image and computes candidate regions, and then localizes and classifies each candidate region. Despite their excellent detection accuracy, their speed still falls short of real-time requirements.
To enable object detection to satisfy real-time requirements, researchers proposed one-stage object detection techniques. The approach of “coarse detection + refinement” is abandoned and replaced by “anchor point + correction”: the detector performs only one feedforward network pass, which is very fast and can achieve real-time operation. Joseph Redmon and Girshick et al. introduced the first one-stage object detection algorithm, YOLO [13] (you only look once), in 2015. YOLO demonstrated a new level of speed, reaching up to 45 frames per second. However, its limitations are clear: the network can detect only a limited number of targets, and its performance on small objects is poor. Wei Liu et al. proposed the SSD [14] method in the same year. SSD incorporates YOLO’s fast detection concept, merges the advantages of the RPN in faster R-CNN, and improves the handling of targets at multiple scales. Most earlier methods make predictions using semantically rich high-level convolutional features; however, high-level features lose some detailed information, which blurs target locations. Tsung-Yi Lin et al. proposed FPN [15] (feature pyramid networks) at the end of 2016. FPN improves the network’s ability to handle small targets by combining high-level, low-resolution, semantically strong features with low-level, high-resolution, semantically weak features, and it can be used in conjunction with either a one-stage or a two-stage object detection approach. At the same time, YOLOv2 [16] was released, which added a batch normalization layer after each convolutional layer to speed up convergence and removed the fully connected layers from the network.
Although one-stage object detection is significantly faster than two-stage object detection, its detection accuracy is not as good. As a result, in 2017, Tsung-Yi Lin et al. proposed the RetinaNet [17] detection model, whose focal loss function decreases the learning weight of easy background samples during network training. Zhang Shifeng et al. proposed RefineDet [18] in late 2017, which introduces the ARM (anchor refinement module), the ODM (object detection module), and the TCB (transfer connection block). YOLOv3 [19] was released in early 2018, replacing the softmax function with multiple independent classifiers and making predictions at multiple scales in a manner comparable to the feature pyramid network.
Even though both two-stage and one-stage object detection algorithms have made significant progress on high-performance devices, they require considerable computing resources to run, which makes them challenging to deploy on resource-constrained devices. Lightweight object detection algorithms, such as YOLOv3_tiny [20], YOLOv4_tiny [21], MobileNetv1 [22], MobileNetv2 [23], MobileNetv3 [24], and others [25,26,27,28,29,30], have been proposed for this problem. However, detection accuracy suffers as a result of compressing the network model, making it difficult to detect complicated scenes and small target objects.
With the continuous development of object detection technology and robotics, many object detection methods have been deployed on robots and applied in various fields, for example, agriculture [31,32,33,34,35,36], industry [37,38,39,40], health [41], unmanned vehicles [42], and sports [43,44].
The RoboMaster AI Challenge brings together deep neural network enthusiasts from all around the world to research robotics, and it is envisaged that the findings will be applied to field rescue, unmanned driving, autonomous logistics, and other industries in order to benefit human life. In the RoboMaster AI Challenge, a wheeled mobile robot must be able to recognize, track, and shoot targets, and the key to success is making the object detection algorithm running on the mobile robot maintain detection accuracy under limited computing resources. This paper takes this task as the specific application context for the study.
The main contributions of our work can be summarized as follows:
(1)
This study proposes an improved YOLOv4 object detection algorithm that replaces the YOLOv4 backbone feature extraction network with a trimmed GhostNet lightweight network. The algorithm improves performance compared with YOLOv4-related algorithms while retaining as much accuracy as possible.
(2)
This study uses a combination of depthwise separable convolution and ordinary convolution to replace ordinary convolution, which significantly reduces the number of parameters and makes the improved network more lightweight and efficient, suitable for deployment on resource-constrained mobile robotic devices and with great potential for applications.
(3)
To demonstrate the feasibility of the algorithm, we experimentally validate object detection on a wheeled mobile robot in the application context of DJI’s RoboMaster AI Challenge.
(4)
The experimental dataset in this paper is from DJI. The scenes in this dataset are relatively complex, it contains many small target objects, and, as an official dataset, it is valuable for research. We also design a method to enhance the dataset by image processing and demonstrate experimentally that it effectively improves detection accuracy.

2. Materials and Methods

2.1. Experimental Device

The robot platform used in this paper is the generic omnidirectional wheeled mobile robot development platform created for the ICRA RoboMaster AI Challenge, as shown in Figure 1. There are five modules in total: chassis, expansion, gimbal, launch, and referee system. The robot chassis module has a set of Mecanum wheels to implement the robot’s movement (the Mecanum wheel is a type of omnidirectional wheel that allows for omnidirectional movement in both the X and Y directions as well as for combined movement). The robot expansion module is an expansion platform mounted on the chassis module with an external controller; two controllers are used in this paper, the Manifold 2-G and the Dell OptiPlex 3060 (the Dell OptiPlex 3060 is manufactured by Dell and originates from China), and their specific parameters are shown in Table 1. The robot gimbal module can perform pitch and yaw rotation, providing flexibility in confrontation; the robot launch mechanism module can launch RoboMaster 17 mm projectiles; users can combine these modules flexibly to create a unique autonomous robot solution. We chose the DaHeng Imaging Mercury series 1.31 MP MER-131-75GM/GC industrial camera as the experimental camera.

2.2. Object Detection Algorithm Design

2.2.1. YOLOv4 Structure

The YOLOv4 [45] network is based on the YOLOv3 network and shows significant improvements in identifying small and blurred targets, with AP and FPS increasing by 10% and 12%, respectively. On the input side, YOLOv4 enriches the data by randomly scaling, cropping, and arranging four images into one (Mosaic augmentation), which greatly enhances the robustness of the network, enriches the detection dataset, and increases the number of small targets. In the backbone, YOLOv4 builds CSPDarknet53, whose structure splits Darknet53 into two parts: the backbone part continues the original stack of residual blocks, while the other part is connected directly to the end, similar to a residual edge, with a small amount of processing. Its structure is shown in Figure 2. YOLOv4 also changes the activation function of DarknetConv2D from LeakyReLU to Mish. Since the Mish activation function is unbounded above (positive values can reach any height), it avoids saturation due to capping, so there is no gradient vanishing problem during training. In addition, the smoothness of the Mish activation function gives it better optimization behavior and model generalization than other activation functions. The equation is as follows, with x being the input value:
$\mathrm{Mish}(x) = x \times \tanh\left(\ln\left(1 + e^{x}\right)\right).$
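For reference, the activation above can be written as a minimal PyTorch module (recent PyTorch versions also ship torch.nn.Mish; the manual form is shown here only to mirror the equation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish(x) = x * tanh(ln(1 + exp(x))) = x * tanh(softplus(x))."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.tanh(F.softplus(x))
```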
The enhanced feature extraction network uses the SPP structure and the PANet structure, which greatly increase the receptive field and separate out the most significant contextual features. The prediction network, Yolo Head, uses the resulting effective feature layers to obtain predictions. The full network structure of YOLOv4 is shown in Figure 3.
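As an illustration, the SPP block concatenates the input with max-pooled copies of itself at several scales; the sketch below assumes the 5/9/13 kernel sizes commonly used in YOLOv4 implementations and is not taken verbatim from the authors’ code:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pooling with stride 1 and
    'same' padding, followed by channel-wise concatenation with the input."""
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )

    def forward(self, x):
        # Output channels = in_channels * (len(pool_sizes) + 1)
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```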

2.2.2. GhostNet Structure

GhostNet [46] is a lightweight network model that employs Ghost modules instead of regular convolutions. The Ghost module works in three steps: convolution, Ghost generation, and feature map combination. The first step uses a conventional convolution to obtain intrinsic feature maps. Then, each channel of the intrinsic feature map is processed by a depthwise separable convolution to generate the Ghost feature maps. Finally, the intrinsic feature maps obtained in the first step are concatenated with the Ghost feature maps obtained in the second step to form the final output. The Ghost module is shown in Figure 4.
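A minimal PyTorch sketch of these three steps is given below; the split ratio, kernel sizes, and batch-norm/ReLU choices follow common GhostNet implementations and are assumptions rather than values stated in this article:

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Step 1: ordinary convolution produces the intrinsic feature maps.
    Step 2: a cheap depthwise convolution generates the 'ghost' feature maps.
    Step 3: intrinsic and ghost maps are concatenated to form the output."""
    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3):
        super().__init__()
        init_ch = math.ceil(out_ch / ratio)      # intrinsic channels
        ghost_ch = init_ch * (ratio - 1)         # ghost channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, ghost_ch, dw_size, padding=dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True))
        self.out_ch = out_ch

    def forward(self, x):
        x1 = self.primary(x)          # intrinsic feature maps
        x2 = self.cheap(x1)           # ghost feature maps
        return torch.cat([x1, x2], dim=1)[:, :self.out_ch]
```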
The Ghost bottleneck is similar to the basic residual block in ResNet in that it integrates multiple convolutional layers; it consists mainly of a stack of two Ghost modules. The Ghost bottleneck is shown in Figure 5.
The Ghost bottleneck significantly reduces the computational volume and size of the model and achieves a good trade-off among computation, memory footprint, and accuracy. The GhostNet network structure is shown in Figure 6.

2.2.3. Improved Network

GhostNet was first used to find effective feature layers of the same height and width as those of YOLOv4’s CSPDarknet53, and it was then pruned according to these feature layers, as shown in Figure 7. The trimmed GhostNet then replaces CSPDarknet53, and its feature layers are fed into the enhanced feature extraction network to extract the enhanced features.
Although replacing the CSPDarknet53 backbone feature extraction network with GhostNet reduces the number of parameters, most of the computation is still performed in the enhanced feature extraction network PANet. Depthwise separable convolution combines a depthwise (DW) convolution and a pointwise (PW) convolution to extract feature maps, with fewer parameters and a lower computational cost than a conventional convolution operation. Thus, we replaced the conventional convolutions in PANet with a combination of depthwise separable and conventional convolutions.
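As a concrete illustration, the sketch below shows an ordinary conv2d block, a depthwise separable conv_dw block, and the conv_three pattern listed in the caption of Figure 8; the normalization and activation choices are assumptions for illustration, not the authors’ exact implementation:

```python
import torch.nn as nn

def conv2d(in_ch, out_ch, k):
    """Ordinary convolution + BN + LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1, inplace=True))

def conv_dw(in_ch, out_ch):
    """Depthwise 3x3 (groups = in_ch) followed by a pointwise 1x1 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1, inplace=True))

def conv_three(in_ch, mid_ch=512):
    """conv2d -> conv_dw -> conv2d, matching conv_three_1/2 in Figure 8."""
    return nn.Sequential(
        conv2d(in_ch, mid_ch, 1),
        conv_dw(mid_ch, mid_ch * 2),
        conv2d(mid_ch * 2, mid_ch, 1))

# Example: conv_three(160) reproduces conv2d(160,512,1), conv_dw(512,1024), conv2d(1024,512,1).
```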
Based on the specificity of the application and algorithm, we named the improved algorithm YOLOV4_RMLight. Figure 8 depicts the structure of the final improved network.

2.3. Dataset

This experiment used the dataset named “The DJI Robomaster Objects in Context (DJI ROCO)”, which is the official RoboMaster dataset released by DJI in 2019. We used 2065 tagged photos from the “Robomaster Central China Competition” folder as the base dataset; an example figure is shown in Figure 9. The dataset uses the Pascal VOC dataset format and consists of three parts, JPEGImages (which holds all the images from the training and testing), Annotations (which holds the XML file corresponding to each image after it has been tagged), and ImageSets (which has a main folder containing txt files with the names of the images, divided into training and testing). The objects to be recognized are divided into five categories in total, namely, car (robot), watcher (sentry robot), base (base), ignore (ignore), and armor (armor plate). The recognition effect is shown in Figure 10.

2.4. Data Augmentation

The dataset images are processed with random combinations of the following image processing methods to form new images: (1) Gaussian noise; (2) brightness change; (3) image translation; (4) image rotation; (5) image flipping; (6) image scaling; and (7) cutout. Training with the enhanced dataset increases the robustness of the network. We tripled the original dataset by this method, and the experiments demonstrate that identification accuracy improves significantly after data augmentation.
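For illustration only, the sketch below shows how such random combinations could be assembled with torchvision transforms; the specific operations, probabilities, and magnitudes are assumptions (Gaussian noise and cutout are approximated by GaussianBlur and RandomErasing), and for object detection the bounding-box annotations must be transformed consistently with the geometric operations:

```python
import random
from torchvision import transforms

# Candidate operations loosely corresponding to methods (1)-(6) above.
ops = [
    transforms.GaussianBlur(kernel_size=3),                    # (1) noise-like perturbation
    transforms.ColorJitter(brightness=0.4),                    # (2) brightness change
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # (3) translation
    transforms.RandomRotation(degrees=15),                     # (4) rotation
    transforms.RandomHorizontalFlip(p=1.0),                    # (5) flipping
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),      # (6) scaling
]

def augment(img):
    """Apply a random subset of the operations, then cutout-style erasing (7)."""
    chosen = random.sample(ops, k=random.randint(1, 3))
    pipeline = transforms.Compose(
        chosen + [transforms.ToTensor(), transforms.RandomErasing(p=0.5)])
    return pipeline(img)
```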

2.5. Experimental Design

2.5.1. Effectiveness Verification Experiments Design

To demonstrate the effectiveness of the improved method, we use the idea of the controlled variable method to design effectiveness experiments.
(1)
The YOLOv4, SSD, and YOLOv5x are trained separately using the dataset, and the evaluation results are obtained to prove the effectiveness of YOLOv4.
(2)
The original data were augmented using the data augmentation method in Section 2.4. The yolov4_RMLight network was trained with the original data and the augmented data separately, and comparing the results before and after augmentation demonstrates the effectiveness of the data augmentation method. We named the model trained with the augmented data yolov4_RMLight_ADD.
(3)
The yolov4_RMLight_ADD backbone feature extraction network remains unchanged, and its enhanced feature extraction network is restored to the original YOLOv4 enhanced feature extraction network. This is used to demonstrate the effectiveness of the improvements to the enhanced feature extraction network.
(4)
The yolov4_RMLight_ADD enhanced feature extraction network is unchanged, and its backbone network is replaced with MobileNetv1, MobileNetv2, MobileNetv3, ResNet [47], and VGG [48] to obtain the evaluation results and prove the effectiveness of using GhostNet.

2.5.2. Experimental Process Design

The algorithm models were trained on a machine with an Intel(R) Xeon(R) E5-2678 v3 CPU @ 2.50 GHz, a GeForce RTX 3090 graphics card, and 32 GB of memory, using the PyTorch deep learning framework. Two devices were chosen as mobile robot controllers (as shown in Table 1) to test the performance of the algorithm on different hardware. The object detection experiment on the wheeled mobile robot consists of two main parts, the training process and the testing process, and the whole experimental flow is shown in Figure 11.
Training process. First, we downloaded the dataset from the official RoboMaster website. Second, the dataset was augmented using the data augmentation method in Section 2.4. Then, a cross-validation procedure was used to divide the dataset into a training set and a test set in a 9:1 ratio, and the training set was further divided into a training set and a validation set in a 9:1 ratio. Finally, the training set was input into the network model for training with batch size = 32 and epoch = 300. The training loss curves are shown in Figure 12.
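The 9:1/9:1 division described above can be sketched as follows; the file path in the usage comment is a placeholder rather than the authors’ actual dataset layout:

```python
import random

def split_dataset(image_ids, seed=0):
    """Split IDs 9:1 into train+val and test, then split train+val 9:1 into train and val."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_test = len(ids) // 10
    test, trainval = ids[:n_test], ids[n_test:]
    n_val = len(trainval) // 10
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test

# Example (placeholder path):
# train, val, test = split_dataset(open("ImageSets/Main/all.txt").read().split())
```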
Testing process. Firstly, the trained optimal model is deployed to the mobile robot device. Secondly, cameras are connected to mobile robotic devices. Then, the test set is input into a network model on a mobile robot device for testing. Finally, accuracy tests and performance tests are performed, and the experimental results are saved.

2.5.3. Evaluating Indicator

This research evaluates the algorithm’s quality in two ways: the accuracy of the identification and the performance of the algorithm.
The mAP (mean average precision) is commonly used to assess the algorithm’s accuracy. To compute it, the precision and recall must first be determined as follows:
$\mathrm{precision} = \frac{TP}{TP + FP},$
$\mathrm{recall} = \frac{TP}{TP + FN}.$
In the formulas, TP (true positive) denotes samples classified as positive that are actually positive; FP (false positive) denotes samples classified as positive that are actually negative; and FN (false negative) denotes samples classified as negative that are actually positive. The AP is calculated over the n samples of a given class: assuming the class has m positive cases, each positive case corresponds to a recall value (1/m, 2/m, ..., 1). The maximum precision P is calculated at each of these recall values, and the m precision values are then averaged. The expression for calculating the AP is as follows:
$\mathrm{AP} = \frac{1}{m}\sum_{i=1}^{m} P_i = \frac{1}{m}P_1 + \frac{1}{m}P_2 + \cdots + \frac{1}{m}P_m.$
While the AP is specific to a single class, a dataset often comprises many classes, and the mAP is calculated by averaging the AP over all classes in the dataset. The formula for the mAP is as follows:
$\mathrm{mAP} = \frac{1}{C}\sum_{j=1}^{C} \mathrm{AP}_j.$
The object detection task also includes a bounding box regression task, whose accuracy is commonly measured by the IoU. The IoU is calculated using the following formula:
$\mathrm{IoU} = \frac{S_1}{S_2}.$
where S1 is the area of overlap between the predicted box and the actual box, and S2 is the total area occupied by the predicted box and the actual box (their union). Thus, mAP (0.5) refers to the value of the mAP at IoU = 0.5, and mAP (0.75) refers to the value of the mAP at IoU = 0.75.
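For axis-aligned boxes in (x1, y1, x2, y2) format, the IoU defined above can be computed as in the following sketch:

```python
def iou(box_a, box_b):
    """IoU = S1 / S2 = intersection area / union area for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)          # S1
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                            # S2
    return inter / union if union > 0 else 0.0
```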
The F1-score is a metric for classification problems and is often used as a final metric for multi-class problems; it is the harmonic mean of precision and recall. The F1-score can be calculated using the following formula:
$F1 = \frac{2 \times \mathrm{recall} \times \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}}.$
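These three quantities follow directly from the TP/FP/FN counts; a minimal sketch is shown below (the counts in the usage comment are illustrative, not taken from the experiments):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: precision_recall_f1(87, 11, 13) -> approximately (0.888, 0.870, 0.879)
```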
As the performance metric, we use the time taken to compute the mAP, because this time comprises both the time needed for inference on the images and the time needed to calculate the quantities above; it therefore reflects the algorithm’s time performance on a variety of devices. We also use model size, number of parameters, and flops as measures of the algorithm’s performance on hardware.

3. Results

3.1. Effectiveness Experiment Results

In order to further analyze the effectiveness of the application methods proposed in this paper, we used the experimental setup in Section 2.5.1 and measured these methods through evaluation metrics.
The experimental results of the effectiveness of YOLOv4 are shown in Table 2. Compared to mainstream object detection networks such as SSD and YOLOv5x, YOLOv4 performs very well on this dataset. The mAP (0.5) values for YOLOv4 are 31.27% and 20.11% higher than SSD and YOLOv5x, respectively. Moreover, the flops value of YOLOv4 is lower than both of them, indicating that the model is relatively less computationally intensive.
Data augmentation experiment results are shown in Table 3. In the data augmentation experiments with yolov4_RMLight, the values of mAP (0.5) and mAP (0.75) increased by 5.49% and 6.11%, respectively, after augmentation, which proves the effectiveness of the data augmentation method.
The experimental results to improve the effectiveness of the enhanced feature extraction network are shown in Table 4. It is evident from the results that the improvements in the enhanced feature extraction network substantially reduce the number and size of parameters in the model. This makes the improved network much lighter.
GhostNet was used as the backbone feature extraction network in yolov4_RMLight_ADD. To verify the effectiveness of GhostNet, the backbone feature network was replaced using an existing mainstream classification network. The experimental results are shown in Table 5. The experimental results clearly show that GhostNet has a clear advantage as a backbone feature extraction network in terms of both accuracy and performance. The non-lightweight Resnet50 and VGG16 far exceeded the lightweight networks in terms of computational effort and model size, which made deployment on mobile robotic devices more difficult, and several popular lightweight networks do not perform as well as GhostNet in terms of accuracy. GhostNet was therefore chosen as the backbone extraction network.

3.2. Contrast Experimental Results

In order to better analyze our proposed improved algorithm, we conducted comparative experiments using some lightweight object detection algorithms. The quantitative analysis was also carried out in terms of both the accuracy of recognition and the performance of the algorithms. The results of the comparison experiments are shown in Table 6 and Table 7.
Based on the results of the contrast experiments, the following conclusions can be reached.
(1)
In Table 6, the proposed model’s precision, recall, mAP (0.5), mAP (0.75), and F1-score are 88.89%, 87.12%, 86.84%, 50.91%, and 88.00%, respectively.
(2)
From Table 6, we can see that our proposed yolov4_RMLight_ADD model has high precision and mAP (0.5) values: its precision is higher than that of the lightweight networks YOLOv4_tiny, EfficientDet-d0 [49], YOLOv5n, and YOLOv5s by 30.97%, 10.06%, 13.17%, and 13.01%, respectively, and its mAP (0.5) value is higher by 29.34%, 28.99%, 20.36%, and 18.64%, respectively. Its precision is also 0.26% higher than that of the large network YOLOv4, while its mAP (0.5) value is only 2.23% lower than that of YOLOv4. This shows that our proposed yolov4_RMLight_ADD model has excellent object detection accuracy.
(3)
From Table 6, we can see that our proposed yolov4_RMLight_ADD model has higher recall and F1 values: its recall is 30.84%, 33.85%, 23.75%, and 21.65% higher than that of the lightweight networks YOLOv4_tiny, EfficientDet-d0, YOLOv5n, and YOLOv5s, respectively, and its F1 value is 31%, 31.8%, 20.8%, and 19% higher, respectively. Its recall and F1 values are only 2.36% and 1.4% lower than those of YOLOv4. This shows that our proposed yolov4_RMLight_ADD model is effective for object recognition.
(4)
As can be seen from Table 7, the model size of our proposed yolov4_RMLight_ADD is 42.5 MB, the number of parameters is 10.8 million, and the flops are 3.53 G. Compared with YOLOv4, the model size is 201.5 MB smaller, the number of parameters is 83.21% lower, and the flops are 88.30% lower. Although the model size is 20.25 MB, 27.5 MB, 35.55 MB, and 15.4 MB larger than that of the lightweight networks YOLOv4_tiny, EfficientDet-d0, YOLOv5n, and YOLOv5s, respectively, the flops are similar to those of the lightweight networks, which means that the computational consumption on the hardware is about the same.
(5)
Overall, among the lightweight networks, the yolov4_RMLight_ADD model proposed in this paper has the highest precision, recall, mAP values, and F1-score; moreover, although its model size and number of parameters are larger than those of the other lightweight networks, its flops are similar. Considering the computational performance of the mobile robot device and the fact that the recognition accuracy of the yolov4_RMLight_ADD model is far above that of the other lightweight networks, the yolov4_RMLight_ADD model is more suitable for deployment on mobile robots.
In addition to detection accuracy, the time performance of the object detection algorithm on mobile robotic devices is also a key metric. We deployed the object detection models on different devices and measured the time needed to calculate the mAP value on each, as shown in Figure 13. We also used the camera on the robot to test the real-time detection capability of the algorithms; Figure 14 shows the FPS test results on the different devices.
Based on the experimental results of the time performance, the following conclusions can be drawn.
(1)
As can be seen from Figure 13, the three models consume similar amounts of time on the PC. On the Manifold 2-G device, our proposed model takes 100.12 s, which is 53.09% less than YOLOv4 and only 24.24 s more than the lightweight network YOLOv4_tiny. On the Dell OptiPlex 3060 device, our proposed model takes 534.55 s, which is 88.62% less than YOLOv4 and only 434.42 s more than the lightweight network YOLOv4_tiny. It can be seen that on high-performance devices the time spent by the different models is similar, whereas on resource-limited devices, the fewer computational resources the device has, the greater the gap in computation time between algorithms of different sizes. It is clear that our proposed model drastically reduces the computation time relative to the large network YOLOv4; although its time is not as small as that of the lightweight network YOLOv4_tiny, the two are of the same order of magnitude. Combined with its accuracy, our proposed algorithm is more advantageous.
(2)
As can be seen from Figure 14, our proposed model achieves close to 30 FPS in real-time detection on the PC, while YOLOv4 only maintains around 20 FPS and the lightweight network YOLOv4_tiny reaches around 45 FPS. On the Manifold 2-G device, our proposed model achieves close to 10 FPS, while YOLOv4 only stays at around 2 FPS and YOLOv4_tiny achieves around 25 FPS. On the Dell OptiPlex 3060 device, our proposed model stays at around 2 FPS, while YOLOv4 drops to essentially 0 FPS and YOLOv4_tiny only reaches around 5 FPS. It is clear that the large model YOLOv4 is largely unusable on resource-constrained devices, whereas our proposed model, although not as fast as the lightweight network, remains generally usable. Combined with its accuracy, our proposed model is very advantageous.

4. Conclusions

This paper presents an object detection algorithm based on a combination of an improved YOLOv4 and an improved GhostNet. The algorithm maximizes detection efficiency while maintaining high detection accuracy. The improvements are as follows: the backbone feature extraction network of YOLOv4 is replaced with a trimmed portion of GhostNet, the ordinary convolutions in the enhanced feature extraction network are replaced with depthwise separable convolutions, and a data augmentation method is devised to extend the dataset. We designed effectiveness experiments using the idea of the controlled variable method, and the experimental results showed that each of our proposed improvements is effective.
We also compared the algorithm with other existing lightweight object detection networks: our precision, recall, mAP (0.5), and F1 values are the highest among the lightweight networks, and although the model size is slightly larger than theirs, the flops are similar. In the real-time performance tests, it is also in the same order of magnitude as the lightweight networks. Combined with an accuracy far higher than that of the other lightweight networks, our proposed yolov4_RMLight_ADD is the better choice.
However, the proposed algorithm has some limitations: its performance in real-time detection tasks needs to be improved, and it needs to be studied in conjunction with robot control tasks. Future work will focus on these two points.

Author Contributions

Conceptualization, Y.H. and J.G.; methodology, Y.H., Z.C. and J.G.; software, Y.H. and G.L.; validation, Y.H., G.L. and Z.C.; formal analysis, Y.H.; investigation, J.G. and G.L.; resources, Y.H.; data curation, G.L. and J.G.; writing—original draft preparation, Y.H.; writing—review and editing, G.L., J.G. and Z.C.; visualization, Y.H.; supervision, G.L.; project administration, J.G.; funding acquisition, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study was partially funded by the key research projects of the Science and Technology Department of Jilin Province 20210201113GX and 20200401127GX; partially funded by key projects of Education Department of Jilin Province JKH20210754KJ.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Conference and Workshop on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
  2. Khan, S.; Rahmani, H.; Shah, S.A.A.; Bennamoun, M. A guide to convolutional neural networks for computer vision. Synth. Lect. Comput. Vis. 2018, 8, 1–207. [Google Scholar] [CrossRef]
  3. Nayak, R.; Manohar, N. Computer-Vision based Face Mask Detection using CNN. In Proceedings of the 2021 6th International Conference on Communication and Electronics Systems (ICCES), Virtual, 14–23 June 2021; pp. 1780–1786. [Google Scholar]
  4. Dorrer, M.; Tolmacheva, A. Comparison of the YOLOv3 and Mask R-CNN architectures’ efficiency in the smart refrigerator’s computer vision. J. Phys. Conf. Ser. 2020, 1679, 42022. [Google Scholar] [CrossRef]
  5. Dimitri, G.M.; Spasov, S.; Duggento, A.; Passamonti, L.; Toschi, N. Unsupervised stratification in neuroimaging through deep latent embeddings. In Proceedings of the 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada, 20–24 July 2020; pp. 1568–1571. [Google Scholar]
  6. Havaei, M.; Davy, A.; Warde-Farley, D.; Biard, A.; Courville, A.; Bengio, Y.; Pal, C.; Jodoin, P.M.; Larochelle, H. Brain tumor segmentation with deep neural networks. Med. Image Anal. 2017, 35, 18–31. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented r-cnn for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
  8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2014; pp. 580–587. [Google Scholar]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  10. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Conference and Workshop on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  12. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of the Conference and Workshop on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  15. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  16. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  17. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2980–2988. [Google Scholar]
  18. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212. [Google Scholar]
  19. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  20. Adarsh, P.; Rathi, P.; Kumar, M. YOLO v3-Tiny: Object Detection and Recognition using one stage improved model. In Proceedings of the 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), New York, NY, USA, 4–6 January 2020; pp. 687–694. [Google Scholar]
  21. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-yolov4: Scaling cross stage partial network. In Proceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038. [Google Scholar]
  22. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  23. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  24. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  25. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  26. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  27. Xiong, Y.; Liu, H.; Gupta, S.; Akin, B.; Bender, G.; Wang, Y.; Kindermans, P.J.; Tan, M.; Singh, V.; Chen, B. Mobiledets: Searching for object detection architectures for mobile accelerators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3825–3834. [Google Scholar]
  28. Huang, G.; Liu, S.; Van der Maaten, L.; Weinberger, K.Q. Condensenet: An efficient densenet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2752–2761. [Google Scholar]
  29. Wong, A.; Famuori, M.; Shafiee, M.J.; Li, F.; Chwyl, B.; Chung, J. Yolo nano: A highly compact you only look once convolutional neural network for object detection. In Proceedings of the 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), Vancouver, BC, Canada, 13 December 2019; pp. 22–25. [Google Scholar]
  30. Qin, Z.; Li, Z.; Zhang, Z.; Bao, Y.; Yu, G.; Peng, Y.; Sun, J. ThunderNet: Towards real-time generic object detection on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 6718–6727. [Google Scholar]
  31. Yang, H.; Chen, L.; Ma, Z.; Chen, M.; Zhong, Y.; Deng, F.; Li, M. Computer vision-based high-quality tea automatic plucking robot using Delta parallel manipulator. Comput. Electron. Agric. 2021, 181, 105946. [Google Scholar] [CrossRef]
  32. Yang, H.; Chen, L.; Chen, M.; Ma, Z.; Deng, F.; Li, M.; Li, X. Tender tea shoots recognition and positioning for picking robot using improved YOLO-V3 model. IEEE Access 2019, 7, 180998–181011. [Google Scholar] [CrossRef]
  33. Kuznetsova, A.; Maleva, T.; Soloviev, V. Using YOLOv3 algorithm with pre-and post-processing for apple detection in fruit-harvesting robot. Agronomy 2020, 10, 1016. [Google Scholar] [CrossRef]
  34. Hu, X.; Liu, Y.; Zhao, Z.; Liu, J.; Yang, X.; Sun, C.; Chen, S.; Li, B.; Zhou, C. Real-time detection of uneaten feed pellets in underwater images for aquaculture using an improved YOLO-V4 network. Comput. Electron. Agric. 2021, 185, 106135. [Google Scholar] [CrossRef]
  35. Gai, R.; Chen, N.; Yuan, H. A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput. Appl. 2021, 1–12. [Google Scholar] [CrossRef]
  36. Wu, D.; Lv, S.; Jiang, M.; Song, H. Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 2020, 178, 105742. [Google Scholar] [CrossRef]
  37. Zhao, K.; Wang, Y.; Zuo, Y.; Zhang, C. Palletizing Robot Positioning Bolt Detection Based on Improved YOLO-V3. J. Intell. Robot. Syst. 2022, 104, 1–12. [Google Scholar] [CrossRef]
  38. Li, S.; Zhan, J.; Lian, H.; Huang, M.; Gao, X.; Lu, Z.; Xu, W.; Xu, G. Indoor vision navigation and target tracking system for aerial robot. In Proceedings of the 2020 2nd International Conference on Artificial Intelligence and Advanced Manufacture (AIAM), Manchester, UK, 15–17 October 2020; pp. 57–62. [Google Scholar]
  39. Xiang, H.; Cheng, L.; Wu, H.; Chen, Y.; Gao, Y. Mobile Robot Automatic Aiming Method Based on Binocular Vision. In Proceedings of the 2021 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021; pp. 4150–4156. [Google Scholar]
  40. Tang, X.; Leng, C.; Guan, Y.; Hao, L.; Wu, S. Development of tracking and control system based on computer vision for roboMaster competition robot. In Proceedings of the 2020 5th International Conference on Advanced Robotics and Mechatronics (ICARM), Shenzhen, China, 18–21 December 2020; pp. 442–447. [Google Scholar]
  41. Li, Y.; Yan, J.; Hu, B. Mask detection based on efficient-YOLO. In Proceedings of the 2021 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021; pp. 4056–4061. [Google Scholar]
  42. Sahib, F.A.A.; Taher, H.; Ghani, R.F. Detection of the autonomous car robot using Yolo. J. Phys. Conf. Ser. 2021, 1879, 32129. [Google Scholar] [CrossRef]
  43. Cao, Z.; Liao, T.; Song, W.; Chen, Z.; Li, C. Detecting the shuttlecock for a badminton robot: A YOLO based approach. Expert Syst. Appl. 2021, 164, 113833. [Google Scholar] [CrossRef]
  44. Gu, S.; Chen, X.; Zeng, W.; Wang, X. A deep learning tennis ball collection robot and the implementation on nvidia jetson tx1 board. In Proceedings of the 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Auckland, New Zealand, 9–12 July 2018; pp. 170–175. [Google Scholar]
  45. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  46. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  47. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  48. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  49. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
Figure 1. Robot platform. The characters and numbers in the diagram represent equipment modules, the details of which are shown below: 1. Referee system speed measurement module SM01. 2. Launch mechanism module. 3. 2-axis gimbal module. 4. ammunition supply module. 5. Referee system light bar module LI01. 6. Power module. 7. Referee System Master Control Module MC02. 8. Referee System Armour Module AM02. 9. TB47S Battery. A. Camera. B. Li-DAR. C. Referee System Positioning Module UW01.
Figure 2. CSPDarknet53 structure. The figure on the left shows the structure of Darknet53, while the figure on the right shows the CSPDarknet53 structure with improvements made to the Darknet53 structure.
Figure 3. YOLOv4 network structure. CSPDarknet53 is the backbone feature extraction network for YOLOv4. SPP and PANet are enhanced feature extraction networks for YOLOv4. Yolo Head is the detection head of the network. The left side of the figure shows the detailed structure of the backbone feature extraction network.
Figure 4. Ghost module structure. The letter “a” in the figure represents the depth-separable convolution operation.
Figure 5. Ghost bottleneck. (Left): Ghost bottleneck with stride = 1; (right): Ghost bottleneck with stride = 2.
Figure 6. GhostNet network structure. G-bneck represents the Ghost bottleneck.
Figure 7. Improved GhostNet network structure. The improved GhostNet trims the four G-bneck modules, the AvgPool layer, and the FC layer from the original GhostNet.
Figure 8. Improved YOLOv4 network structure. The detailed parameters of the improved convolution module are as follows: conv_three_1: conv2d(160,512,1), conv_dw(512,1024), conv2d(1024,512,1); conv_three_2: conv2d(2048,512,1), conv_dw(512,1024), conv2d(1024,512,1); conv_five_1:conv2d (256,128,1), conv_dw(128,256), conv2d(256,128,1), conv_dw(128,256), conv2d(256,128,1); conv_five_2: conv2d(512,256,1), conv_dw(256,512), conv2d(512,256,1), conv_dw(256,512), conv2d(512,256,1); conv_five_3:conv2d(1024,512,1), conv_dw(512,1024), conv2d(1024,512,1), conv_dw(512,1024), conv2d (1024,512,1); conv_five_4:conv2d(512,256,1), conv_dw(256,512), conv2d(512,256,1), conv_dw(256,512), conv2d(512,256,1). The conv2d represents the ordinary convolution and the conv_dw represents the depth-separable convolution.
Figure 9. Example figure of the dataset. There are four objects to be identified in the figure. They are car, base, and watcher, and armor is among the above objects.
Figure 10. Identification renderings. This figure shows the recognition effect of Figure 9, with rectangular boxes and text indicating the objects to be recognized.
Figure 11. Experimental flowchart. The experimental process is divided into a training process and a testing process. The training process mainly includes the manipulation of the dataset, the tuning of parameters, and the training of the model network. The testing process mainly consists of configuring the equipment and testing.
Figure 12. The convergence curve of the loss function for different models. (a) Represents the loss function plot for the training of our proposed network model. (b) Represents the loss function plot for the training of the YOLOv4 network model. (c) Represents the loss function plot for the training of the YOLOv4_tiny network model. (d) Represents the loss function plot for the training of the yolov4_mobilenetv1 network model. The yolov4_mobilenetv1 means replacing YOLOv4’s backbone feature extraction network with MobileNetv1. (e) Represents the loss function plot for the training of the yolov4_mobilenetv2 network model. The yolov4_mobilenetv2 means replacing YOLOv4’s backbone feature extraction network with MobileNetv2. (f) Represents the loss function plot for the training of the yolov4_mobilenetv3 network model. The yolov4_mobilenetv3 means replacing YOLOv4’s backbone feature extraction network with MobileNetv3.
Figure 13. Time performance comparison. The vertical axis indicates the time taken to calculate the mAP value in seconds. The horizontal axis indicates the different devices.
Figure 14. FPS test results. The vertical axis indicates the FPS value of the model for real-time detection. The horizontal axis indicates the different devices.
Table 1. Configuration parameters for the robot controller. (The robot controller is the CPU of the robot and provides various interfaces to connect different external devices and handle the associated computing operations).
Name | Processor | System | Memory | GFLOPS
Manifold 2-G | NVIDIA Jetson TX2 | Ubuntu 18.04 | 8 GB 128 bit | 1260
Dell OptiPlex 3060 | Intel(R) Pentium(R) Gold G5400T CPU @ 3.10 GHz | Windows 10 | 16 GB 64 bit | 99.2
Table 2. YOLOv4 validity experiment results.
Algorithms | mAP (0.5) | Model Size/MB | Flops
YOLOv4 | 88.63% | 244 | 30.17 GB
SSD | 57.36% | 92.6 | 31.35 GB
YOLOv5x | 68.52% | 333 | 109 GB
Table 3. Data augmentation experiment results.
Algorithms | mAP (0.5) | mAP (0.75)
yolov4_RMLight_ADD | 86.84% | 50.91%
yolov4_RMLight | 81.35% | 44.80%
Table 4. The experimental results of improving the effectiveness of the enhanced feature extraction network.
Algorithms | mAP (0.5) | Model Size/MB | Parameter Quantity/Million
yolov4_RMLight_ADD | 86.84% | 42.50 | 10.80
yolov4_RMLight_ADD_0 | 70.31% | 145 | 38.38
Yolov4_RMLight_ADD_0 denotes the network model in which the enhanced feature extraction network is restored to that of the original YOLOv4.
Table 5. The experimental results for improving the effectiveness of the backbone feature extraction network.
Algorithms | mAP (0.5) | Model Size/MB | Flops
yolov4_RMLight_ADD | 86.84% | 42.50 | 3.53 GB
yolov4_mobilenetv1 | 79.46% | 51.10 | 5.27 GB
yolov4_mobilenetv2 | 79.68% | 46.60 | 4.08 GB
yolov4_mobilenetv3 | 79.68% | 46.60 | 3.80 GB
yolov4_resnet50 | 65.36% | 127 | 33.68 GB
yolov4_vgg16 | 66.93% | 90 | 23.94 GB
Yolov4_mobilenetv1 indicates that the backbone feature extraction network is replaced with mobilenetv1. Yolov4_mobilenetv2 indicates that the backbone feature extraction network is replaced with mobilenetv2. Yolov4_mobilenetv3 indicates that the backbone feature extraction network is replaced with mobilenetv3. Yolov4_resnet50 indicates that the backbone feature extraction network is replaced with Resnet50. Yolov4_vgg16 denotes replacing the backbone feature extraction network with VGG16.
Table 6. Accuracy results for different detection models.
Algorithms | Precision | Recall | mAP (0.5) | mAP (0.75) | F1
yolov4_RMLight_ADD | 88.89% | 87.12% | 86.84% | 50.91% | 88.00%
YOLOv4 | 88.63% | 89.48% | 89.07% | 49.20% | 89.40%
YOLOv4_tiny | 57.92% | 56.28% | 57.50% | 44.92% | 57.00%
EfficientDet-d0 | 78.83% | 53.27% | 57.85% | 51.99% | 56.20%
YOLOv5n | 75.72% | 63.37% | 66.48% | 52.07% | 67.20%
YOLOv5s | 75.88% | 65.47% | 68.20% | 52.35% | 69.00%
Table 7. Performance results for different detection models.
Algorithms | Model Size/MB | Parameter Quantity/Million | Flops
yolov4_RMLight_ADD | 42.50 | 10.80 | 3.53 G
YOLOv4 | 244 | 64.36 | 30.17 G
YOLOv4_tiny | 22.25 | 6.06 | 3.47 G
EfficientDet-d0 | 15 | 3.87 | 2.55 G
YOLOv5n | 6.95 | 1.88 | 2.33 G
YOLOv5s | 27.1 | 7.28 | 8.53 G
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
