Article

Lightweight Small-Tailed Han Sheep Facial Recognition Based on Improved SSD Algorithm

Department of Mechanical and Electrical Engineering, Inner Mongolia Agricultural University, Hohhot 010010, China
*
Author to whom correspondence should be addressed.
Agriculture 2024, 14(3), 468; https://doi.org/10.3390/agriculture14030468
Submission received: 16 January 2024 / Revised: 8 March 2024 / Accepted: 11 March 2024 / Published: 13 March 2024
(This article belongs to the Section Digital Agriculture)

Simple Summary

To achieve automated farming management, including the recording, tracking, and statistics of sheep, we harness deep learning technology for sheep face recognition and the further development of lightweight sheep face recognition models. Lightweighting steers neural networks toward smaller and faster models that maintain accuracy, so that image recognition tasks can be completed quickly even under hardware resource constraints. In this study, we improve the FPS of recognition to better adapt to video-stream recognition techniques.

Abstract

We propose a lightweight detection algorithm based on the Single Shot MultiBox Detector (SSD) to facilitate sheep management and realize sheep facial identification, taking a self-constructed dataset as the research object. First, the VGG16 backbone network of the SSD is replaced with MobileNetv3, a lightweight neural network, to create a much smaller hybrid model. Second, the ECA attention mechanism is incorporated at the backend of the 112² × 16 bottleneck layer. Finally, the SmoothL1 loss function is substituted with the BalancedL1 loss function. The optimized model’s size decreases significantly from the original SSD’s 132 MB to just 22.4 MB. It achieves a mean average precision of 83.47% and maintains an average frame rate of 68.53 frames per second. Compared to the basic SSD model, the mean average precision has increased by 3.25 percentage points, the model size has decreased by 109.6 MB, and the detection speed has improved by 9.55 frames per second. In comparative experiments using the same dataset with different object detection models, the proposed model outperforms the SSD, Faster R-CNN, Retinanet, and CenterNet in terms of mean average precision, with improvements of 3.25, 4.71, 2.38, and 8.13 percentage points, respectively. The detection speed has also improved significantly, increasing by 9.55, 58.55, 53.1, and 12.37 frames per second, respectively. The improved model presented in this paper significantly reduces the model’s size and computational requirements while maintaining excellent performance. This provides a valuable reference for the digitalization of animal husbandry and livestock farming.

1. Introduction

Extensive grasslands provide favorable conditions for the development of cattle and sheep livestock, with grazing being the primary method of animal husbandry [1,2,3]. In recent years, due to the pursuit of greater economic benefits, pastoralists have been blindly expanding the quantity of livestock, resulting in a gradual increase in the scale of livestock farming [4,5,6]. This expansion has hindered effective livestock management. The current pressing issue is to achieve automation, intelligence, informatization, and lightweight management in animal husbandry [7]. Therefore, the automatic recognition of individual animal identities has become indispensable [8,9,10].
Currently, traditional livestock individual identification methods include manual observation and invasive device technology. The manual observation method depends heavily on the observer’s experience and memory, leading to subjectivity, reduced accuracy, and requiring substantial labor effort. An invasive device primarily employs methods like ear tagging and radio frequency identification (RFID) tags [11]. While these methods are widely used for the individual identification of livestock and have low cost, they do come with drawbacks, such as compromising the integrity of sheep and being susceptible to tampering or replacement. Conventional approaches depend on equipment rather than the animals themselves, reducing the reliability of individual animal identification.
As deep learning technology advances, model structures have become increasingly complex, and the demand for training data and computing power has grown significantly. This has prompted the trend toward making smaller, lightweight neural networks. These lightweight models aim to maintain accuracy with smaller size and faster speed. Currently, a range of convolutional neural networks are employed in facial recognition applications. However, research focused on lightweight models started relatively late, and there is limited relevant research in this area [12,13,14,15].
In 2022, Nan et al. [16] introduced a compact MobileNet model. First, an attention module was integrated into the MobileNetv1 model to improve the extraction of facial expression-related local features, and dropout was added to prevent overfitting. The model parameters were then fine-tuned using center loss and softmax loss to minimize intra-class distance and maximize inter-class distance, yielding a recognition accuracy of 84.49%. In 2023, Raden Bimo Rizki Prayogo et al. [17] aimed to achieve accurate performance with limited computational resources. They validated the MobileFaceNet and SeesawFaceNet face recognition networks and utilized transfer learning to enhance network learning capabilities. The experimental results demonstrated that the MobileFaceNet model outperformed SeesawFaceNet, with an accuracy of 85% and an average processing speed of 44 milliseconds. In 2022, Ma et al. [18] presented a non-invasive method for individual pig recognition. To optimize the YOLOv4 model, they first substituted MobileNetv3 as the backbone network, then incorporated depthwise separable convolutions into the SPP and PANet feature extraction networks to reduce network parameters. Additionally, the CBAM (Convolutional Block Attention Module) mechanism, which combines channel and spatial attention, was integrated into PANet to balance network accuracy and model weight. The experimental outcomes demonstrated an impressive accuracy of 98.15%. In 2022, Li et al. [19] developed the MobileViTFace sheep face recognition algorithm by combining MobileNetv2 with the Vision Transformer. MobileViTFace utilized the Transformer architecture to improve the model’s capability to extract detailed features and reduce background interference, resulting in more effective differentiation of various sheep faces. The recognition accuracy reached 96.94%, a significant improvement of 9.79% over MobileNetv2. In 2023, Zhang et al. [20] substituted the feature extraction components in the backbone and neck of YOLOv5s with ShuffleNetv2 and Ghost modules, and introduced a Channel Attention (CA) module in the backbone. The experimental results showed that this enhanced algorithm achieved an mAP (mean average precision) of 97.8%. In 2023, Guo et al. [21] aimed to enhance the robustness of their model by transferring the parameters learned from the complex YOLOv5x to the lightweight YOLOv5s, enhancing the feature extraction capabilities of PANet and improving recognition accuracy. The experimental results revealed an mAP of 94.67%, which was 4.83% higher than that of YOLOv5s, while reducing inference time by 74.61 milliseconds compared to YOLOv5x. In summary, thanks to the collective efforts of researchers both domestically and internationally, the use of deep learning technology in facial recognition is currently thriving. For livestock, facial recognition for pigs and cows has become relatively mature, but there are fewer applications of facial recognition for sheep, especially using lightweight neural networks [22,23,24]. Additionally, there is a lack of large publicly available sheep facial recognition datasets. To address the resource constraints on farms, which typically have limited computational resources such as low-power devices or limited memory capacity, the model size is being reduced.
This reduction in size makes it easier to deploy the model on these resources without sacrificing performance. Another benefit of using smaller models is improvement in speed and efficiency. Smaller models require less computational power, leading to faster inference. This is particularly advantageous for time-sensitive agricultural applications like real-time monitoring or decision-making. Additionally, smaller model sizes facilitate data collection and management in challenging agricultural environments with limited connectivity. By reducing the amount of data needed for model updates or predictions, smaller models become easier to manage and transmit. Cost considerations also come into play. Smaller models may require less storage space and computational resources, which can be a significant cost-saving. In this study, a recognition model for sheep faces was used as an example. A lightweight neural network model was built using a dataset of 5371 clear facial photos from 114 sheep taken from different angles. The SSD algorithm formed the basis for sheep facial detection and subsequent lightweight algorithm enhancements, which were then verified.

2. Materials and Methods

2.1. Sheep Face Data Set Acquisition

Mongolian sheep, the most numerous and widely distributed sheep breed in China, include the subspecies Small-Tailed Han sheep. Small-Tailed Han sheep generally have a white or gray body with a white face. Some may exhibit black patches around their eyes, while a few ewes might have black and brown spots on their faces. Male Small-Tailed Han sheep have spiral-shaped horns, while females have small horns (some ewes may be hornless). The experimental dataset was gathered from farms owned by local farmers in Hake Town, Hulunbuir Grassland, Hailar District, Hulunbuir City, Inner Mongolia, China, located at approximately 49°13′ North latitude and 120°04′ East longitude. The region has a typical continental climate and an average elevation of 620 m. From March 2022 to August 2022, 114 Small-Tailed Han sheep were photographed at a livestock farm. The sheep had an average age of 2 years and an average weight of 45 kg; the group consisted of 102 ewes and 12 rams. The camera used was a Canon 700D (Canon, Tokyo, Japan) recording at 29.97 FPS. Each sheep was recorded for approximately 2–3 min. The recorded videos were numbered from 1 to 114, representing individual identity information, to facilitate the subsequent dataset creation.
To construct the dataset, images were extracted from the recorded videos using a Python program. Images that were blurry, heavily occluded, repeated, or underexposed were removed, resulting in 114 classes containing a total of 5371 facial images of Small-Tailed Han sheep captured from various angles. Two methods were employed for data augmentation to enhance the robustness of the detection model: (1) Various levels of brightness adjustment were applied to certain images, including a 1.5× increase or a 0.6× decrease in brightness, which helped mitigate the impact of lighting condition variations on the object detection model; (2) The images’ contrast was adjusted by a factor of 1.4 for an increase and 0.7 for a decrease, enhancing the depiction of sharpness, grayscale, and texture details. The original images were annotated using the Make Sense software (https://www.makesense.ai/), resulting in the creation of the dataset. Small-Tailed Han sheep were assigned ID numbers from “sheep1” to “sheep114”. The annotations were used to generate XML-format annotation files, which included information such as folder names, image names, image paths, image sizes, ID numbers, and pixel coordinates of label boxes. The training, test, and validation sets were randomly split in an 8:1:1 ratio. The effectiveness of the sheep facial annotations is depicted in Figure 1; the extracted images have a resolution of 1920 × 1080 pixels, and the different colored boxes in the figure represent different sheep.
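As an illustration of this pipeline, the sketch below shows how frame extraction and the brightness/contrast augmentation described above might look in Python; it assumes OpenCV and Pillow are available, and the paths, sampling interval, and function names are illustrative rather than the authors’ actual code.

```python
# A minimal sketch of frame extraction and brightness/contrast augmentation.
# Paths, the sampling interval, and names are illustrative assumptions.
import cv2
from PIL import Image, ImageEnhance

def extract_frames(video_path, out_dir, every_n=15):
    """Save every n-th frame of a recorded sheep video as a JPEG image."""
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()

def augment(image_path):
    """Brightness (1.5x / 0.6x) and contrast (1.4x / 0.7x) variants, as in the text."""
    img = Image.open(image_path)
    variants = []
    for factor in (1.5, 0.6):
        variants.append(ImageEnhance.Brightness(img).enhance(factor))
    for factor in (1.4, 0.7):
        variants.append(ImageEnhance.Contrast(img).enhance(factor))
    return variants
```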

2.2. SSD Model

The SSD (Single Shot MultiBox Detector) algorithm is a one-stage multibox prediction method [25]. After extracting features using convolutional neural networks, it uniformly and densely samples positions in the image at different scales and aspect ratios, performing object classification and prediction-box regression on these samples. The structural diagram is shown in Figure 2.

2.3. SSD Model Improvements

In this study, SSD served as the foundational network for recognizing sheep faces. To tackle the extensive parameters, model complexity, and long detection times associated with the SSD algorithm, an improved SSD model is proposed in this paper. First, the VGG (Visual Geometry Group) backbone network was replaced with the MobileNetv3 network. Second, considering the predominant white color and visual similarity among sheep, the ECA attention mechanism was implemented to emphasize crucial areas of sheep facial features, thereby boosting the model’s proficiency in extracting essential information. Finally, the BalancedL1 loss function replaced the SmoothL1 position loss to minimize the disparity between predictions and ground truth, thereby enhancing the model’s detection accuracy.

2.3.1. Backbone Network Replacement

To simplify the model structure and achieve a faster detection speed, this paper replaced the VGG backbone network in the original SSD algorithm with MobileNetv3.
MobileNetv3, proposed by Google, significantly improves detection speed while retaining strong feature extraction. The traditional 3 × 3 convolution structure is replaced with a depthwise separable convolution structure, reducing the model’s size [26]. MobileNetv3 is characterized by its special “bneck” structure, which incorporates depthwise separable convolutions and the nonlinear h-swish activation function, resulting in faster computation and greater suitability for lightweight applications. The bneck structure is illustrated in Figure 3.
MobileNetv3 is a compact convolutional neural network specifically designed for efficient inference on mobile and embedded devices. In MobileNetv3, the bneck (bottleneck module) is one of the key components.
The bneck module, also known as the inverted residual bottleneck block, is inspired by the bottleneck structure of ResNet, but inverted and adapted to fit the specific needs of mobile devices.
The main goal of the bneck block is to reduce the computational complexity while maintaining the model performance. It accomplishes this through two main technical improvements.
Inverted residuals: In the traditional ResNet architecture, the bottleneck module performs a dimensionality reduction (1 × 1 convolution) followed by an upscaling (3 × 3 convolution) operation. The bneck module in MobileNetv3, on the other hand, reverses this process: it first performs the upscaling (1 × 1 convolution), then a 3 × 3 depthwise separable convolution, and finally a 1 × 1 convolution to bring the number of channels back down. This inverted structure reduces information loss during dimensionality reduction and improves model performance.
Linear bottleneck: In the traditional ResNet structure, the upscaling and downscaling steps of the bottleneck module use nonlinear activation functions (e.g., ReLU), which may result in information loss and the accumulation of nonlinearities. In contrast, the final projection in the bneck module of MobileNetv3 employs a linear activation (with bias), which avoids the over-accumulation of nonlinear costs.
Overall, the bneck module in MobileNetv3 reduces the computational complexity and maintains the model performance by inverting the residuals and linear bottlenecks, which enables the network to efficiently perform tasks like image classification, object detection, etc., on mobile and embedded devices. A comparison of the residual and inverse residual structures is shown in Figure 4.
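The following is a simplified PyTorch sketch of such an inverted residual (bneck) block, written from the description above; the optional squeeze-and-excite stage of the full MobileNetv3 bneck is omitted for brevity, and the channel counts in the usage lines are illustrative.

```python
# Simplified bneck sketch: 1x1 expansion -> 3x3 depthwise conv -> 1x1 linear
# projection, with a residual connection when input and output shapes match.
import torch
import torch.nn as nn

class Bneck(nn.Module):
    def __init__(self, in_ch, exp_ch, out_ch, stride=1):
        super().__init__()
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # 1x1 expansion (inverted: channels go up first)
            nn.Conv2d(in_ch, exp_ch, 1, bias=False),
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(),
            # 3x3 depthwise convolution (groups == channels)
            nn.Conv2d(exp_ch, exp_ch, 3, stride, 1, groups=exp_ch, bias=False),
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(),
            # 1x1 projection back down: no activation (linear bottleneck)
            nn.Conv2d(exp_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

x = torch.randn(1, 16, 112, 112)   # e.g., the 112^2 x 16 bneck input
y = Bneck(16, 64, 16)(x)           # shape preserved, so the residual is used
```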
Depthwise separable convolution mainly includes two stages: DW (depthwise convolution) and PW (pointwise convolution). In the model, 3 × 3 depthwise separable convolutions replace traditional 3 × 3 convolution structures. Compared to traditional convolutions, this method reduces the number of parameters to approximately 1/9th [27]. Figure 5 illustrates the structure of depthwise separable convolutions.
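To make the roughly 1/9 figure concrete, consider a 3 × 3 kernel with 256 input and 256 output channels (an illustrative channel count, not one taken from the model):

\text{standard convolution: } 3 \times 3 \times 256 \times 256 = 589{,}824 \text{ parameters}

\text{depthwise + pointwise: } 3 \times 3 \times 256 + 256 \times 256 = 67{,}840 \text{ parameters}

\frac{67{,}840}{589{,}824} = \frac{1}{256} + \frac{1}{3^2} \approx \frac{1}{8.7} \approx \frac{1}{9}

In general, the ratio is \(1/C_{out} + 1/k^2\), which approaches 1/9 for a 3 × 3 kernel as the output channel count grows.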
This study replaced the SSD backbone network VGG with each of four lightweight backbone networks: MobileNetv2, MobileNetv3, ShuffleNetv1, and SqueezeNet. A comparison of these four lightweight backbone networks is provided in Table 1. Compared to the other models, MobileNetv3 excels in detection accuracy, model size, and detection speed. MobileNetv3 achieves a higher mAP than MobileNetv2, ShuffleNetv1, and SqueezeNet, by 2.24, 0.31, and 1.63 percentage points, respectively. In terms of model size, MobileNetv3 is smaller by 34.3 MB, 66 MB, and 13.5 MB, respectively, and its detection speed is faster by 1.79, 3.45, and 0.91 frames/s, respectively. Therefore, MobileNetv3 was adopted as the lightweight backbone network for this model.

2.3.2. Introduction of Attention Mechanisms

The ECA (Efficient Channel Attention) mechanism argues that the dimensionality reduction in the SE mechanism can lead to negative optimization of the attention prediction, and it therefore proposes an efficient cross-channel attention mechanism.
The ECA mechanism removes the fully connected layers of the SE mechanism and applies a 1 × 1 convolutional layer directly to the globally average-pooled features to obtain channel information. In the ECA mechanism, the input feature maps are enhanced with channel features before output. This design enables the ECA mechanism to reduce the number of model parameters and the computational complexity while preserving model performance, further realizing the design of lightweight networks [28].
In Figure 6, the left side illustrates the SE attention mechanism, while the right side demonstrates the ECA attention mechanism. The ECA attention mechanism operates by taking the feature map with dimensions H × W × C as input. It then undergoes spatial feature compression and global average pooling in the spatial dimension to produce a 1 × 1 × C feature map. Subsequently, channel feature learning is performed on this 1 × 1 × C feature map to understand the correlations between different channels through 1 × 1 convolutional learning associations, resulting in an output feature map of 1 × 1 × C. Finally, the 1 × 1 × C feature map is multiplied channel-by-channel with the input feature map H × W × C to generate the feature map with channel attention.
The SE attention mechanism uses fully connected layers to learn global channel relationships from the processed input feature maps. Switching to a 1 × 1 convolution, as in the ECA attention mechanism, can only learn local information between channels, so a dynamic convolution kernel is introduced to address this limitation.
The dynamic convolution kernel is designed with a convolution kernel adaptive function that dynamically adjusts the size of the convolution kernel according to the characteristics of the input data. For layers with a high number of channels, a larger convolution kernel is employed for performing 1 × 1 convolution operations. This leads to increased cross-channel interactions. Conversely, for layers with a lower number of channels, a smaller convolution kernel is used for 1 × 1 convolution operations, thereby reducing the number of cross-channel interactions.
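A minimal PyTorch sketch of an ECA module along these lines is shown below; the adaptive kernel-size rule follows the ECA-Net formulation (with γ = 2 and b = 1), and the channel count in the usage line is illustrative.

```python
# Minimal ECA sketch: global average pooling, a 1D convolution across the
# channel dimension with an adaptively sized kernel, then channel reweighting.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Dynamic kernel size: larger k for layers with more channels
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1            # force an odd kernel size
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                    # x: (N, C, H, W)
        y = self.pool(x)                     # (N, C, 1, 1) global average pooling
        y = y.squeeze(-1).transpose(1, 2)    # (N, 1, C): channels as a sequence
        y = self.conv(y)                     # local cross-channel interaction
        y = y.transpose(1, 2).unsqueeze(-1)  # back to (N, C, 1, 1)
        return x * self.sigmoid(y)           # channel-wise reweighting

attn = ECA(160)                              # e.g., the 7^2 x 160 bneck output
```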

2.3.3. BalancedL1 Loss Function

The loss function of the SSD comprises the smoothL1 location loss function for bounding boxes and the Softmax confidence loss function for object categories. This can be seen in Equation (1).
L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{1}
In the equation, N denotes the number of positive samples for the prior boxes. c represents the predicted values for category confidence, l is the predicted values for the location of the bounding boxes corresponding to the prior boxes, and g represents the location parameters of the Ground Truth.
This study enhances the model’s detection accuracy and reduces the error between model predictions and actual values by replacing the smoothL1 location loss function in SSD with the BalancedL1 loss function.
BalancedL1, inspired by the smoothL1 loss function, separates inliers from outliers by establishing an inflection point and limits the gradients produced by outliers to a maximum value of 1.0. In comparison to smoothL1, BalancedL1 substantially boosts the gradients of inlier points [29]. The fundamental concept of BalancedL1 is to enhance the essential regression gradients and balance the involved samples, resulting in more balanced training across classification, overall localization, and precise localization. The detection-box regression loss of BalancedL1 is given in Equation (2).
L_{loc} = \sum_{i \in \{x, y, w, h\}} L_b\left(t_i^u - v_i\right) \tag{2}
The gradient corresponding to it can be seen in Equation (3).
\frac{\partial L_{loc}}{\partial w} \propto \frac{\partial L_b}{\partial t_i^u} \propto \frac{\partial L_b}{\partial x} \tag{3}
A generalized gradient is formulated based on the above equations, as depicted in Equation (4).
\frac{\partial L_b}{\partial x} = \begin{cases} \alpha \ln(b|x| + 1), & \text{if } |x| < 1 \\ \gamma, & \text{otherwise} \end{cases} \tag{4}
In this equation, α controls the enhancement of inliers’ gradients; a relatively small α enhances the gradients of inliers while not affecting the values of outliers. γ is used to adjust the upper bound of regression errors, which can make different tasks more balanced. α and γ control the balance at both the sample and task levels, allowing for a more balanced training by adjusting these two parameters. The BalancedL1 loss function is represented in Equation (5), with the parameters meeting the conditions specified in Equation (6). The default parameter settings are α = 0.5 and γ = 1.5.
L_b(x) = \begin{cases} \frac{\alpha}{b}\left(b|x| + 1\right)\ln\left(b|x| + 1\right) - \alpha|x|, & \text{if } |x| < 1 \\ \gamma|x| + C, & \text{otherwise} \end{cases} \tag{5}

\alpha \ln(b + 1) = \gamma \tag{6}
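For concreteness, the following is a sketch of the BalancedL1 regression loss reconstructed from Equations (4)–(6), using the default α = 0.5 and γ = 1.5; it is an illustrative implementation, not the authors’ released code.

```python
# BalancedL1 sketch reconstructed from Eqs. (4)-(6); b follows from the
# continuity condition alpha * ln(b + 1) = gamma, so b = e^(gamma/alpha) - 1.
import math
import torch

def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5):
    x = torch.abs(pred - target)
    b = math.exp(gamma / alpha) - 1.0    # from Eq. (6): alpha * ln(b + 1) = gamma
    C = gamma / b - alpha                # constant joining the two branches at |x| = 1
    inlier = alpha / b * (b * x + 1) * torch.log(b * x + 1) - alpha * x
    outlier = gamma * x + C
    return torch.where(x < 1.0, inlier, outlier).mean()
```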

2.4. Experimental Platform

In this study, the experimental platform parameters used for model training are shown in Table 2.

2.5. Metrics

The main performance metrics used to evaluate object detection models include Precision, Recall, mAP, FPS, and model size. The IoU threshold for matching predicted targets with actual targets is set to 0.5. A prediction is labeled as a true positive if its IoU surpasses this threshold; otherwise, it is classified as a false positive.
Precision = \frac{TP}{TP + FP}

Recall = \frac{TP}{TP + FN}
In the formulas, TP denotes the count of true positive instances, FP represents the count of false positive instances, TN represents the count of true negative instances, and FN denotes the count of false negative instances.
Based on the training data, a Precision–Recall curve is plotted, resulting in:
AP = \int_0^1 P(R)\,\mathrm{d}R
AP denotes the area enclosed by the PR curve and the coordinate axes.
In this study, since there are 114 classes of Small-Tailed Han sheep to detect, the mAP for this experiment is calculated as follows:
mAP = \frac{AP_{sheep1} + AP_{sheep2} + \cdots + AP_{sheep114}}{114}
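Schematically, the per-class AP values can be averaged as in the equation above; the sketch below assumes each class’s precision–recall pairs have already been computed, and numpy’s trapezoidal rule stands in for the integral (real evaluations typically use an interpolated AP, as in PASCAL VOC).

```python
# Illustrative AP/mAP helpers; variable names are assumptions, not the
# authors' evaluation code.
import numpy as np

def average_precision(recall, precision):
    """Area under a precision-recall curve for one sheep ID."""
    order = np.argsort(recall)
    return np.trapz(np.asarray(precision)[order], np.asarray(recall)[order])

def mean_average_precision(ap_per_class):
    """mAP = mean of AP over the 114 sheep classes."""
    return sum(ap_per_class) / len(ap_per_class)
```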
Model parameters and computational speed are closely related. A model’s detection speed is evaluated primarily by comparing FPS, where a higher FPS indicates faster detection. The average detection time T is obtained by recording the time (in seconds) taken to detect each frame in code and averaging over all frames.
FPS = \frac{1}{T}
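A minimal timing loop matching this definition might look as follows, where `model` and `frames` are placeholders for the trained detector and the test images:

```python
# Average the per-frame detection time T over a clip and report 1 / T.
import time

def measure_fps(model, frames):
    total = 0.0
    for frame in frames:
        start = time.perf_counter()
        model(frame)                      # one forward pass = one detection
        total += time.perf_counter() - start
    avg_time = total / len(frames)        # average detection time T (seconds)
    return 1.0 / avg_time                 # FPS = 1 / T
```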

3. Results

3.1. Comparison of Experimental Results after Improvement of SSD

In order to assess the optimization effects of various improvement strategies on sheep face detection, research was conducted based on the SSD framework to investigate the effectiveness of the lightweight neural network MobileNetv3, the ECA attention mechanism, and the BalancedL1 loss function. The results of the experiments are shown in Table 3.
Replacing the original VGG backbone in the SSD model with MobileNetv3 resulted in a significant reduction in the model’s size and an increase in detection speed by 6.05 frames/s. However, it led to a slight decrease of 1.38 percentage points in mAP, indicating that the lightweight neural network MobileNetv3 effectively reduced the model size at the cost of a slight reduction in mAP. Introducing the ECA mechanism in the feature extraction network, specifically in the bottleneck layer with parameters 32 × 1024, resulted in an increase of 1.91 percentage points in mAP for the improved network model. The model size remained consistent, while the detection speed increased by 0.56 frames/s. This indicates the effectiveness of the ECA attention mechanism in improving the recognition of small-scale features in sheep faces. By replacing the original network’s smoothL1 loss function with BalancedL1 loss function, the average precision improved by 1.47 percentage points. The model’s detection speed also increased by 3.27 frames/s, suggesting that the BalancedL1 loss function, through enhanced gradients of inlier points, better aligns with real target boxes, resulting in improved detection accuracy and speed.
Given that attention mechanisms can be impacted by network configurations, the effectiveness of different attention mechanisms was evaluated using the SSD + v3 as the base network. Four attention modules, including CA, CBAM, SE, and ECA, were individually incorporated into the feature extraction network, with bottleneck layers of sizes 112² × 16 (front-end) and 7² × 160 (backend). It is evident that the sizes of the models remained relatively consistent across all four attention mechanisms. CBAM2, ECA1, and ECA2 attention modules exhibited improvements in detection speed, with increments of 0.75, 0.34, and 1.6 frames/s, respectively. On the contrary, CA1, CA2, SE1, SE2, and CBAM1 attention mechanisms led to decreased detection speeds by 8.28, 5.29, 1.91, 2.41, and 3.08 frames/s, respectively. Additionally, CA2, SE1, SE2, and CBAM2 attention modules had a negative impact on the model’s detection accuracy, resulting in a reduction of mAP by 1.73, 0.25, 0.43, and 0.38 percentage points, respectively. These modules were evidently less appropriate for tasks involving sheep facial recognition detection. In contrast, CA1, CBAM1, ECA1, and ECA2 attention mechanisms contributed to a slight improvement in the network’s mAP, enhancing it by 0.52, 0.4, 2.02, and 2.47 percentage points, respectively. Therefore, the ECA2 attention mechanism introduced in this study demonstrated superior performance when applied to this model.

3.2. Improved Module Performance Comparison

In this study, three improvement methods were proposed: MobileNetv3, ECA, and BalancedL1. To evaluate their effectiveness, ablation experiments were designed on our custom sheep face dataset as follows: starting with the original SSD network, each of the three enhancement methods was integrated individually to evaluate its impact on the algorithm, and the improved SSD algorithm was then stripped of each enhancement method in turn to assess its effect on performance, as detailed in Table 4. Table 4 highlights that the ECA2 attention mechanism module led to the most notable accuracy enhancement, with a 1.91-percentage-point increase in mAP compared to the original SSD algorithm; the model’s size remained relatively consistent, while the detection speed rose by 0.56 frames/s. The adoption of the lightweight neural network MobileNetv3 notably boosted detection speed, with an increase of 6.05 frames/s, and reduced the model’s volume by 109.9 MB, although the mAP decreased by 1.38 percentage points. Compared to the SSD-v3-ECA2-B algorithm, the removal of the ECA2 attention mechanism module had a significant impact on accuracy, decreasing the mAP by 2.53 percentage points, while the removal of MobileNetv3 had the most substantial impact on the model’s volume, increasing it by 109.6 MB, and reduced the detection speed by 4.47 frames/s. The final proposed SSD-v3-ECA2-B algorithm, compared to the original SSD algorithm, achieved a 3.25-percentage-point increase in mAP on the sheep face dataset, decreased the model’s volume by 109.6 MB, and improved the detection speed by 9.55 frames/s, ensuring both higher detection accuracy and efficiency.
This paper conducted comparative experiments to assess the final improved algorithm’s effectiveness using the sheep face dataset, depicted in Figure 7. When the original images were input into the model, the face of each sheep was detected and labeled with different colored bounding boxes. Prior to the improvement, a few instances of false positives, missed detections, and duplicate detections were observed on some sheep faces. After the improvement, the false positive rate decreased, and the model’s predictions had higher confidence, resulting in better recognition performance. The different colored boxes shown in the diagram represent different sheep.

3.3. Comparison of Results from Different Network Models

The algorithm developed in this paper was compared with four mainstream network models, namely SSD, Faster R-CNN, Retinanet, and CenterNet, using the self-built sheep face dataset in the same environment. The comparison results are summarized in Table 5 and Figure 8. Our proposed SSD-v3-ECA2-B algorithm demonstrated significant improvements in both detection speed and accuracy over the other models. It achieved a mean average precision of 83.47%, a detection speed of 68.53 frames per second, and reduced the model size from 132 MB to 22.4 MB. Compared to SSD, Faster R-CNN, Retinanet, and CenterNet, it improved the mean average precision by 3.25, 4.71, 2.38, and 8.13 percentage points, respectively, and increased detection speed by 9.55, 58.55, 53.1, and 12.37 frames/s, respectively. In summary, the SSD-v3-ECA2-B algorithm proposed in this paper shows notable benefits in both detection accuracy and efficiency.
This study introduces a streamlined sheep facial detection algorithm, SSD-v3-ECA2-B, derived from the enhanced SSD framework, providing a novel method for individual sheep identification. The algorithm replaces the SSD backbone VGG with MobileNetv3, significantly reducing the model’s size, and enhances the detection and recognition of small sheep facial targets using an attention mechanism. Furthermore, it replaces SSD’s smoothL1 loss function with BalancedL1, whose enhanced gradients for inliers improve the matching of predicted boxes to actual target boxes, resulting in improved detection speed and accuracy.
The experimental findings demonstrate effective detection of sheep face identity using the proposed methods, achieving a mean average precision of 83.47%. The mean average precision improved over SSD, Faster R-CNN, Retinanet, and CenterNet by 3.25, 4.71, 2.38, and 8.13 percentage points, respectively, and the detection speed increased by 9.55, 58.55, 53.1, and 12.37 frames/s, respectively. When applied to sheep facial identity recognition, the algorithm notably reduces model size, enhances detection accuracy, and improves detection efficiency. It holds substantial practical value and can serve as a valuable reference for future sheep facial identity recognition work.

3.4. Comparison with State-of-the-Art Models

To assess the performance of the proposed SSD-v3-ECA2-B for sheep face recognition, we compared it with models from prior research, namely YOLOv4, YOLOv3, and YOLOv7-tiny. The results of this comparison are shown in Table 6. It is evident from Table 6 that SSD-v3-ECA2-B trails the other models in terms of mAP, by 10.89%, 7.15%, and 6.82%, respectively. However, it outperformed the other models in model size and frames per second (FPS): the model size was reduced by 91.1%, 63.6%, and 3.1%, respectively, and the FPS was improved by 73.7%, 87.3%, and 9.5%, respectively. These improvements make SSD-v3-ECA2-B more suitable for mobile applications. The comparison results demonstrate the significant advantages of SSD-v3-ECA2-B in terms of model size and detection speed.

4. Discussion

This study introduces a lightweight model, SSD-v3-ECA2-B, specifically designed for sheep face recognition. The SSD-v3-ECA2-B model excels in detection speed. Results from experiments on a custom sheep face dataset show that the model attains an 83.47% mean average precision (mAP) while its volume shrinks from 132 MB to 22.4 MB, and its detection speed reaches 68.53 frames/s. This provides a basis for deploying video-stream recognition on mobile devices.
Due to the limited experimental conditions, at this stage we have developed a basic device for the automatic acquisition of sheep face pictures, shown in Figure 9. It consists of a sheep fixation device and a sheep channel. In future research, we will optimize this device to compensate for the limitations of manual capture, aiming to make it as cost-effective and efficient as possible.
Since only one breed, the Small-Tailed Han sheep from a specific region, was photographed, we plan to enhance the sheep face dataset by including facial images of different sheep breeds, diversifying the dataset for improved representation in the future [33].
In the long run, there is a pressing need to develop an embedded device that makes sheep face recognition practical for herders’ daily operations. In terms of the value of implementing such technology on farms, our work aims to provide several benefits to farmers. By utilizing deep learning and improving FPS, our technology enables more efficient monitoring and analysis of farm activities. This can help farmers in various ways, such as:
  • Enhanced intelligence: The improved FPS allows for real-time analysis of video feeds, enabling quicker detection of and response to potential issues on the farm. This can help reduce manual labor and increase overall operational efficiency [34].
  • Early disease detection: With faster and more accurate analysis of video data, our technology can assist in the early detection of diseases or abnormalities in crops or livestock. This can help farmers take timely preventive measures, minimizing losses and improving yield [35].
  • Precision farming: The application of deep learning technology enables precise monitoring of individual plants or animals, allowing for targeted interventions. This optimizes resource utilization, improving sustainability and cost-effectiveness. It is therefore important to make the model as small as possible to facilitate deployment on embedded devices. Our future research direction is to provide an accurate and efficient video-stream recognition device for herders [36].

5. Conclusions

In this study, we apply deep learning techniques to sheep face recognition and detection, and we propose an improved lightweight sheep face detection model based on SSD.
The backbone network VGG16 in SSD was replaced with lightweight modules, comparing SqueezeNet, ShuffleNet, MobileNetv2, and MobileNetv3, to reduce the model size. The ECA attention module is introduced in the backbone to select critical information and suppress non-critical information, and the smoothL1 loss function is replaced with the BalancedL1 loss function to improve small-target detection, thus improving model performance.
The results show that the SSD-v3-ECA2-B model has the best recognition effect, reaching an mAP of 83.47% on the sheep face dataset. This recognition method, being more favorable to animal welfare, avoids the harm to individual animals caused by traditional identification methods. The model size of SSD-v3-ECA2-B is only 22.4 MB, and the detection speed is 68.53 frames/s.
The research results provide technical support for the development of animal identification technology and mobile sheep face identification systems, and they hold important practical and reference value for future lightweight sheep face identification.

Author Contributions

Conceptualization, Q.S. and M.H.; methodology, Q.S. and M.H.; software, Q.S.; validation, Q.S., M.Z. and S.S.; formal analysis, M.H. and C.X.; investigation, X.Z. and M.Z.; resources, C.X.; data curation, Q.S., X.Z. and S.S.; writing—original draft preparation, Q.S. and M.H.; writing—review and editing, Q.S. and M.H.; visualization, Q.S.; supervision, M.H. and C.X.; project administration, C.X.; funding, M.H. and C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the project from Basic Research Operating Expenses of Colleges and Universities directly under Inner Mongolia (grant No. BR221032).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to confidentiality.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, G.; Ma, C. Problems and countermeasures of grassland resources in China. Shanxi Agric. Econ. 2020, 84+86. [Google Scholar] [CrossRef]
  2. Li, W.; Ali, S.; Zhang, Q. Property rights and grassland degradation: A study of the Xilingol pasture, Inner Mongolia, China. J. Environ. Manag. 2007, 85, 461–470. [Google Scholar]
  3. La, M.; Tsaijang, Y. Research on animal image recognition in animal husbandry based on convolutional neural network. Software 2020, 41, 43–45. [Google Scholar]
  4. Alam, N.; Zhao, Y.; Koubâa, A.; Wu, L.; Khan, R.; Abdalla, F. Automated sheep facial expression classification using deep transfer learning. Comput. Electron. Agric. 2020, 175, 105528. [Google Scholar]
  5. Drolma, N. Strict grassland law enforcement supervision and protection of grassland resources in Yushu. China Anim. Husb. Vet. Dig. 2017, 33, 12. [Google Scholar]
  6. Yu, L.; Chen, Y.; Sun, W.; Huang, Y. Effects of grazing exclusion on soil carbon dynamics in alpine grasslands of the Tibetan Plateau. Geoderma 2019, 353, 133–143. [Google Scholar] [CrossRef]
  7. Zhang, X.; Xuan, C.; Ma, Y.; Su, H.; Zhang, M. Biometric facial identification using attention module optimized YOLOv4 for sheep. Comput. Electron. Agric. 2022, 203, 107452. [Google Scholar] [CrossRef]
  8. Benke, K.K.; Sheth, F.; Betteridge, K.; Pettit, C.J.; Aurambout, J.P. Application of geovisual analytics to modelling the movements of ruminants in the rural landscape using satellite tracking data. Int. J. Digit. Earth 2015, 8, 579–593. [Google Scholar] [CrossRef]
  9. Han, D. Research on Identification Method for Detecting Grazing Behavior of Grassland Grazing Sheep; Inner Mongolia Agricultural University: Hohhot, China, 2018. [Google Scholar]
  10. Wei, B. Sheep Face Detection and Recognition Based on Deep Learning; Northwest Agriculture and Forestry University: Xianyang, China, 2020. [Google Scholar] [CrossRef]
  11. Tian, F.; Li, J.; Li, F.; Han, Y.; Wang, Q. Progress in the determination of feed intake in ruminants. Chin. J. Anim. Husb. 2006, 62. [Google Scholar]
  12. Zhang, X.; Xuan, C.; Xue, J.; Chen, B.; Ma, Y. LSR-YOLO: A High-Precision, Lightweight Model for Sheep Face Recognition on the Mobile End. Animals 2023, 13, 1824. [Google Scholar] [CrossRef]
  13. Tian, F.; Li, F.; Li, J.; Han, Y. Design and technological research on feed intake tester for dairy cows. J. Instrum. 2007, 28, 293–297. [Google Scholar]
  14. Sun, Z.; Zhou, D.; Lou, Y. Grazing behavior of velvet goats in artificial goat grassland in Songnen Plain. China Grassl. J. 2011, 33, 72–76. [Google Scholar]
  15. Zhang, X.; Xuan, C.; Ma, Y.; Su, H. A high-precision facial recognition method for small-tailed Han sheep based on an optimised Vision Transformer. Animal 2023, 17, 100886. [Google Scholar] [CrossRef]
  16. Nan, Y.; Ju, J.; Hua, Q.; Zhang, H.; Wang, B. A-MobileNet: An approach of facial expression recognition. Alex. Eng. J. 2022, 61, 4435–4444. [Google Scholar] [CrossRef]
  17. Prayogo, R.B.R.; Suciati, N.; Hidayati, S.C. Masked face recognition on mobile devices using deep learning. AIP Conf. Proc. 2023, 2508, 020017. [Google Scholar]
  18. Ma, C.; Deng, M.; Yin, Y. Pig face recognition based on improved YOLOv4 lightweight neural network. Inf. Process. Agric. 2023; in press. [Google Scholar] [CrossRef]
  19. Li, X.; Du, J.; Yang, J.; Li, S. When Mobilenetv2 Meets Transformer: A Balanced Sheep Face Recognition Model. Agriculture 2022, 12, 1126. [Google Scholar] [CrossRef]
  20. Gonzales Barron, U.; Corkery, G.; Barry, B.; Butler, F.; McDonnell, K.; Ward, S. Assessment of retinal recognition technology as a biometric method for sheep identification. Comput. Electron. Agric. 2008, 60, 156–166. [Google Scholar] [CrossRef]
  21. Guo, Y.; Yu, Z.; Hou, Z.; Zhang, W.; Qi, G. Sheep face image dataset and DT-YOLOv5s for sheep breed recognition. Comput. Electron. Agric. 2023, 211, 108027. [Google Scholar] [CrossRef]
  22. Pahl, C.; Hartung, E.; Grothmann, A.; Mahlkow-Nerge, K.; Haeussermann, A. Suitability of feeding and chewing time for estimation of feed intake in dairy cows. Animal 2015, 10, 1507–1512. [Google Scholar] [CrossRef] [PubMed]
  23. Braun, U.; Tschoner, T.; Hässig, M. Evaluation of eating and rumination behaviour using a noseband pressure sensor in cows during the peripartum period. BMC Vet. Res. 2014, 10, 195. [Google Scholar] [CrossRef]
  24. Zhang, C.; Zhang, H.; Tian, F.; Zhou, Y.; Zhao, S.; Du, X. Research on sheep face recognition algorithm based on improved AlexNet model. J. Neural Comput. Appl. 2023, 35, 24971–24979. [Google Scholar] [CrossRef]
  25. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector; Springer International Publishing: Berlin/Heidelberg, Germany, 2015. [Google Scholar] [CrossRef]
  26. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar]
  27. Yu, Z.; Fang, H.; Zhangjin, Q.; Mi, C.; Feng, X.; He, Y. Hyperspectral imaging technology combined with deep learning for hybrid okra seed identification. Biosyst. Eng. 2021, 212, 46–61. [Google Scholar] [CrossRef]
  28. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE Press: Piscataway, NJ, USA, 2020; pp. 11531–11539. [Google Scholar]
  29. Lu, H.; Lei, Y.; Wang, J.; Xing, X.; Yang, J. Transmission line insulator identification based on improved Libra-RCNN. Hunan Electr. Power 2022, 42, 44–49. [Google Scholar]
  30. Yang, S.; Liu, Y.; Wang, Z.; Han, Y.; Wang, Y.; Lan, X. Improved YOLO V4 model based on fused coordinate information to recognize cow face. J. Agric. Eng. 2021, 37, 129–135. [Google Scholar]
  31. Song, S.; Liu, T.; Wang, H.; Hasi, B.; Yuan, C.; Gao, F.; Shi, H. Using Pruning-Based YOLOv3 Deep Learning Algorithm for Accurate Detection of Sheep Face. Animals 2022, 12, 1465. [Google Scholar] [CrossRef] [PubMed]
  32. Qi, Y.; Jiao, J.; Bao, T.; Wang, C.; Du, X. Cow face detection algorithm in complex scenes based on adaptive attention mechanism. J. Agric. Eng. 2023, 39, 173–183. [Google Scholar]
  33. Pang, Y.; Yu, W.; Zhang, Y.; Xuan, C.; Wu, P. Sheep face recognition and classification based on an improved MobilenetV2 neural network. Int. J. Adv. Robot. Syst. 2023, 20, 17298806231152969. [Google Scholar] [CrossRef]
  34. Huang, Z.; Xu, A.; Zhou, S.; Ye, J.; Weng, X.; Xiang, Y. A key point detection method for pig face by integrating reparameterization and attention mechanism. J. Agric. Eng. 2023, 39, 141–149. [Google Scholar]
  35. Wang, H. Research on the Diagnosis Method of Dairy Cattle Disease by Integrating Knowledge Graph and Deep Learning; Northeast Agricultural University: Harbin, China, 2023. [Google Scholar]
  36. Kong, L. Research on Animal Target Recognition and Tracking Method Based on Deep Learning; University of Electronic Science and Technology: Chengdu, China, 2023. [Google Scholar] [CrossRef]
Figure 1. Sheep face labeling map.
Figure 2. SSD Structure Diagram.
Figure 3. Bneck structure diagram.
Figure 4. Residual and Inverse Residual Structures.
Figure 5. Depthwise Separable Convolution Structure Diagram.
Figure 6. SE attention mechanism and ECA attention mechanism modules.
Figure 7. Comparison of SSD-v3-ECA2-B and SSD model detection effects.
Figure 8. Results of different models.
Figure 9. The sheep facial image acquisition device.
Table 1. Comparison table of four lightweight backbone networks.

Algorithm Model | Mean Average Precision mAP/% | Model Size/MB | FPS/(Frames·s−1)
SSD + MobileNetv2 | 76.60 | 56.4 | 63.24
SSD + MobileNetv3 | 78.84 | 22.1 | 65.03
SSD + ShuffleNetv1 | 78.53 | 88.1 | 61.58
SSD + SqueezeNet | 77.21 | 35.6 | 64.12
Table 2. Parameters of the experimental platform.

Configuration | Specification
OS | Ubuntu 20.04
CPU | Xeon(R) Platinum 8350C
GPU | RTX 3090
Application Software Package | Python 3.8 and PyTorch 1.10
Table 3. Comparison of experimental results after improvement of SSD.

Algorithm Model | Mean Average Precision mAP/% | Model Size/MB | FPS/(Frames·s−1)
SSD | 80.22 | 132 | 58.98
SSD + v3 | 78.84 | 22.1 | 65.03
SSD + ECA | 82.13 | 132 | 59.54
SSD + B | 81.69 | 132 | 62.13
SSD + v3 + CA1 | 79.36 | 22.5 | 56.75
SSD + v3 + CA2 | 77.11 | 22.5 | 59.47
SSD + v3 + SE1 | 78.59 | 22.4 | 63.12
SSD + v3 + SE2 | 78.41 | 22.4 | 62.62
SSD + v3 + CBAM1 | 79.24 | 22.6 | 61.95
SSD + v3 + CBAM2 | 78.46 | 22.6 | 65.78
SSD + v3 + ECA1 | 80.86 | 22.4 | 65.37
SSD + v3 + ECA2 | 81.31 | 22.4 | 66.63
Note: v3 = MobileNetv3, B = BalancedL1, attention mechanism 1 = the 112² × 16 bottleneck layer backend, attention mechanism 2 = the 7² × 160 bottleneck layer backend.
Table 4. Results of ablation experiment.

v3 | B | ECA2 | mAP/% | Model Size/MB | FPS/(Frames·s−1)
– | – | – | 80.22 | 132 | 58.98
√ | – | – | 78.84 | 22.1 | 65.03
– | √ | – | 81.69 | 132 | 62.13
– | – | √ | 82.13 | 132 | 59.54
– | √ | √ | 82.47 | 132 | 62.16
√ | – | √ | 81.31 | 22.4 | 66.63
√ | √ | – | 80.94 | 22.1 | 69.41
√ | √ | √ | 83.47 | 22.4 | 68.53
Note: “√” indicates the introduction of modification methods.
Table 5. Results of different detection models.

Algorithm Model | mAP/% | Model Size/MB | FPS/(Frames·s−1)
SSD | 80.22 | 132 | 58.98
Faster R-CNN | 78.76 | 108 | 9.98
Retinanet | 81.09 | 145 | 15.43
CenterNet | 75.34 | 124 | 56.16
SSD-v3-ECA2-B | 83.47 | 22.4 | 68.53
Table 6. Comparison of the research results of other sheep face recognition.

Model | mAP/% | Model Size/MB | FPS/(Frames·s−1)
Yang et al. (2021) YOLOv4 [30] | 93.68 | 253 | 18
Song et al. (2022) YOLOv3 [31] | 89.9 | 61.5 | 8.7
Qi et al. (2023) YOLOv7-tiny [32] | 89.58 | 23.1 | 62
SSD-v3-ECA2-B | 83.47 | 22.4 | 68.53
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Hao, M.; Sun, Q.; Xuan, C.; Zhang, X.; Zhao, M.; Song, S. Lightweight Small-Tailed Han Sheep Facial Recognition Based on Improved SSD Algorithm. Agriculture 2024, 14, 468. https://doi.org/10.3390/agriculture14030468