Article

Potted Phalaenopsis Grading: Precise Bloom and Bud Counting with the PA-YOLO Algorithm and Multiviewpoint Imaging

1 College of Electronic Engineering, South China Agricultural University, Guangzhou 510642, China
2 Environmental Horticulture Research Institute, Guangdong Academy of Agricultural Sciences, Guangzhou 510640, China
3 Guangdong Provincial Key Lab of Ornamental Plant Germplasm Innovation and Utilization, Guangzhou 510640, China
4 College of Horticulture, South China Agricultural University, Guangzhou 510642, China
5 College of Engineering, South China Agricultural University, Guangzhou 510642, China
6 Key Laboratory of Key Technology on Agricultural Machine and Equipment, Ministry of Education, South China Agricultural University, Guangzhou 510642, China
* Author to whom correspondence should be addressed.
Agronomy 2024, 14(1), 115; https://doi.org/10.3390/agronomy14010115
Submission received: 17 November 2023 / Revised: 10 December 2023 / Accepted: 21 December 2023 / Published: 2 January 2024
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

The accurate detection and counting of flowers ensure the grading quality of the ornamental plants. In automated potted flower grading scenarios, low detection precision, occlusions and overlaps impact counting accuracy. This study proposed a counting method combining a deep learning algorithm with multiple viewpoints. Firstly, a flower detection model, PA-YOLO, was developed based on YOLOv5 by designing a two-scale detection branch, optimizing the number of bottlenecks and integrating a dynamic head framework. Next, PA-YOLO was used to detect grouped 360-viewpoint images of each potted plant to determine the optimal number of viewpoints for counting. The detection results indicated that PA-YOLO achieved a mean average precision (mAP) of 95.4% and an average precision (AP) of 91.9% for occluded blooms on our Phalaenopsis flower dataset. For the optimal number of viewpoints, the average counting accuracy of buds and blooms was highest at three viewpoints, with scores of 96.25% and 93.33%, respectively. The final counting accuracy reached 95.56% in flower counting tests conducted from three viewpoints. The overall results suggest that the proposed method can effectively detect and count flowers in complex occlusion and overlap environments, providing guidance for designing and implementing the vision component in an automated potted flower grading system.

1. Introduction

Phalaenopsis is the most important orchid crop globally, with a market share of 79% of all orchids sold [1]. In total, the Asian market for Phalaenopsis is estimated to be 84 million pots annually, with China accounting for 60 million, Japan for 14 million and Southeast Asia for 10 million [2]. Prior to being sold on the market, potted flowers must be graded to ensure consistency in quality. Counting the number of flowers is an important step in quality classification. However, this number is usually obtained by manual counting, which is time-consuming, laborious and difficult to keep consistent. Consequently, it is essential to develop alternative methods for efficiently and nondestructively counting flowers in order to estimate the quality of potted Phalaenopsis.
Machine vision is one feasible solution, and machine vision technology is being applied ever more widely in agriculture [3,4]. Quantifying the number of flowers by machine vision usually requires two steps: detecting where the flowers are on the plant and counting how many there are. To achieve accurate flower counting, it is essential to first accomplish precise flower detection. Early research on flower detection was mainly based on clustering segmentation, threshold segmentation or traditional machine learning. In regard to clustering segmentation, Aleya and Samanta [5] segregated flowers from backgrounds with a k-means clustering algorithm and identified damaged flowers based on the histogram distribution of flowers. In regard to threshold segmentation, Aggelopoulou et al. [6] positioned a black backdrop behind apple trees to acquire images at a blooming stage and extracted flower data depending on the binary image threshold. Horton et al. [7] employed contrast stretching of three color bands to augment multispectral images and color thresholding to detect peach flowers. For machine learning, Wang et al. [8] utilized a speeded-up robust features (SURF) algorithm to extract flower features and support vector machine (SVM) classification to further segment mango flowers. These traditional detection methods rely heavily on biological features, such as colors and textures, and artificial features to extract information; however, the generalization ability of these methods drops significantly when the application domain or environment changes.
Since Krizhevsky et al. [9] proposed AlexNet in 2012 and achieved the best classification performance on the ImageNet dataset, researchers have made rapid progress in the development of deep neural networks. Target detectors such as Faster R-CNN [10] and the YOLO series [11,12,13] have been successively proposed with outstanding results. As a result, deep learning approaches have been increasingly and widely used in agricultural fields [4,14,15], providing new ideas for automatic flower detection with powerful feature learning capabilities. Jiang et al. [16] trained a Faster R-CNN model to detect cotton flowers, with a detection precision of 86% on a five-class dataset. In the same year, Wu et al. [17] proposed a lightweight network for apple flower detection by simplifying the original YOLOv4 model using a channel pruning algorithm. The mean average precision (mAP) of detection was up to 97.31%. Research on apple flower detection was also conducted by Tian et al. [18], using MASU R-CNN (an improved Mask R-CNN with a backbone of U-Net) with a precision of 96.43%, and by Shang et al. [19], using an improved YOLOv5s model with a mAP of 91.80%. In addition, Qi et al. [20] performed tea chrysanthemum detection using a new architecture, TC-YOLO, and achieved an average precision (AP) of 92.49%. Similar studies have been reported with grape flowers [21], tomato flowers [22], lychee flowers [23] and other flowers [24,25]. The existing studies on flower detection using deep neural networks have achieved good precision. However, these studies primarily focused on detecting dense and numerous flowers on various trees, including fruit trees and tea trees. The counting results based on these detections were typically used for flower thinning [26,27] or fruit yield prediction [6], which allowed for a certain level of tolerance for incorrect detections. In contrast, the number of flowers in ornamental plants can directly impact grading results and errors in quantity can lead to misjudgments in levels. Consequently, quantifying the number of flowers in ornamental plants requires high-precision detection with a low tolerance for incorrect detections, which mainly result from occlusions and overlaps between flowers. In other words, an excellent detector with a robustness to occlusions and overlaps is needed for flower counting.
Regarding the detection of potted flowers, Chang et al. [28] developed an automatic grading system for potted Phalaenopsis based on an improved YOLOv3 model. For this system, the mAP of detection reached 82%. Subsequently, Wang et al. [29] employed YOLOv4-Tiny and achieved a mAP of 89.72% on a dataset consisting of Poinsettia and Cyclamen images. Clearly, there is room for further improvement in detection precision. In addition, most of the aforementioned deep learning methods were limited to a fixed single viewpoint, and occlusions and overlaps were not fully considered, which is not conducive to accurate flower counting. To address this problem, Houtman et al. [30] proposed a non-deep-learning method based on multiple hypothesis tracking (MHT) to count potted Phalaenopsis flowers. This method introduced multiple viewpoints and reached a maximum counting accuracy of 92% with a margin of one flower. Despite the improvement, the counting accuracy still fell short of the requirements for the precise grading of potted Phalaenopsis.
Therefore, in response to low detection precision, as well as occlusion and overlap issues in regard to counting, a novel counting method based on an effective target detector and multiple viewpoints was proposed. In summary, the objective of this study was to achieve accurate flower detection and counting of potted Phalaenopsis plants with occlusions and overlaps.

2. Materials and Methods

To fulfill the requirements of the detection task illustrated in Figure 1, our work was systematically divided into three parts: the preparation of image data, flower detection and flower counting. This sequence also dictates the arrangement of the subsections in Section 2 of this paper.

2.1. Image Acquisition

2.1.1. Multiviewpoint Imaging System

To address the occlusions and overlaps between flowers in two-dimensional (2D) images, it was necessary to enhance the model’s ability to detect occluded flowers at the algorithmic level and to obtain multiple viewpoints that can comprehensively display all blooms and buds of a potted Phalaenopsis plant, particularly in cases with heavy occlusions and overlaps. To this end, a multiviewpoint imaging system was first designed to collect potted Phalaenopsis images from multiple viewpoints.
As depicted in Figure 2a, the multiviewpoint imaging system comprised a rotation stage, a control unit, an industrial camera, camera support and a PC. The rotation stage used an RTS1410 model manufactured by Colibri (Shenzhen, China) and the corresponding control unit was provided to control the rotation angle and the rotation speed of the stage. The rotation angle was determined by the number of pulse signals transmitted by the control unit and its speed was controlled by the transmitting frequency of the pulse signals. The industrial camera was an A3504CG100 model manufactured by IRAYPLE (Hangzhou, China) with a resolution of 2592 × 1944 pixels. The PC was a Dell G3 3579 laptop (Dell Computer Corporation, Round Rock, TX, USA). Image capture was performed using the matching software MV Viewer v2.3.2_Build20220311.
The workflow of the multiviewpoint imaging system was as follows. Firstly, a sample was placed on the rotation stage, which was then set in rotation. After rotating to a certain angle, the stage remained stationary for a set period of time. During this period, the control unit sent a capture command to the PC via the serial port. Upon receiving the command, the PC triggered the camera to capture an image of the sample, which the camera then transmitted back to the PC. The stage then resumed rotation and the system repeated the capture process until a stop command was issued. In this manner, multiviewpoint images of the sample were acquired.
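As a concrete illustration of this workflow, the sketch below shows a minimal capture loop in Python. It is not the authors' control software: the serial port name, baud rate and "CAP" command bytes are invented for illustration, and OpenCV's VideoCapture stands in for the IRAYPLE camera SDK.

```python
import os
import cv2      # stands in for the vendor camera SDK used in the actual system
import serial   # pyserial, for the serial link to the rotation-stage control unit

CAPTURE_CMD = b"CAP"   # hypothetical command bytes sent by the control unit

def acquire_viewpoints(port="/dev/ttyUSB0", n_views=360, out_dir="views"):
    """Wait for capture commands from the control unit and save one image per command."""
    os.makedirs(out_dir, exist_ok=True)
    ctrl = serial.Serial(port, baudrate=9600, timeout=10)
    camera = cv2.VideoCapture(0)
    captured = 0
    while captured < n_views:
        if CAPTURE_CMD in ctrl.readline():      # stage has paused; a capture was requested
            ok, frame = camera.read()
            if ok:
                cv2.imwrite(f"{out_dir}/view_{captured:03d}.png", frame)
                captured += 1
    camera.release()
    ctrl.close()
```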

2.1.2. Multiviewpoint Image Acquisition

With the help of the multiviewpoint imaging system, image acquisition tests of 33 potted Phalaenopsis samples were conducted at the South China Agricultural University in Guangdong Province in March 2023. The actual working scene of the multiviewpoint imaging system is shown in Figure 2b. The samples, produced by the Environmental Horticulture Research Institute, Guangdong Academy of Agricultural Sciences, were of the “Big Chili” variety, with a cultivation period of 24 months. Due to the variable growth directions, inconsistent flowering angles, irregular shapes, and large differences in the sizes of blooms and buds of Phalaenopsis plants, occlusions and overlaps are common occurrences.
During the tests, the camera was mounted at a height of 0.5 m based on the average height of the samples and the working distance was set to 1.3 m to capture images from all angles within a fixed field of view. The rotation speed of the stage was set to 0.5° per second. For every degree of rotation of the stage, the system captured one image. In total, 360 viewpoint images were collected for each of the 33 samples, resulting in 11,880 images. A subset of 10 samples, comprising a total of 3600 images, was used to train a flower detector. Figure 3a illustrates a collection of images from six viewpoints for one of the ten samples. The remaining 23 samples, consisting of 8280 images, were employed to investigate the optimal number of viewpoints for counting.

2.2. Flower Detection

2.2.1. Dataset Preparation

After obtaining the multiviewpoint images, given the high similarity between viewpoint images spaced one degree apart, one image was randomly selected from every three viewpoint images. Finally, 1200 images were selected from a total of 3600 across the 10 samples to create a dataset for training. The images were annotated using LabelImg, an image annotation tool, with the position and the class of targets annotated according to the YOLO format. Minimum bounding rectangles were used to label targets in the images, as illustrated in Figure 3b. Three classes were established based on the flowering state and the need for occlusion optimization: bud, normal bloom and occluded bloom (OB). The determination of whether a bloom was occluded was based on whether the degree of occlusion exceeded 10%. Some examples of occluded blooms are shown in Figure 3c. Meanwhile, a pair of overlapping blooms was only annotated as a single normal bloom, as depicted in Figure 3d. Ultimately, we obtained a three-class dataset comprising 1200 multiviewpoint images of the 10 potted Phalaenopsis plants. The dataset was split into training, validation and test sets. The statistical results for each set are reported in Table 1.
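For reference, a YOLO-format label file contains one line per box in the form `class x_center y_center width height`, with coordinates normalized to the image size. The snippet below is a minimal sketch for reading such a file back into pixel coordinates; the class-index order (0 = bud, 1 = normal bloom, 2 = occluded bloom) is an assumption, as it depends on the class list configured in LabelImg.

```python
from pathlib import Path

# Assumed class-index mapping; the actual order depends on the LabelImg class file.
CLASSES = {0: "bud", 1: "normal bloom", 2: "occluded bloom"}

def read_yolo_labels(label_path: str, img_w: int = 2592, img_h: int = 1944):
    """Parse one YOLO-format label file: 'cls cx cy w h' with coordinates normalized to [0, 1]."""
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        cls, cx, cy, w, h = line.split()
        cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
        # convert to pixel-space corner coordinates (x1, y1, x2, y2)
        x1 = (cx - w / 2) * img_w
        y1 = (cy - h / 2) * img_h
        x2 = (cx + w / 2) * img_w
        y2 = (cy + h / 2) * img_h
        boxes.append((CLASSES[int(cls)], x1, y1, x2, y2))
    return boxes
```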

2.2.2. Experimental Setup

After the dataset was prepared, the models were built and trained using PyTorch 1.12.0 on the Ubuntu 22.04 system. The training process was accelerated by an NVIDIA GeForce RTX3090Ti GPU, with CUDA 11.3 and CUDNN 8.2. The CPU used was an Intel(R) Core(TM) i9-12900K.
During training, the key hyperparameter settings were as follows: epochs = 1000, initial learning rate = 0.01, momentum = 0.905, weight decay = 5 × 10⁻⁴ and batch size = 32. The optimizer used was Adam, while all other settings were left at their default values.
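As an illustration of these settings, the sketch below builds the optimizer in PyTorch. It assumes the YOLOv5 convention of reusing the momentum hyperparameter as Adam's first beta coefficient; the authors' exact training script may differ.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Hyperparameters from Section 2.2.2; passing momentum = 0.905 as Adam's beta1
    # follows the YOLOv5 convention and is an assumption about the authors' setup.
    return torch.optim.Adam(
        model.parameters(),
        lr=0.01,
        betas=(0.905, 0.999),
        weight_decay=5e-4,
    )
```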

2.2.3. An Improved Flower Detection Architecture

To accurately count potted Phalaenopsis flowers, a target detector with high performance and robustness to occluded blooms was needed. As a typical one-stage detector, YOLOv5, released by Ultralytics, has demonstrated its advantages in detection precision and model complexity, resulting in its wide application in precision agriculture [31,32,33]. YOLOv5 is divided into five different models based on network depth and feature map width: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. Among these, YOLOv5s strikes a balance among model size, inference speed and precision. Furthermore, when trained on our dataset, YOLOv5s achieved the highest mAP. Therefore, YOLOv5s was selected as our baseline.
As shown in Figure 4a, YOLOv5s uses the CSPDarknet53 architecture [12] with an SPPF layer [34] as the backbone for feature extraction, PANet [35] as the neck for feature fusion and YOLOv3 Head [36] as the head for predicting the location and the class of the targets. Our flower detection model, PA-YOLO, incorporated several novel ideas into the original YOLOv5s architecture in order to improve detection performance, as illustrated in Figure 4b. Firstly, a two-scale detection branch (2SDB) was designed to guide the network to concentrate on medium-scale and small-scale flower targets. Operating at two scales enabled the network to efficiently capture the characteristics of both medium- and small-scale flowers. Secondly, the number of bottlenecks at different C3 stages was optimized to enhance feature fusion and representation capabilities. This indicates that the network excels in extracting and consolidating features from diverse layers of the network, resulting in heightened precision in detections. Finally, an attention-based dynamic head (DyHead) framework [37] was integrated at the head. This assisted the network in prioritizing regions of the image where flowers might be partially obscured, thereby enhancing detection performance in challenging scenarios. These improvements enabled PA-YOLO to surpass YOLOv5s in flower detection due to targeted optimizations.

2.2.4. Design a Two-Scale Detection Branch

In a typical grading scenario, a single potted plant is taken as the detection target, with a certain distance between the camera and the potted flower required to ensure that all buds and blooms, whether visible or not, are fully captured within the field of view. Under this assumption, if target dimensions within an image are measured by their pixel occupancy ratio, then the buds and blooms captured in the acquired image are predominantly small or medium in size. Consequently, the detection branch used to predict large targets contributes little to the detection results. The layers related to large target detection do not provide benefits commensurate with the increase in the number of parameters, instead increasing network redundancy and impeding its ability to learn enhanced feature representation. As such, a two-scale detection branch was designed to replace the original three-scale branch, guiding the network to focus on medium-scale and small-scale flower targets.
As shown in the comparison between Figure 4a,b, the design of the two-scale detection branch involved the following steps. The C3 on the 9th layer of the backbone, which was mainly used to extract features for predicting large targets, was removed. The 22nd-layer convolution block, 23rd-layer concatenation (Concat) and 24th-layer C3 at the neck, which were combined to fuse and enhance features for detecting large targets, were also deleted. The prediction branch with a shape of 20 × 20 × 24 at the head was removed because it was no longer necessary for predicting large-scale targets.
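The effect of this change can be illustrated with a minimal two-scale prediction head (a sketch, not the Ultralytics implementation). For a 640 × 640 input and three classes, each retained branch outputs 3 anchors × (5 + 3) = 24 channels, matching the 80 × 80 × 24 and 40 × 40 × 24 shapes that remain once the 20 × 20 × 24 branch is removed; the input channel widths of 128 and 256 are assumed from the YOLOv5s neck.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 3                            # bud, normal bloom, occluded bloom
NUM_ANCHORS = 3                            # anchors per grid cell
OUT_CH = NUM_ANCHORS * (5 + NUM_CLASSES)   # 3 * (5 + 3) = 24 channels per scale

class TwoScaleHead(nn.Module):
    """Minimal sketch of a two-scale YOLO head: only the stride-8 (P3) and
    stride-16 (P4) branches are kept; the stride-32 (P5) branch is removed."""
    def __init__(self, ch_p3: int = 128, ch_p4: int = 256):
        super().__init__()
        self.pred_p3 = nn.Conv2d(ch_p3, OUT_CH, kernel_size=1)   # 80 x 80 x 24 for a 640 x 640 input
        self.pred_p4 = nn.Conv2d(ch_p4, OUT_CH, kernel_size=1)   # 40 x 40 x 24

    def forward(self, p3: torch.Tensor, p4: torch.Tensor):
        return self.pred_p3(p3), self.pred_p4(p4)

# Example shapes for a 640 x 640 input
head = TwoScaleHead()
p3 = torch.randn(1, 128, 80, 80)
p4 = torch.randn(1, 256, 40, 40)
out_p3, out_p4 = head(p3, p4)              # (1, 24, 80, 80) and (1, 24, 40, 40)
```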

2.2.5. Optimize the Number of Bottlenecks at Different C3 Stages

In YOLOv5, C3 modules are primarily responsible for learning and enhancing feature representation. As shown in the green rectangular dashed box in Figure 3a, the C3 architecture comprises two paths: the first path employs multiple bottlenecks and a standard convolutional block, while the second passes the input through a single basic convolutional block. The outputs of the two paths are then concatenated to form the final output. As the core component of the C3 module, the bottlenecks employ residual connections [38] and consist of two convolutional layers. The first layer is a 1 × 1 convolution that halves the original number of channels, while the second layer is a 3 × 3 convolution that doubles the number of channels. This dimensionality reduction allows the convolutional kernel to better understand feature information, while increasing the dimensions helps extract more detailed features. As such, the number of bottlenecks is associated with the network’s ability to represent features. To enhance the network’s ability to detect flowers, the number of bottlenecks at different C3 modules in the backbone and neck was optimized.
In the backbone, the number of bottlenecks in C3_2 was increased from 2 to 3, as shown in Figure 5a. In other words, the number of bottlenecks at each C3 stage in the backbone was optimized from 1-2-3 to 1-3-3. This adjustment not only enhanced the extraction of fine-grained features, but also aligned with C3_3, another input of the neck, to maintain an equal number and balance the network’s ability to extract features at different levels of granularity.
In the neck, all C3 modules are attached after concatenation operations to integrate features and further learn the fused features; this is because the neck’s function is to merge shallow graphic features with deep semantic features to obtain more comprehensive features. As a result, adjustments to the number of bottlenecks in the neck should be synchronized. To avoid excessive network depth and numerous redundant gradients, the number of bottlenecks at each C3 stage in the neck was ultimately optimized from 1-1-1 to 2-2-2, as shown in Figure 5b.
It is important to note that the determination of the number of bottlenecks was also based on experimentation. During our tests, various configurations, such as 1-1-1, 1-2-2, 2-2-2 and 1-4-4 in the backbone, and 1-1-1 and 3-3-3 in the neck, were evaluated. Ultimately, it was found that using a 1-3-3 configuration in the backbone and a 2-2-2 configuration in the neck yielded the best detection performance for the network.
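A simplified PyTorch rendering of the C3 module (a sketch, not the Ultralytics code) makes the role of the bottleneck count explicit; the 1-3-3 and 2-2-2 configurations above simply change the `n` passed to each stage, and the channel widths in the final example are assumptions.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Conv + BatchNorm + SiLU, the basic convolutional block used throughout YOLOv5."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """A 1x1 conv halves the channels, a 3x3 conv restores them, with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = Conv(c, c // 2, k=1)
        self.cv2 = Conv(c // 2, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3(nn.Module):
    """Two paths: n stacked bottlenecks vs. a single conv, concatenated and fused."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = Conv(c_in, c_hidden, k=1)
        self.cv2 = Conv(c_in, c_hidden, k=1)
        self.m = nn.Sequential(*(Bottleneck(c_hidden) for _ in range(n)))
        self.cv3 = Conv(2 * c_hidden, c_out, k=1)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

# Backbone C3 stages with the optimized 1-3-3 bottleneck configuration (channel sizes assumed)
backbone_c3 = nn.ModuleList([C3(64, 64, n=1), C3(128, 128, n=3), C3(256, 256, n=3)])
```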

2.2.6. Integrate a Dynamic Head Framework

Previous improvements have focused on improving the overall performance of the model in detecting flowers. However, detecting occluded blooms is also a key issue to be addressed at the algorithmic level in this study. To improve the model’s ability to detect occluded blooms, more potential features must be activated. For example, for blooms whose centers are occluded, more attention should be given to the features of the edges of petals to distinguish them from normal blooms. Inspired by Dai et al. [37], DyHead was integrated to further utilize the high-resolution multiscale semantic information generated by PANet to enhance the activation of occluded bloom features.
DyHead is a unified object detection head that employs a self-attention mechanism to enhance scale awareness across feature levels, spatial awareness across spatial locations and task awareness across output channels. The overall architectural design of DyHead is as follows [37].
Firstly, the features of different levels from the neck are resized toward the median-level features using either upsampling or downsampling. This generates a three-dimensional tensor $F \in \mathbb{R}^{L \times S \times C}$, where $L$ denotes the number of levels from the neck, $S$ is defined as $H \times W$ (with $H$ and $W$ denoting the height and width of the feature) and $C$ denotes the number of channels of the feature.
For the input $F$, as shown in Figure 6, DyHead employs a separated attention function that implements three sequential attentions, each operating on only one dimension:
$$W(F) = \pi_C\big(\pi_S\big(\pi_L(F) \cdot F\big) \cdot F\big) \cdot F \quad (1)$$
where $\pi_L(\cdot)$, $\pi_S(\cdot)$ and $\pi_C(\cdot)$ are three different attention functions for dimensions $L$, $S$ and $C$, respectively.
The scale-aware attention $\pi_L(\cdot)$ is expressed as:
$$\pi_L(F) \cdot F = \sigma\left(f\left(\frac{1}{SC}\sum_{S,C} F\right)\right) \cdot F \quad (2)$$
where $f(\cdot)$ is a linear function implemented approximately by a $1 \times 1$ convolutional layer and $\sigma(\cdot)$ is a hard-sigmoid function.
$\pi_S(\cdot)$ is a spatial-aware attention. Since $S$ consists of two dimensions, $H$ and $W$, this module is decomposed into two steps. The first step involves using deformable convolution [39] to make the attention learning sparse. The second step involves aggregating features across levels at the same spatial locations:
$$\pi_S(F) \cdot F = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k} \cdot F\left(l;\, p_k + \Delta p_k;\, c\right) \cdot \Delta m_k \quad (3)$$
where $K$ denotes the number of sparse sampling locations, $p_k + \Delta p_k$ represents a location shifted by the self-learned spatial offset $\Delta p_k$ and $\Delta m_k$ is a self-learned importance scalar at location $p_k$.
$\pi_C(\cdot)$ represents a task-aware attention function that dynamically switches channels of features on and off for different tasks. The function is defined as follows:
$$\pi_C(F) \cdot F = \max\left(\alpha^1(F) \cdot F_c + \beta^1(F),\ \alpha^2(F) \cdot F_c + \beta^2(F)\right) \quad (4)$$
where $\left[\alpha^1, \alpha^2, \beta^1, \beta^2\right]^{\mathrm{T}} = \theta(\cdot)$ is a learnable hyperfunction that controls the activation thresholds and $F_c$ denotes the feature slice at the $c$-th channel.
The detailed implementation of DyHead is shown in Figure 6. The DyHead block can be stacked multiple times according to Equation (1), in order to achieve different performances. In this study, the attention integration module was stacked four times.
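To make the scale-aware term concrete, the sketch below reduces it to a few lines of PyTorch. It is an illustrative simplification rather than the official DyHead implementation: the stacked level features are mean-pooled over the spatial and channel dimensions, passed through a 1 × 1 convolution standing in for the linear function f, gated by a hard sigmoid and used to reweight each level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareAttention(nn.Module):
    """Scale-aware attention: reweights the L feature levels with a hard-sigmoid gate."""
    def __init__(self, levels: int):
        super().__init__()
        # f(.) approximated by a 1x1 convolution acting on the level dimension
        self.f = nn.Conv1d(levels, levels, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, L, S, C) -- the resized neck features stacked along the level axis
        pooled = feats.mean(dim=(2, 3))                       # (B, L): 1/(S*C) * sum over S and C
        gate = F.hardsigmoid(self.f(pooled.unsqueeze(-1)))    # (B, L, 1)
        return feats * gate.unsqueeze(-1)                     # broadcast the gate over S and C

# Example: 3 neck levels resized to a common 40 x 40 grid with 128 channels
x = torch.randn(2, 3, 40 * 40, 128)
attn = ScaleAwareAttention(levels=3)
y = attn(x)          # same shape, each level rescaled by its learned gate
```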

2.2.7. Evaluation Metrics

In this study, the detection performance of the models was comprehensively evaluated using precision (P), recall (R), F1 score (F1), AP and mAP as metrics. Among these, mAP is the mean value of AP of all classes under a given intersection over union (IoU) threshold. F1 is the harmonic mean of precision and recall. Higher mAP and F1 scores indicate a better detection performance of the network. These metrics are defined by the following equations:
$$P = \frac{TP}{TP + FP} \times 100\% \quad (5)$$
$$R = \frac{TP}{TP + FN} \times 100\% \quad (6)$$
$$F1 = \frac{2 \times P \times R}{P + R} \quad (7)$$
$$AP = \int_{0}^{1} P(R)\, dR \quad (8)$$
$$mAP = \frac{1}{3}\sum_{i=1}^{3} AP_i \quad (9)$$
where, in a specific task, $TP$ (true positives), $FP$ (false positives) and $FN$ (false negatives) refer to correct detections, false detections and missed detections, respectively.
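These definitions translate directly into code. The helper below is a minimal sketch (not the paper's evaluation script): it computes precision, recall and F1 from TP/FP/FN counts and averages per-class APs into mAP; the AP values themselves would come from integrating the precision-recall curve at the chosen IoU threshold, and the numbers in the usage example are illustrative only.

```python
from typing import Sequence

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision and recall as percentages, plus F1, from detection counts."""
    p = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    r = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """mAP as the mean of the per-class APs (three classes in this study)."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative values only, not results from the paper
p, r, f1 = precision_recall_f1(tp=180, fp=12, fn=15)
map50 = mean_average_precision([0.954, 0.962, 0.919])
```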
In addition, to evaluate the complexity of the proposed model, the number of parameters (Params) and floating-point operations (FLOPs) were introduced. Params represents the total number of trainable parameters in a model, while FLOPs represents the total number of floating-point operations required by the model; to some extent, they indicate the space complexity and the time complexity, respectively.

2.3. Flower Counting

PA-YOLO enables the detection of flowers, but detection based on a single view cannot resolve all cases of heavy occlusions and overlaps, even with the classification of occluded blooms into a separate class and improvements to the original model’s detection performance for occluded blooms at the algorithmic level. To address this problem, a flower counting method using PA-YOLO based on multiple viewpoints was proposed.

2.3.1. The Optimal Number of Viewpoints

Increasing the number of viewpoints can improve counting accuracy by making blooms that are occluded or overlapped in one viewpoint visible in another. Consequently, it is generally believed that the more viewpoints there are, the better the results. However, given hardware costs and the potential impact of incorrect detections, it is crucial to find the optimal number of viewpoints for our counting method.
Capturing images of a single sample at one-degree intervals in a circular manner can effectively provide a comprehensive representation of all viewpoints of the sample. On this premise, our approach involved grouping the 360-degree images of each of the remaining 23 potted Phalaenopsis samples into W groups (where W = 360/V and V is the number of viewpoints), with each group containing V images. PA-YOLO was then used to detect and count flowers in all images within each group. The maximum values of the blooms and the buds were determined by comparing the results from the images within a group. These maximum values were used as the final counting results for that group and compared with the actual values. After all W groups under a given number of viewpoints V had been counted, the number of viewpoints V (up to a total of six viewpoints) was changed and the above process was repeated. In this way, the overall flower counting results of each sample under different numbers of viewpoints were statistically obtained. The counting strategy used in the counting process is referred to as the “maximum values within a group” strategy. During this process, counting accuracy is used as the only evaluation criterion. The viewpoint counting accuracy (VCA) is defined as:
$$VCA = \frac{N_{CG}}{N_{AG}} \times 100\% \quad (10)$$
where $N_{CG}$ is the number of correct groups and $N_{AG}$ is the number of all groups counted at a given number of viewpoints.
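The grouping and the “maximum values within a group” strategy can be sketched as follows. The `detect` function is a hypothetical stand-in for running PA-YOLO on one viewpoint image and returning per-class counts, and the even spacing of the V viewpoints within each group (offset by one degree per group) reflects one reading of the grouping described above.

```python
from typing import Callable, Dict, Sequence

def count_group(images: Sequence, detect: Callable[[object], Dict[str, int]]) -> Dict[str, int]:
    """'Maximum values within a group': take the per-class maximum over a group's viewpoints."""
    counts = [detect(img) for img in images]
    return {cls: max(c[cls] for c in counts) for cls in ("bud", "bloom")}

def viewpoint_counting_accuracy(images_360: Sequence, truth: Dict[str, int],
                                v: int, detect: Callable) -> float:
    """VCA for one sample: the fraction of W = 360/V groups whose counts match the ground truth."""
    w = 360 // v
    correct = 0
    for g in range(w):
        # group g holds V viewpoints spaced 360/V degrees apart, offset by g degrees
        group = [images_360[(g + k * w) % 360] for k in range(v)]
        if count_group(group, detect) == truth:
            correct += 1
    return 100.0 * correct / w
```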

2.3.2. Flower Counting Tests

After finding the optimal number of viewpoints, flower counting tests were conducted on a circular conveyor system with a detection chamber to evaluate the effectiveness of our counting method, as shown in Figure 7a. The chamber was configured internally to align with the multiviewpoint imaging system, as depicted in Figure 7b. The test samples, consisting of another batch of 30 finished potted flowers, were the same as the previous batch in terms of variety, source and cultivation time.
The process of collecting images of the samples from three viewpoints can be found in Section 2.1.2. The same counting strategy as in Section 2.3.1 was adopted, but one sample only corresponded to a group of three viewpoints in one counting test. The method was evaluated by sample counting accuracy (SCA). The counting accuracy is expressed as:
$$SCA = \frac{N_{CS}}{N_{AS}} \times 100\% \quad (11)$$
where $N_{CS}$ is the number of correct samples and $N_{AS}$ is the number of all samples counted at the optimal number of viewpoints.

3. Results

3.1. Detection Results of PA-YOLO

3.1.1. Step-by-Step Results

As shown in Figure 8 and Table 2, several successive modifications were made to YOLOv5s. According to Figure 8a, by designing a two-scale detection branch, we increased the model accuracy by 0.34% F1 and 0.6% mAP50, while significantly reducing the number of parameters. Optimizing the number of bottlenecks improved the model accuracy by 0.2% F1 and 0.5% mAP50, at a small cost to model parameters. Integrating DyHead boosted the performance by 0.25% F1 and 0.5% mAP50, at a small complexity cost. The combined effect of these modifications resulted in PA-YOLO’s performance being 0.79% F1 and 1.6% mAP50 higher than the baseline. The model F1 and mAP50 finally reached 92.74% and 95.4%, respectively, demonstrating exceptional detection performance for potted Phalaenopsis flowers. More importantly, as shown in Figure 8b, these modifications resulted in a cumulative increase of 4.1% in the AP of occluded blooms, with individual increases of 1.4%, 1.8% and 0.9%, respectively. This increase in AP accounted for most of the rise in mAP50, indicating that PA-YOLO’s new architecture significantly improved its ability to detect occluded blooms.

3.1.2. Ablation Study of the Three Improvements

To thoroughly examine the impact of combining various improvements on model performance and to validate the effectiveness of each individual improvement, an ablation study was conducted by removing each of the three improvements. The outcomes of the ablation study for the different improvements are presented in Table 3.
As shown in Table 3, when the improvements were sequentially removed, the F1 values decreased by 0.34%, 0.56% and 0.25%, respectively, while the mAP50 values decreased by 0.8%, 0.9% and 0.5%, respectively. These results indicate that all three components contributed to the enhancement of model performance.
Upon further observation of the Params and FLOPs metrics in Table 2 and Table 3, it is evident that both the two-scale detection branch and the optimization of the number of bottlenecks (ONoB) are straightforward and effective improvements. These two improvements significantly boosted model detection performance with lower hardware cost overheads. In contrast, while the integration of DyHead did increase the model’s parameter size somewhat compared with the first two components, its reduction relative to the baseline remained significant at 28.61%. Furthermore, according to Table 2, DyHead visibly improved the overall detection performance of the model, particularly in terms of changes in mAP50:95, where the increase upon integrating DyHead compared to its absence was 2.2%.
It should also be emphasized that DyHead provided targeted improvements for occluded blooms, achieving a 1.19% increase in F1 for occluded blooms, as shown in Figure 8b. According to Table 2, compared to 83.9% without DyHead, its integration raised the recall for occluded blooms by 3.2% to 87.1%. This indicates that the increase in F1 for occluded blooms stemmed from a substantial increase in recall without sacrificing precision. The enhanced recall ensured the accuracy of the flower counting process.
To better understand each improvement, Gradient-weighted Class Activation Mapping (Grad-CAM) [40] was employed to visualize the differences in network feature extraction for each improvement. The Grad-CAM technique utilizes the gradient information from the feature map before the prediction layer in the network to determine the significance of each feature point for target detection. As can be observed in Figure 9a,b, the two-scale detection branch eliminated the excess parts activated in the original background, indicating that this improvement enabled the network to focus more on targets. The difference between Figure 9b,c is that the features of another occluded bloom next to the predicted occluded bloom were deeply activated, demonstrating the enhancement of feature representation capability brought about by optimizing the number of bottlenecks. The change between Figure 9c,d is that the two originally adhered occluded blooms were separated and the features of the bloom at the edge were more activated, albeit to a lesser degree; this sufficiently demonstrates that the integration of DyHead directed more attention toward features conducive to recognizing occluded blooms.

3.1.3. Comparison with Other Representative Deep Learning Algorithms

To thoroughly validate the performance of the proposed detection algorithm, PA-YOLO was benchmarked against representative target detection algorithms, including one-stage networks such as SSD [41], RetinaNet [42], the YOLO series and RTMDet [43]; two-stage networks such as Faster R-CNN [10] and Cascade R-CNN [44]; and popular transformer-based models such as Swin Transformer [45] and DETR [46]. Experiments were conducted using a consistent training set on identical hardware, with the models evaluated on the same test set.
The comparison results are presented in Table 4. The models in the table were separated into two classes based on their input resolution. The first category comprised models with high input resolutions, such as (800, 1333) and (896, 896), while the second category consisted of models with an input resolution of (640, 640). All models employed proportional scaling to resize the original images.
The top-performing models at each resolution were identified and are presented in Figure 10. Compared to the models with high input resolutions, PA-YOLO exhibited superior performance on the mAP50 metric, even with a lower input resolution of (640, 640), achieving a value of 95.4%. This value was 1.5% higher than that of the best-performing Deformable DETR [47]. In terms of detection performance for occluded blooms, PA-YOLO surpassed the best-performing EfficientNet [48] by 2.2%, achieving a score of 91.9%. Furthermore, given that the models adapted to high input resolutions typically have larger model sizes and computational demands, the advantages of PA-YOLO were particularly noteworthy. It is worth mentioning that, despite being limited to a maximum resolution of (512, 512), lower than (640, 640), SSD achieved an accuracy of only 89.3% while having a model complexity more than four times larger than that of PA-YOLO.
Among the YOLO series models with the same input resolution, including RTMDet with YOLOX [49] as the baseline, PA-YOLO achieved the highest average precision in both overall detection and the detection of occluded blooms. Compared to YOLOv7-tiny, which had a comparable model complexity, PA-YOLO led by 0.9% on mAP50 and by 5.9% on mAP50:95. Compared to YOLOv7, which had the closest detection performance, PA-YOLO maintained an advantage of more than six times in model complexity while also demonstrating better detection performance for occluded blooms.
Figure 11 shows the detection instances of the four best models. For the three detected samples, as shown in Figure 11a, YOLOv5s exhibited cases of false detection, duplication detection and missed detection, respectively, as indicated by the red arrows and red dashed circles. Deformable DETR, which had a detection performance comparable to that of YOLOv5s, did not correct these incorrect detections. In contrast, YOLOv7, which had the second-best detection performance, effectively resolved the incorrect detections observed with YOLOv5s but exhibited additional cases of duplication detection and missed detection, as illustrated in Figure 11c. Similarly, PA-YOLO effectively corrected all cases of incorrect detections observed with YOLOv5s. However, in comparison to YOLOv7, PA-YOLO exhibited only one case of incorrect duplicate detection of an occluded bloom, as shown in Figure 11d; this is because the annotated occluded blooms in the training data included some similar learning samples and the powerful feature learning ability of PA-YOLO made it overly sensitive to such samples. This issue can be addressed by increasing the number of learning samples for that viewpoint and adjusting the boundaries of the annotation boxes more strictly.
In conclusion, compared to 27 representative target detection models, PA-YOLO is a high-precision, low-complexity target detection model, demonstrating the best detection performance for Phalaenopsis flowers, especially for occluded blooms. Therefore, PA-YOLO is the optimal choice for accurate flower counting in the grading of potted Phalaenopsis plants.

3.2. Results of the Optimal Number of Viewpoints

By partitioning the 360-viewpoint images from each potted Phalaenopsis sample into groups corresponding to the number of viewpoints, we calculated the viewpoint counting accuracy of the buds and blooms for 23 plant samples under different numbers of viewpoints, as illustrated in Figure 12. Both the buds and blooms exhibited two trends in the counting results. One trend, depicted as blue points in Figure 12a,b, showed that counting accuracy first increased sharply and then decreased slightly, with an inflection point occurring at two viewpoints for buds and three viewpoints for blooms. The other trend, depicted as red points in Figure 12a,b, was that the counting accuracy continued to rise with increasing numbers of viewpoints until reaching 100% at six viewpoints. The observed trends distinctly delineate the implications of duplication detections and missed detections on the process of counting. The detrimental effects of duplication detection on the counting accuracy are highlighted in the first trend. Conversely, the second trend underscores the efficacy of employing multiple viewpoints in mitigating issues related to missed detection. For a comprehensive measure, an average was computed for the viewpoint counting accuracy across all samples, as shown in Figure 13. The highest average counting accuracy occurred at three viewpoints, reaching 96.25% and 93.33% for the buds and blooms, respectively; this indicates that setting the number of viewpoints to three was optimal for counting. Furthermore, even when the accuracy of other numbers of viewpoints was not maximal, it remained significantly higher than that of a single viewpoint. This finding is unsurprising, as increasing the number of viewpoints generally results in improved accuracy. However, the results from this study indicate that a higher number of viewpoints was not always better, as accuracy might decrease slightly as the number of viewpoints increased.

3.3. Flower Counting Test Results

By using PA-YOLO to detect flowers and setting the number of viewpoints to three, we counted the flowers of 30 samples and compared the quantities with the actual numbers of flowers to determine the counting accuracy. Due to the limited number of samples, the test was repeated three times. The final results, presented in Table 5, indicate that the average accuracy of counting using PA-YOLO based on three viewpoints could reach 95.56%, demonstrating the high feasibility of the proposed method in this study.
Figure 14 illustrates two successful counting cases and three of the four unsuccessful counting cases. In viewpoint 1 and viewpoint 3 of Figure 14a, two heavily occluded blooms were not detected, as highlighted by red dashed circles. However, in viewpoint 2, these blooms were successfully detected and the number of flowers was correctly counted. Viewpoint 1 of Figure 14b displayed an overlapped bloom that could not be detected, as depicted by the red dashed circle, while in viewpoint 2 and viewpoint 3, the target was successfully identified, as indicated by the white arrows. These observations demonstrated that the method proposed in this paper could effectively address scenarios with missed detections and achieve accurate flower counting. In contrast, Figure 14c–e show three key viewpoints resulting in unsuccessful counting cases under three viewpoints. All unsuccessful cases were attributed to the presence of duplication detections. The cases in Figure 14c,d resulted from similarities in color and shape, leading to the misidentification of a small part of the exposed root and a back-facing bloom as buds. The case in Figure 14e resembled the scenario depicted in sample 1 of Figure 11d, where the model’s robust feature learning capability rendered it overly sensitive to occluded blooms. Furthermore, this bloom differed in shape from the others, with its lower petal curling backward and its overall form being less compact, causing the model to perceive it as two separate blooms. The remaining unsuccessful counting case, not shown in Figure 14, was almost the same as the case in Figure 14e.
In summary, the counting method proposed by this study achieved high accuracy and effectively addressed issues of occlusion and overlap.

4. Discussion

This study provides important insights into automatic counting operations for flowers in potted Phalaenopsis plants in grading scenarios, especially for the optimizations targeted on flower detection and the counting of flowers based on multiple viewpoints.
To improve detection precision, PA-YOLO was proposed for flower detection by designing a two-scale detection branch, optimizing the number of bottlenecks and integrating DyHead. Compared with YOLOv5s, PA-YOLO increased mAP by 1.6% and the AP of occluded blooms by 4.1%. Finally, PA-YOLO achieved a mAP of 95.4% and an AP of 91.9% for occluded blooms on our Phalaenopsis flower dataset, both of which were the highest in comparison with 27 representative target detection models. In comparison to the research conducted by Chang et al. [28], which also centered on the detection of Phalaenopsis flowers, our proposed model demonstrated significant improvements in accuracy and model complexity, despite both studies utilizing the YOLO series as a baseline. This enhancement in performance can be attributed not only to the advanced training strategies and design concepts inherent to YOLOv5, but also to our specific optimizations. Secondly, Chang’s study acquired potted Phalaenopsis images from side views and top views, while our model was trained to detect flower targets from all viewpoints within a fixed field of view, thereby enhancing its robustness to viewpoint variations. However, a top view is preferable for potted flowers with less dense flowers or less complex hierarchical structures. Furthermore, our study categorized occluded blooms separately for optimization, thereby enhancing our model’s applicability to complex scenarios with densely populated blooms. It is noteworthy that Chang et al. conducted preliminary research on grading in their study, which warrants further exploration in future studies. In summary, our model effectively fulfills the requirements of actual grading applications, excelling in both detection speed and precision.
Our model demonstrated commendable detection results, primarily due to our targeted optimization strategies. Firstly, the two-scale detection branch guided the network to focus on medium-scale and small-scale flower targets. Essentially, this operation enables the network to more effectively capture the features of targets of the corresponding sizes. While the common practice in model improvement is to add more detection branches, efficiency can at times be enhanced by removing redundant scale detection branches. This not only simplifies the network structure but also has the potential to enhance network performance by eliminating redundant gradients and facilitating the learning of targeted features. This concept holds promise for extension to other detection tasks involving medium and small targets. Secondly, the DyHead component of our model directed greater attention toward edge features, which are crucial for recognizing occluded blooms, thereby improving detection performance in challenging scenarios. Furthermore, treating occluded blooms as a separate category for optimization is a concept based on task specialization.
In our study, PA-YOLO effectively mitigated incorrect detections of occluded blooms. Compared with YOLOv5s, PA-YOLO’s F1 value for occluded blooms increased by 2.34%. Furthermore, Figure 11 illustrates PA-YOLO’s capability to rectify false detections, duplication detections and missed detections of occluded blooms. In particular, Figure 11d shows that PA-YOLO accurately identified all occluded blooms of sample 3 even under high occlusion, while other well-performing models failed to do so. Thus, PA-YOLO could effectively detect lightly occluded and some heavily occluded blooms. However, similar to other advanced target detection models, PA-YOLO has limitations in detecting other heavily occluded or overlapped blooms. For instance, as shown in Figure 11b, there are still some undetected occluded blooms that only reveal a small area. Detecting these targets in two-dimensional images poses a challenge and it is common practice to ignore these targets during the labeling stage to minimize the risk of duplication detection. To solve this problem, multiple viewpoints were introduced in this study.
Three viewpoints proved to be the optimal number of viewpoints. Our results showed that, as expected, increased viewpoints resulted in better counting accuracy compared to a single viewpoint. However, it was also found that more viewpoints did not necessarily yield better counting results. On the one hand, the overhead of hardware costs must be considered. On the other hand, the presence of duplicate detection instances played a decisive role. Duplicate detection caused changes in the maximum values of blooms or buds, resulting in counting errors. Thus, duplicate detection is intolerable for our counting method. Exploring more reasonable counting strategies is an effective solution direction.
Expanding on the results discussed earlier, this study conducted three counting tests from three viewpoints on an additional batch of 30 samples. The findings revealed a counting accuracy of 95.56% for the 90 samples. By comparison, Houtman et al. [30] reported a maximum counting accuracy of 92% for 71 Phalaenopsis plants, allowing for a margin of one flower. To be more precise, the actual accuracy was 61%. Their study did not discuss the settings of the number of viewpoints. They recorded a set of 20 images within 210 degrees. In determining the optimal number of viewpoints, our study captured three evenly divided viewpoints within 360 degrees. More viewpoints mean more computational resources are invested, especially considering that MHT requires additional online computational resources. At the same time, MHT-based counting approaches necessitate the artificial design of specialized models to reduce false positive detections. This requirement significantly raises the threshold for its application. Our counting method effectively solves the above challenges while fully considering occlusions and overlaps.
While the effectiveness of our counting method has been validated, there is still room for improvement. Firstly, due to limitations in the supply of experimental materials, PA-YOLO has only been tested on potted Phalaenopsis plants of the “Big Chili” variety, and its detection performance on other Phalaenopsis varieties and other ornamental flowers needs further investigation. Generative adversarial networks could be used to expand the dataset for training better-performing detection models for counting [50]. Secondly, our counting method relies on multiple viewpoints; in this study, a fixed camera was employed to capture multiple viewpoints of rotating potted plants. In an actual grading scenario, this was implemented by installing a turntable on a circular conveyor; however, the method’s efficiency can be further enhanced. To achieve this, a method of installing multiple cameras along a conveyor belt could be adopted; this requires further research in combination with a specific production scene. In addition, the current counting method utilized only the one of the three viewpoint images with the maximum count to simplify the counting task. This method assumes that all buds and blooms are visible from at least one of these three viewpoints. However, this assumption may not hold when facing potted flowers of other varieties with complex structures and high flowering density, limiting the extension of this method. One potential solution involves integrating three-dimensional (3D) imaging, allowing 2D detections to be projected into a global 3D space for counting purposes. As a result, future research should concentrate on developing interdisciplinary solutions that can further solve the flower counting problem in an environment closer to reality.

5. Conclusions

Accurate flower counting is a crucial step in flower grading. To address the issues of low detection precision, occlusion and overlap for flower counting of potted Phalaenopsis plants, in this study we proposed a counting method based on the PA-YOLO algorithm and multiviewpoint imaging. This method achieved precise bloom and bud counting, making it an efficient and effective tool for recognizing flowering conditions on potted plants. The main conclusions are as follows:
(1) To improve detection precision, a new architecture, PA-YOLO, was proposed for flower detection by designing a two-scale detection branch, optimizing the number of bottlenecks and integrating a dynamic head. A total of 1200 original images (1944 × 2592 pixels) were collected as the potted Phalaenopsis dataset. Using an NVIDIA RTX 3090Ti GPU, the mAP50 of PA-YOLO on the test set reached 95.4%, with an AP of 91.9% for occluded blooms, both of which were the highest among the 27 compared models.
(2) By detecting grouped 360-viewpoint images according to the number of viewpoints and counting flowers based on the “maximum values within a group” strategy, it was found that at a camera height of 0.5 m and a working distance of 1.3 m, the highest average counting accuracy for buds and blooms was achieved at three viewpoints, with scores of 96.25% and 93.33%, respectively. By setting the number of viewpoints to three, this study finally achieved an average counting accuracy of 95.56% in flower counting tests, effectively solving the counting problems under occlusions and overlaps.
Moving forward, this method will be integrated into the vision component of a sophisticated potted flower grading system for further verification in practical production.

Author Contributions

Conceptualization, methodology, formal analysis, investigation, writing—review and editing, Y.Y. and G.Z.; software, validation and data curation, G.Z.; validation and resources, S.M.; methodology and investigation, Z.W. and H.L.; investigation, supervision, project administration and funding acquisition, S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Guangdong Provincial Agricultural Science and Technology Innovation and Extension Project (Grant No. 2023KJ131), Key-Area Research and Development Program of Guangdong Province (Grant No. 2019B020214005) and China Scholarship Council (Grant No. 202107630007).

Data Availability Statement

The datasets in this study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yuan, S.-C.; Lekawatana, S.; Amore, T.D.; Chen, F.-C.; Chin, S.-W.; Vega, D.M.; Wang, Y.-T. The Global Orchid Market. In The Orchid Genome; Chen, F.-C., Chin, S.-W., Eds.; Compendium of Plant Genomes; Springer International Publishing: Cham, Switzerland, 2021; pp. 1–28. ISBN 978-3-030-66826-6.
2. Hsu, C.-C.; Chen, H.-H.; Chen, W.-H. Phalaenopsis. In Ornamental Crops; Van Huylenbroeck, J., Ed.; Handbook of Plant Breeding; Springer International Publishing: Cham, Switzerland, 2018; pp. 567–625. ISBN 978-3-319-90698-0.
3. Cardim Ferreira Lima, M.; Damascena de Almeida Leandro, M.E.; Valero, C.; Pereira Coronel, L.C.; Gonçalves Bazzo, C.O. Automatic Detection and Monitoring of Insect Pests—A Review. Agriculture 2020, 10, 161.
4. Koirala, A.; Walsh, K.B.; Wang, Z.; McCarthy, C. Deep Learning—Method Overview and Review of Use for Fruit Detection and Yield Estimation. Comput. Electron. Agric. 2019, 162, 219–234.
5. Aleya, K.F.; Samanta, D. Automated damaged flower detection using image processing. J. Glob. Res. Comput. Sci. 2010, 4, 21–24.
6. Aggelopoulou, A.D.; Bochtis, D.; Fountas, S.; Swain, K.C.; Gemtos, T.A.; Nanos, G.D. Yield Prediction in Apple Orchards Based on Image Processing. Precis. Agric. 2011, 12, 448–456.
7. Horton, R.; Cano, E.; Bulanon, D.; Fallahi, E. Peach Flower Monitoring Using Aerial Multispectral Imaging. J. Imaging 2017, 3, 2.
8. Wang, Z.; Verma, B.; Walsh, K.B.; Subedi, P.; Koirala, A. Automated Mango Flowering Assessment via Refinement Segmentation. In Proceedings of the 2016 International Conference on Image and Vision Computing New Zealand (IVCNZ), IEEE, Palmerston North, New Zealand, 21–22 November 2016; pp. 1–6.
9. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
12. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
13. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv 2022, arXiv:2207.02696.
14. Dhaka, V.S.; Meena, S.V.; Rani, G.; Sinwar, D.; Kavita, K.; Ijaz, M.F.; Woźniak, M. A Survey of Deep Convolutional Neural Networks Applied for Prediction of Plant Leaf Diseases. Sensors 2021, 21, 4749.
15. Mohimont, L.; Alin, F.; Rondeau, M.; Gaveau, N.; Steffenel, L.A. Computer Vision and Deep Learning for Precision Viticulture. Agronomy 2022, 12, 2463.
16. Jiang, Y.; Li, C.; Xu, R.; Sun, S.; Robertson, J.S.; Paterson, A.H. DeepFlower: A Deep Learning-Based Approach to Characterize Flowering Patterns of Cotton Plants in the Field. Plant Methods 2020, 16, 156.
17. Wu, D.; Lv, S.; Jiang, M.; Song, H. Using Channel Pruning-Based YOLO v4 Deep Learning Algorithm for the Real-Time and Accurate Detection of Apple Flowers in Natural Environments. Comput. Electron. Agric. 2020, 178, 105742.
18. Tian, Y.; Yang, G.; Wang, Z.; Li, E.; Liang, Z. Instance Segmentation of Apple Flowers Using the Improved Mask R–CNN Model. Biosyst. Eng. 2020, 193, 264–278.
19. Shang, Y.; Xu, X.; Jiao, Y.; Wang, Z.; Hua, Z.; Song, H. Using Lightweight Deep Learning Algorithm for Real-Time Detection of Apple Flowers in Natural Environments. Comput. Electron. Agric. 2023, 207, 107765.
20. Qi, C.; Gao, J.; Pearson, S.; Harman, H.; Chen, K.; Shu, L. Tea Chrysanthemum Detection under Unstructured Environments Using the TC-YOLO Model. Expert Syst. Appl. 2022, 193, 116473.
21. Palacios, F.; Bueno, G.; Salido, J.; Diago, M.P.; Hernández, I.; Tardaguila, J. Automated Grapevine Flower Detection and Quantification Method Based on Computer Vision and Deep Learning from On-the-Go Imaging Using a Mobile Sensing Platform under Field Conditions. Comput. Electron. Agric. 2020, 178, 105796.
22. Mu, Y.; Chen, T.-S.; Ninomiya, S.; Guo, W. Intact Detection of Highly Occluded Immature Tomatoes on Plants Using Deep Learning Techniques. Sensors 2020, 20, 2984.
23. Lin, J.; Li, J.; Yang, Z.; Lu, H.; Ding, Y.; Cui, H. Estimating Litchi Flower Number Using a Multicolumn Convolutional Neural Network Based on a Density Map. Precis. Agric. 2022, 23, 1226–1247.
24. Fu, L.; Feng, Y.; Wu, J.; Liu, Z.; Gao, F.; Majeed, Y.; Al-Mallahi, A.; Zhang, Q.; Li, R.; Cui, Y. Fast and Accurate Detection of Kiwifruit in Orchard Using Improved YOLOv3-Tiny Model. Precis. Agric. 2021, 22, 754–776.
25. Sun, K.; Wang, X.; Liu, S.; Liu, C. Apple, Peach, and Pear Flower Detection Using Semantic Segmentation Network and Shape Constraint Level Set. Comput. Electron. Agric. 2021, 185, 106150.
26. Farjon, G.; Krikeb, O.; Hillel, A.B.; Alchanatis, V. Detection and Counting of Flowers on Apple Trees for Better Chemical Thinning Decisions. Precis. Agric. 2020, 21, 503–521.
27. Wang, D.; He, D. Channel Pruned YOLO V5s-Based Deep Learning Approach for Rapid and Accurate Apple Fruitlet Detection before Fruit Thinning. Biosyst. Eng. 2021, 210, 271–281.
28. Chang, Y.-W.; Hsiao, Y.-K.; Ko, C.-C.; Shen, R.-S.; Lin, W.-Y.; Lin, K.-P. A Grading System of Pot-Phalaenopsis Orchid Using YOLO-V3 Deep Learning Model. In Advances in Networked-Based Information Systems; Barolli, L., Li, K.F., Enokido, T., Takizawa, M., Eds.; Advances in Intelligent Systems and Computing; Springer International Publishing: Cham, Switzerland, 2021; Volume 1264, pp. 498–507. ISBN 978-3-030-57810-7.
29. Wang, J.; Gao, Z.; Zhang, Y.; Zhou, J.; Wu, J.; Li, P. Real-Time Detection and Location of Potted Flowers Based on a ZED Camera and a YOLO V4-Tiny Deep Learning Algorithm. Horticulturae 2021, 8, 21.
30. Houtman, W.; Siagkris-Lekkos, A.; Bos, D.J.M.; van den Heuvel, B.J.P.; den Boer, M.; Elfring, J.; van de Molengraft, M.J.G. Automated Flower Counting from Partial Detections: Multiple Hypothesis Tracking with a Connected-Flower Plant Model. Comput. Electron. Agric. 2021, 188, 106346.
31. Ma, J.; Lu, A.; Chen, C.; Ma, X.; Ma, Q. YOLOv5-Lotus an Efficient Object Detection Method for Lotus Seedpod in a Natural Environment. Comput. Electron. Agric. 2023, 206, 107635.
32. Rong, J.; Zhou, H.; Zhang, F.; Yuan, T.; Wang, P. Tomato Cluster Detection and Counting Using Improved YOLOv5 Based on RGB-D Fusion. Comput. Electron. Agric. 2023, 207, 107741.
33. Zhang, D.-Y.; Luo, H.-S.; Wang, D.-Y.; Zhou, X.-G.; Li, W.-F.; Gu, C.-Y.; Zhang, G.; He, F.-M. Assessment of the Levels of Damage Caused by Fusarium Head Blight in Wheat Using an Improved YoloV5 Method. Comput. Electron. Agric. 2022, 198, 107086.
34. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
35. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
36. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  37. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7369–7378. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  39. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  40. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  41. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  42. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  43. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
  44. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  45. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  46. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  47. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2021, arXiv:2010.04159. [Google Scholar]
  48. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 24 May 2019; pp. 6105–6114. [Google Scholar]
  49. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  50. Lu, Y.; Chen, D.; Olaniyi, E.; Huang, Y. Generative Adversarial Networks (GANs) for Image Augmentation in Agriculture: A Systematic Review. Comput. Electron. Agric. 2022, 200, 107208. [Google Scholar] [CrossRef]
Figure 1. Flowchart of the related work of our study: (a) preparation of image data; (b) flower detection; and (c) flower counting.
Figure 2. Multiviewpoint imaging system: (a) sketch of the system; and (b) actual working scene of the system.
Figure 3. Example images of a potted Phalaenopsis sample: (a) a collection of images from six viewpoints; (b) an example of an annotation, where the yellow, red and green rectangular boxes correspond to the classes of normal blooms, occluded blooms and buds, respectively; (c) an example of occlusion, where the red rectangular box with an arrow indicates a slightly occluded bloom and the red boxes without an arrow denote severely occluded blooms; and (d) an example of overlap, where the red circle highlights the only visible edge of a bloom that overlaps with a normal bloom marked by a yellow box.
Figure 4. Model structure of (a) YOLOv5s and (b) PA-YOLO.
Figure 5. Optimization of the number of bottlenecks at different C3 stages: (a) optimization in the backbone; (b) optimization in the neck. The number of adjusted bottlenecks is marked in red.
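For readers unfamiliar with the C3 stages adjusted in Figure 5, the sketch below shows a simplified YOLOv5-style C3 block in which the number of stacked bottlenecks, n, is an explicit argument; this count is the quantity tuned per stage. The layer definitions are illustrative, based on the publicly documented YOLOv5 design, and are not the authors' exact code.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """1x1 or 3x3 convolution followed by batch normalization and SiLU."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual 1x1 -> 3x3 bottleneck, as stacked inside a C3 stage."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNAct(c, c, 1)
        self.cv2 = ConvBNAct(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3(nn.Module):
    """C3 stage whose bottleneck count n is the hyperparameter adjusted in Figure 5."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = ConvBNAct(c_in, c_hidden, 1)
        self.cv2 = ConvBNAct(c_in, c_hidden, 1)
        self.cv3 = ConvBNAct(2 * c_hidden, c_out, 1)
        self.m = nn.Sequential(*(Bottleneck(c_hidden) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

# Example: the same stage instantiated with different bottleneck counts.
shallow, deeper = C3(128, 128, n=1), C3(128, 128, n=3)
y = deeper(torch.randn(1, 128, 40, 40))
```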
Figure 6. Detailed implementation of DyHead.
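As a rough companion to Figure 6, the sketch below shows a heavily simplified DyHead-style block that applies scale-, spatial- and task-aware attention in sequence to the neck features. It is an illustrative approximation only: the original dynamic head uses deformable convolution for the spatial step, a dynamic ReLU for the task step and fuses information across pyramid levels, none of which is reproduced here.

```python
import torch
import torch.nn as nn

class SimplifiedDyHeadBlock(nn.Module):
    """Heavily simplified DyHead-style block, for illustration only.

    Simplifications: the spatial-aware step uses an ordinary 3x3 convolutional
    mask instead of deformable convolution, the task-aware step is reduced to
    SE-style channel gating, and pyramid levels are processed independently.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Scale-aware attention: one gate per feature map from pooled features.
        self.scale_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Hardsigmoid(),
        )
        # Spatial-aware attention: a per-pixel mask (simplification).
        self.spatial_attn = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        # Task-aware attention: channel-wise gating (simplification).
        self.task_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: list of feature maps from the neck, each of shape (B, C, H, W).
        out = []
        for x in feats:
            x = x * self.scale_attn(x)                   # scale-aware gate
            x = x * torch.sigmoid(self.spatial_attn(x))  # spatial-aware mask
            x = x * self.task_attn(x)                    # task-aware gate
            out.append(x)
        return out

# Example: gate two pyramid levels with 128 channels each.
block = SimplifiedDyHeadBlock(128)
p3, p4 = torch.randn(1, 128, 80, 80), torch.randn(1, 128, 40, 40)
gated = block([p3, p4])
```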
Figure 7. Conveyor system with a detection chamber for counting Phalaenopsis plants: (a) the overall structure of the circular conveyor; and (b) the internal configuration of the detection chamber.
Figure 8. Step chart of detection results of the three improvements: (a) detection results of all classes; and (b) detection results of occluded blooms. In all figures and tables in this study, OB, 2SDB, OnoB and DyHead refer to occluded blooms, two-scale detection branch, optimization of the number of bottlenecks and dynamic head, respectively.
Figure 9. Grad-CAM visualizations of the same detected target for four models: (a) YOLOv5s; (b) YOLOv5s+2SDB; (c) YOLOv5s+2SDB+ONoB; and (d) YOLOv5s+2SDB+ONoB+DyHead. The heatmap colors indicate how strongly each region attracts the network's attention: red areas have the greatest influence, and the influence gradually lessens as the colors transition from red through yellow to blue.
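The heatmaps in Figure 9 are Grad-CAM visualizations. The snippet below is a minimal, generic Grad-CAM sketch on a torchvision classifier to show how such maps are formed (gradient-weighted feature maps, ReLU, upsampling and normalization); the model, target layer and input tensor are placeholders, and applying the idea to a detector additionally requires choosing a detection score as the backward target.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()
target_layer = model.layer4  # last convolutional stage

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)   # placeholder image tensor
score = model(x)[0].max()         # score of the top predicted class
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)            # pool gradients per channel
cam = F.relu((weights * feats["v"]).sum(dim=1, keepdim=True))  # weighted sum + ReLU
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalize to [0, 1]
```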
Figure 10. Performance comparison of the best models at different input shapes.
Figure 11. Detection examples of the four best models: (a) YOLOv5s; (b) Deformable DETR; (c) YOLOv7; and (d) PA-YOLO. The blue, green and orange rectangular boxes represent normal blooms, occluded blooms and buds, respectively. The red arrows point to incorrectly detected targets. The red dotted circles highlight undetected targets.
Figure 12. Viewpoint counting accuracy of 23 samples under different numbers of viewpoints: (a) average accuracy for buds, where the red and blue curves represent the trends of 18 and 5 samples, respectively; and (b) average accuracy for blooms, where the red and blue curves represent the trends of 9 and 14 samples, respectively.
Figure 13. Average viewpoint counting accuracy of the buds and blooms of all samples.
Figure 14. Counting cases: (a,b) show two successful counting cases under three viewpoints; (c–e) show three key viewpoints from unsuccessful counting cases under three viewpoints. The blue, green and orange rectangular boxes represent normal blooms, occluded blooms and buds, respectively. The red arrows point to incorrectly detected targets. The red circles mark undetected instances of occlusion or overlap. The white arrows point to corrections of undetected targets. A white dotted box shows an enlargement of a local area.
Table 1. The statistical results of the potted Phalaenopsis dataset.
Item | Ratio | Amount
Training set | 6 | 720
Validation set | 2 | 240
Test set | 2 | 240
Table 2. Step-by-step results on other indicators.
Model | P_OB (%) | R_OB (%) | mAP50:95 (%) | Params (M) | FLOPs (G)
YOLOv5s | 85.6 | 85.7 | 66.9 | 7.02 | 15.8
+2SDB | 88.8 | 83.7 | 67.6 | 4.05 | 13.4
+ONoB | 89.9 | 83.9 | 66.6 | 4.46 | 15.5
+DyHead | 88.9 | 87.1 | 68.8 | 5.01 | 16.7
Table 3. Ablation study of the three improvements.
2SDB | ONoB | DyHead | F1_OB (%) | AP_OB (%) | F1 (%) | mAP50 (%) | Params (M) | FLOPs (G)
 |  |  | 85.65 | 87.8 | 91.95 | 93.8 | 7.02 | 15.8
 |  |  | 86.25 | 89.1 | 92.40 | 94.6 | 8.00 | 19.1
 |  |  | 86.69 | 90.6 | 92.18 | 94.5 | 4.60 | 14.6
 |  |  | 86.80 | 91.0 | 92.49 | 94.9 | 4.46 | 15.5
 |  |  | 87.99 | 91.9 | 92.74 | 95.4 | 5.01 | 16.7
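For reference, the F1_OB values in Table 3 are the harmonic means of the corresponding precision and recall values in Table 2, F1 = 2PR/(P + R); the short check below reproduces the first and last rows.

```python
# Harmonic mean of precision and recall, as used for F1_OB in Table 3.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(f"{f1(85.6, 85.7):.2f}")  # 85.65 -> baseline YOLOv5s row
print(f"{f1(88.9, 87.1):.2f}")  # 87.99 -> full PA-YOLO row
```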
Table 4. Comparison with the representative algorithms at different input shapes.
Model | Input Shape | AP_OB | mAP50 | mAP50:95 | Params (M) | FLOPs (G)
Faster R-CNN | (800, 1333) | 0.870 | 0.926 | 0.640 | 41.36 | 178.0
Cascade R-CNN | (800, 1333) | 0.893 | 0.928 | 0.669 | 69.16 | 205.0
Dynamic R-CNN | (800, 1333) | 0.850 | 0.912 | 0.663 | 41.36 | 178.0
RetinaNet | (800, 1333) | 0.865 | 0.914 | 0.636 | 36.37 | 174.0
CenterNet | (800, 1333) | 0.847 | 0.917 | 0.561 | 32.12 | 167.0
FCOS | (800, 1333) | 0.847 | 0.916 | 0.661 | 33.12 | 125.0
Swin Transformer | (800, 1333) | 0.887 | 0.931 | 0.645 | 44.76 | 183.0
DETR | (800, 1333) | 0.861 | 0.910 | 0.562 | 41.56 | 80.4
Deformable DETR | (800, 1333) | 0.892 | 0.939 | 0.676 | 40.10 | 165.0
YOLOF | (800, 1333) | 0.847 | 0.872 | 0.556 | 42.39 | 83.3
EfficientNet | (896, 896) | 0.897 | 0.923 | 0.657 | 18.38 | 80.3
SSD | (512, 512) | 0.838 | 0.893 | 0.562 | 24.68 | 87.8
YOLOv3 | (640, 640) | 0.868 | 0.917 | 0.573 | 61.54 | 58.1
YOLOX | (640, 640) | 0.870 | 0.924 | 0.634 | 8.94 | 13.3
RTMDet-s | (640, 640) | 0.892 | 0.926 | 0.658 | 8.86 | 14.8
RTMDet-tiny | (640, 640) | 0.890 | 0.927 | 0.666 | 4.87 | 8.0
YOLOv4 | (640, 640) | 0.875 | 0.938 | 0.631 | 5.25 | 118.9
YOLOv5n | (640, 640) | 0.858 | 0.931 | 0.645 | 1.76 | 4.1
YOLOv5s | (640, 640) | 0.878 | 0.938 | 0.669 | 7.02 | 15.8
YOLOv5m | (640, 640) | 0.865 | 0.931 | 0.660 | 2.09 | 47.9
YOLOv5l | (640, 640) | 0.877 | 0.937 | 0.659 | 4.61 | 107.7
YOLOv6n | (640, 640) | 0.875 | 0.934 | 0.627 | 4.63 | 11.3
YOLOv6s | (640, 640) | 0.877 | 0.932 | 0.626 | 18.50 | 45.2
YOLOv7-tiny | (640, 640) | 0.878 | 0.939 | 0.633 | 6.01 | 13.0
YOLOv7 | (640, 640) | 0.909 | 0.947 | 0.680 | 36.49 | 103.2
YOLOv8n | (640, 640) | 0.863 | 0.935 | 0.686 | 3.01 | 8.1
YOLOv8s | (640, 640) | 0.864 | 0.929 | 0.692 | 11.1 | 28.4
PA-YOLO | (640, 640) | 0.919 | 0.954 | 0.690 | 5.01 | 16.7
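The AP-type columns in Tables 3 and 4 follow the standard definition of average precision as the area under the monotonically interpolated precision-recall curve. The function below is a generic, minimal implementation of that definition with a toy curve as input; it is not the evaluation script used in this study.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the interpolated precision-recall curve (all-point AP)."""
    # Append sentinel values and make precision monotonically decreasing.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum the rectangle areas where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy precision-recall curve, for illustration only.
print(average_precision(np.array([0.2, 0.5, 1.0]), np.array([1.0, 0.8, 0.6])))  # 0.74
```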
Table 5. Flower counting test results based on three viewpoints.
Test Number | NCS | NAS | Sample Counting Accuracy (%)
1 | 29 | 30 | 96.67
2 | 28 | 30 | 93.33
3 | 29 | 30 | 96.67
Average | 28.67 | 30 | 95.56
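Assuming NCS and NAS denote the number of correctly counted samples and the total number of samples per test (an interpretation consistent with the listed percentages), each accuracy in Table 5 is NCS/NAS × 100 and the final figure is their mean, as the short check below confirms.

```python
# Quick arithmetic check of Table 5, assuming NCS = correctly counted samples
# and NAS = all samples in a test.
tests = {1: (29, 30), 2: (28, 30), 3: (29, 30)}  # test number -> (NCS, NAS)
per_test = {k: 100.0 * ncs / nas for k, (ncs, nas) in tests.items()}
for k, acc in per_test.items():
    print(f"Test {k}: {acc:.2f}%")                                 # 96.67, 93.33, 96.67
print(f"Average: {sum(per_test.values()) / len(per_test):.2f}%")   # 95.56
```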
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
