Article

Green Sweet Pepper Fruit and Peduncle Detection Using Mask R-CNN in Greenhouses

by Jesús Dassaef López-Barrios, Jesús Arturo Escobedo Cabello *, Alfonso Gómez-Espinosa and Luis-Enrique Montoya-Cavero
Tecnologico de Monterrey, Escuela de Ingenieria y Ciencias, Av. Epigmenio González 500, Fracc. San Pablo, Queretaro 76130, Mexico
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(10), 6296; https://doi.org/10.3390/app13106296
Submission received: 23 March 2023 / Revised: 12 May 2023 / Accepted: 13 May 2023 / Published: 21 May 2023
(This article belongs to the Section Robotics and Automation)

Abstract

In this paper, a mask region-based convolutional neural network (Mask R-CNN) is used to improve the performance of machine vision in the challenging task of detecting peduncles and fruits of green sweet peppers (Capsicum annuum L.) in greenhouses. One of the most complicated stages of the sweet pepper harvesting process is to achieve a precise cut of the peduncle or stem because this type of specialty crop cannot be grabbed and pulled by the fruit since the integrity and value of the product are compromised. Therefore, accurate peduncle detection becomes vital for the autonomous harvesting of sweet peppers. ResNet-101 combined with the feature pyramid network (FPN) architecture (ResNet-101 + FPN) is adopted as the backbone network for feature extraction and object representation enhancement at multiple scales. Mask images of fruits and peduncles are generated, focused on green sweet pepper, which is the most complex color variety due to its resemblance to the background. In addition to bounding boxes, Mask R-CNN provides binary masks as a result of instance segmentation, which would help improve the localization process in 3D space, the next phase of the autonomous harvesting process of sweet peppers, since it isolates the pixels belonging to the object and demarcates its boundaries. The prediction results of 1148 fruits on 100 test images showed a precision rate of 84.53%. The prediction results of 265 peduncles showed a precision rate of 71.78%. The mean average precision rate with an intersection over union at 50 percent (mAP@IoU=50) for model-wide instance segmentation was 72.64%. The average detection time for sweet pepper fruit and peduncle using high-resolution images was 1.18 s. The experimental results show that the proposed implementation manages to segment the peduncle and fruit of the green sweet pepper in real-time in an unmodified production environment under occlusion, overlap, and light variation conditions with effectiveness not previously reported for simultaneous 2D detection models of peduncles and fruits of green sweet pepper.

1. Introduction

Modern agriculture is increasingly moving towards automation, which demands smarter frameworks and technologies [1]. Sweet pepper (Capsicum annuum L.), better known as bell pepper, has acquired economic relevance worldwide [2]. The rapid growth of sweet pepper cultivation areas in China contrasts with the few technological frameworks that adapt to the complex needs of this crop (a complete sweet pepper harvesting framework: detection, localization, and collection). The current generation of sweet pepper detection, localization, and harvesting frameworks has lower accuracy and speed than technologies developed for other crops, which makes this a promising study area. Automated sweet pepper harvesting stands out for its non-competitive research results compared with other economically relevant crops of similar difficulty (occluded scenarios with hidden and overlapping crops, and a background of the same color as the fruits and stems). Research works such as [3] provide clear examples of this phenomenon.
Over the last few decades, researchers have dedicated their efforts to improving the metrics, cycle times, and detection capabilities of technological frameworks for harvesting various relevant crops. The Agrobot strawberry harvesting robot is fully automated [4], with a harvesting efficiency of 3–5 s per fruit. The cucumber harvesting robot of [5], designed for occluded environments, has a harvest success rate of 85% and an efficiency of 8.6 s per fruit. The apple harvesting robot from Abundant Robotics [6] has an average efficiency of 1 s per fruit on trees with a standardized shape. The Robotics Plus kiwi harvesting robot [7] has a 51% success rate and a harvesting efficiency of 5.5 s per fruit.
Similarly, the field of sweet peppers is no exception: the SWEEPER sweet pepper harvesting robot [8] is capable of working day and night in greenhouses with an average harvest success rate of 61% and a harvest efficiency of 24 s per fruit. Ref. [9] proposes a method for segmenting green sweet peppers on remarkably similar colored backgrounds, obtaining a mean average precision of 0.55 (55%). Ref. [10] proposes an adaptive segmentation algorithm for red sweet pepper under various illuminations, reaching a recognition rate of 87%. Ref. [11] establishes a support vector machine (SVM) for the detection of pepper fruit, achieving a recognition rate of 74.2%. Ref. [12] takes color space parameters such as brightness Y, Cb, Cr, hue, and saturation as inputs to a neural network for the recognition of red sweet peppers, reaching 82.16% accuracy.
Recent works have started combining machine learning models with classical machine vision algorithms [11,12]. The main reason is that machine vision algorithms alone are no longer sufficient: their accuracy drops considerably when the ambient lighting changes and generates shadows and other effects [13]. Additionally, in scenarios with occlusion and overlapping objects, machine vision algorithms struggle to find a general solution for detecting the fruits of certain crops and other parts of them [13], for example, the peduncle of sweet peppers grown in greenhouses. The task of detecting sweet pepper fruits and peduncles is difficult even for a human: the targets appear under variable light conditions, vary in size and color (very small to medium), resemble the background, and are occluded by other targets or leaves, as shown in Figure 1.
Thus, in the field of sweet pepper peduncle detection, relevant research has been conducted to solve this difficult task; however, it is still scarcer than that for the sweet pepper fruit. Ref. [14] proposes a 3D detection system for red and green sweet pepper peduncles based on color and shape information in conjunction with an SVM, achieving an area under the curve (AUC) of 0.71 for point-level segmentation. Ref. [14] generates a dataset of manually annotated 3D images of sweet pepper fruits and peduncles, specifically, a total of 72 sweet peppers (61 red, two mixed (red–green), and only nine green). This implementation uses a 79/21 dataset split methodology; ergo, 79% of the images are used for training and 21% for testing. Under these conditions, ref. [14] uses 50 red, only five green, and two mixed sweet peppers for training, and 11 red and only four green sweet peppers for testing. The reported metric (AUC) of [14] is therefore an average of two tests with different amounts of green, red, and mixed sweet peppers. The first test contains, between training and testing, only 28 red sweet peppers, without any other color variety. The second test contains 33 red, two mixed, and only nine green sweet peppers. Ref. [14] shows that the first test obtains the best value of the reported metric, an AUC of 0.74, demonstrating the comparative ease of detecting sweet pepper fruits and peduncles in the red variety. For the second test, where mixed and green sweet peppers are added, the AUC value is 0.67, demonstrating the comparative difficulty of detecting the peduncle of the mixed and green color varieties of sweet pepper. Furthermore, ref. [14] reports a time of around 10 s just to process the input data; adding the internal processing time needed to obtain the total processing time per image shows that the real-time performance of the proposed algorithm needs to be improved.
In addition, ref. [15] similarly proposes a 3D detection method for green and red pepper peduncles based on shape and color information as inputs to a partial least squares discriminant analysis (PLSDA), which achieves an average classification accuracy of 87.88%. Ref. [15] uses the dataset generated by [14] for training and testing, considering 70 samples of red, green, and mixed (red–green) sweet peppers (the main dataset of [15]). Ref. [15] uses two classification models and two subsets with different numbers of sweet pepper color varieties to measure the impact on detection metrics (the main dataset is not split; instead, two subsets are formed randomly, and samples may be repeated between them). Both groups use 42 samples for training and 12 for testing. Group A uses an unbalanced dataset, with 24 red, 12 mixed, and six green sweet pepper samples for training and eight, four, and two samples of red, mixed, and green sweet peppers, respectively, for testing. Group B balances the number of samples per color: 14 for each color variety for training and, for testing, four for red and five each for mixed and green. The first model achieves a classification accuracy of 89.76% for group A, where more relevance is given to the red variety of sweet pepper. However, the same model achieves 77.43% for group B, where the same relevance is given to the red, green, and mixed varieties, a decrease of 12.33%. Likewise, for the second model, from which the best metric values are reported, the classification accuracy reaches 90.03% for group A and decreases to 87.88% for group B, again demonstrating the comparative difficulty of detecting the peduncle in the green variety of sweet pepper. Ref. [15] does not report processing times or related metrics.
Regarding detection methods for other crops, ref. [16] proposes a multitask method for the detection of the main stem, secondary peduncles, and fruits of the tomato bunch (main key points) based on bounding boxes and key points (a multi-task deep cascade learning network). Ref. [17] proposes a method for estimating grape cluster poses based on Mask R-CNN and point cloud segmentation, achieving a mean intersection over union (mIoU) for instance segmentation of 87.9% and, for the detection results, a mean precision of 86.0%, a mean recall of 79.9%, and an F1-score of 0.828. Ref. [18] generates a parallel network structure based on a CNN and a transformer for the instance segmentation of grape clusters and grape peduncles in strongly occluded scenarios, achieving an intersection over union (IoU) value of 72.1% for grape peduncle segmentation and a mIoU value of 83.7% for the general model. Ref. [19] presents a computer vision system for fruit detection and peduncle localization in tomato crops that achieves classification rates of 80.8% and 87.5% for beef and cluster tomatoes, respectively. Ref. [20] proposes the use of the Mask R-CNN algorithm for the detection and segmentation of mature green tomatoes, achieving F1-scores for bounding boxes and mask regions of 92.0% at an IoU of 0.5, using a ResNet50+FPN as the backbone network. Ref. [21] proposes a method for peduncle cutting point localization and pose estimation based on YOLOv4-Tiny as a detector and the YOLACT++ network as an instance segmenter, achieving a detection accuracy of 92.7% and a mAP of 73.1% for segmentation.
After extensive research, it was identified that only [14,15] constitute the state-of-the-art models of sweet pepper peduncle detection and simultaneous detection of sweet pepper fruits and peduncles; ergo, more research is needed to solve this issue.
Traditional and isolated machine vision algorithms are no longer enough for the challenging task of detecting the fruit and peduncle of sweet pepper, and it is necessary to direct efforts to the use of deep neural network (DNN) methods [22], a technology that has proven to be more robust, and capable of solving these problems in a better way.
Compared to previous traditional machine vision methods, deep neural network methods have positioned themselves as the most widely used for crop detection due to their strong and extensive ability to learn autonomously and perform feature extraction [23,24]. Ref. [25] optimizes the structure of the VGGNet model (Visual Geometry Group, a classical convolutional neural network) and generates an eight-layer network intended for the extraction of tomato organ features, for example, flower and fruit. Ref. [26] presents a work based on the LeNet convolutional neural network (CNN), which surpasses other traditional methods in accuracy and speed for the identification of multi-cluster kiwifruit. Ref. [27] adopts the Fast R-CNN model based on multimodal fusion information (near-infrared and RGB) for the detection of sweet peppers. Ref. [28], aiming to achieve real-time detection of apple fruit, improves the YOLO-V3 model and uses the DenseNet method to process the initial layers of low-resolution features; the experimental results showed that the improved model performed better than the Fast R-CNN model and the original YOLO-V3 model. However, the deep learning models mentioned above (R-CNN, Fast R-CNN, Faster R-CNN, and YOLO) can only approximately calculate the position of the object using bounding boxes; ergo, such models are not capable of accurately extracting shape and contour features. Firm fruits such as pears, apples, and citrus, among others, can be harvested by pulling, so identifying the crop's fruit is sufficient. In comparison, sweet peppers are harvested by cutting the stem at the collection point, in this case the peduncle, to avoid damaging the integrity and value of the product. Therefore, high-precision, real-time recognition of the shape and contour of the crop is necessary, with special attention to its peduncle, which indicates that neither machine vision methods alone nor the previously mentioned machine learning methods can meet the detection and processing time requirements of the fruit and peduncle in the autonomous harvesting of sweet peppers. However, ref. [29] first proposed the Mask Region-based Convolutional Neural Network (Mask R-CNN) in 2017, which integrates target detection (bounding boxes) and instance segmentation (binary masks indicating which pixels within a bounding box belong to the detected object), demonstrating its ease of generalization to other tasks and its potential by outperforming all existing single-model entries (at its release date) on every task of the COCO (Common Objects in Context) suite of challenges (person keypoint detection, bounding box object detection, and instance segmentation), including the COCO 2016 challenge winners [29].
In this paper, a method based on Mask R-CNN is proposed for the simultaneous detection of the fruit and peduncle of sweet pepper in its most common color varieties (green, yellow, orange, and red). The method focuses on green sweet pepper, the most complicated color variety due to its resemblance to the environment, using around 14,600 green sweet pepper objects (85% of the dataset), and considers examples in all phases of ripeness. It is capable of working in real-time in a real production environment (in this case, the sweet pepper greenhouse at CAETEC, the Experimental Agricultural Field of the Tec de Monterrey campus Querétaro). In an occluded environment, where the targets are very similar to their background and obstacles such as leaves and branches cannot be removed during the growth process, Mask R-CNN not only recognizes the categories (fruit and peduncle) in about 1 s with medium–high precision and marks the regions of the objects with bounding boxes but also extracts the regions of the objects at the pixel level as a binary mask.
We propose a real-time simultaneous 2D detection method for green sweet pepper peduncles and fruits (the most complex color variety of sweet pepper) based on a Mask R-CNN implementation. The aim is to strengthen the state-of-the-art of simultaneous sweet pepper peduncle and fruit detection methods by using a state-of-the-art instance segmentation model not previously reported for this application (Mask R-CNN), to generate an implementation capable of processing high-resolution two-dimensional images in real-time, and to achieve instance segmentation results not previously reported in this field of research.
The structure of the remainder of this paper is as follows: Section 2 introduces the data acquisition process, the image annotation method, and dataset construction, the Mask R-CNN instance segmentation algorithm, the evaluation metrics, and the implementation of the Mask R-CNN model developed. Section 3 presents the experimental results, analysis, and corresponding discussion. Finally, Section 4 summarizes the conclusions and future work.

2. Materials and Methods

2.1. Image Acquisition

Experimental images were acquired in the sweet pepper greenhouses of CAETEC (Experimental Agricultural Field of the Tec de Monterrey campus Querétaro), shown in Figure 2, in October 2022. The only sensor used to capture the images of the sweet peppers was the Intel RealSense D435i camera. The photographs were taken by placing the camera perpendicular to the crops, at an average height of 1.3 m and an average horizontal shooting distance of 60 cm (Figure 3). These photo-taking conditions were determined by the working materials available inside and outside the greenhouse so that the images could be taken as efficiently and quickly as possible. About 30,000 images were acquired at different periods and under varying light intensity, from which 507 were chosen to create our dataset. The dataset used as a base [30] follows the same conditions and provides 620 images, which, together with those obtained and selected, make up a dataset of 1127 images in total. Both sets of images were stored in PNG format with a resolution of 1280 × 720 pixels. The image acquisition diagram is shown in Figure 3, and a summary of the characteristics of the objects in the datasets generated and used is shown in Table 1.

2.2. Dataset Construction and Annotation

In this experiment, the images were scaled to 1024 × 1024 pixels. To avoid singularities within the set of images obtained (our own and those of the base dataset), it was ensured that the set includes images of sweet peppers under various natural conditions. A total of 1127 images were randomly selected and used for parameter optimization and training of the Mask R-CNN model, following an 80/20 split methodology: 80% as the training set and 20% as the validation set. Once training was completed, 100 random images from our own initial set (the approximately 30,000 initial images) were used to evaluate the performance of the trained model.
Given the need to obtain reliable and relevant results, the 100 images used for evaluation were acquired from our initial set of images, which considers daylight illumination variations (full harvest period), many varied examples of shadows, overlapping fruits and peduncles, leaves, and stems of the main crop, and some examples with distortion from moving images, variation in size, visibility, brightness, and quantity per image of sweet pepper peduncles and fruits (different ripeness and color). In addition, because our focus is on the most complex color variety of sweet pepper, our training and evaluation sets contain mostly examples of green sweet peppers (due to their resemblance to the environment). Therefore, our implementation aims to provide results in a real production environment that is challenging and does not consider ideal examples where lighting is very controlled and other factors are chosen to obtain images that are easier to analyze, based on a realistic and complex dataset that considers most cases of the real production environment. In addition, ref. [30] is the only one that provides a database with the same characteristics; therefore, it is used as a base as described in the previous subsection.
The VGG Image Annotator (VIA) tool, version 2.0.12 (Visual Geometry Group, Oxford, UK) [31], was used to annotate the experimental dataset and generate sweet pepper mask images. The instance segmentation performance of the trained model is evaluated by comparing the annotated mask images against the mask prediction results. The regions of the sweet pepper belonging to its peduncle (or stem) and fruit were labeled, and the remaining region is treated as background by default. Annotated sweet pepper images are shown in Figure 4.
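To illustrate how VIA polygon annotations of this kind can be turned into the per-instance binary masks that Mask R-CNN training expects, the sketch below follows the dataset-loading pattern of the Matterport package used later in Section 2.5. It is a minimal sketch, not the code used in this work: the directory layout, the annotation file name, and the region attribute key ("label") are assumptions.

```python
import json
import os

import numpy as np
import skimage.draw
import skimage.io
from mrcnn import utils


class SweetPepperDataset(utils.Dataset):
    """Loads VIA 2.x polygon annotations as instance masks (illustrative sketch)."""

    def load_sweet_pepper(self, dataset_dir, subset):
        # Two foreground classes; the background class is added by the base Dataset.
        self.add_class("sweet_pepper", 1, "fruit")
        self.add_class("sweet_pepper", 2, "peduncle")
        subset_dir = os.path.join(dataset_dir, subset)  # e.g., "train" or "val"
        # Assumed VIA 2.x export layout: {key: {"filename": ..., "regions": [...]}, ...}
        annotations = json.load(open(os.path.join(subset_dir, "via_annotations.json")))
        for ann in annotations.values():
            polygons = [r["shape_attributes"] for r in ann["regions"]]
            labels = [r["region_attributes"]["label"] for r in ann["regions"]]  # assumed attribute name
            image_path = os.path.join(subset_dir, ann["filename"])
            height, width = skimage.io.imread(image_path).shape[:2]
            self.add_image("sweet_pepper", image_id=ann["filename"], path=image_path,
                           width=width, height=height, polygons=polygons, labels=labels)

    def load_mask(self, image_id):
        # One binary mask channel per annotated polygon, plus its class id.
        info = self.image_info[image_id]
        masks = np.zeros([info["height"], info["width"], len(info["polygons"])], dtype=np.uint8)
        class_ids = []
        for i, (poly, label) in enumerate(zip(info["polygons"], info["labels"])):
            rr, cc = skimage.draw.polygon(poly["all_points_y"], poly["all_points_x"])
            masks[rr, cc, i] = 1
            class_ids.append(1 if label == "fruit" else 2)
        return masks.astype(bool), np.array(class_ids, dtype=np.int32)
```

Each mask channel isolates the pixels of a single fruit or peduncle, which is exactly the ground truth compared against the predicted masks during evaluation.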

2.3. Mask R-CNN Algorithm

The state-of-the-art Mask R-CNN method in instance segmentation is selected for its effectiveness and simplicity [32]. The Mask R-CNN method was developed for object instance segmentation (grouping the pixels that belong to the same object) and extends the Faster R-CNN method [33] by adding a branch that predicts masks for each region of interest (RoI) in parallel [29]. In general, Mask R-CNN is a two-step algorithm: (1) it generates proposals (bounding boxes of candidate objects) after reading the image; and (2) it predicts the bounding box, the class, and a binary mask (instance segmentation) for each RoI [29]. In structural terms, the Mask R-CNN framework consists of the following stages (Figure 5):
  • In the first stage, the backbone network extracts the initial feature maps from the input images: the first part of the backbone is a ResNet (residual learning network) [34] used for feature extraction, and the second part is a Feature Pyramid Network (FPN) (for more information, please refer to [35]) that improves the representation of objects at multiple scales [36];
  • In the second stage, the feature maps obtained from the backbone are sent to the Region Proposal Network (RPN) to generate regions of interest (RoIs);
  • In the third stage, the RoIs obtained from the RPN are mapped to obtain the corresponding features in the shared feature maps, and finally sent to a Fully Convolutional Network (FCN) [37] and a Fully Connected layer (FC), respectively, for instance segmentation and target classification: regarding instance segmentation, an FCN with a RoIAlign layer (refinement of the RoI pooling) [29] and bilinear interpolation is used to predict pixel-by-pixel accurate masks;
  • Before entering the FC (for target classification), RoIAlign is applied to adjust the dimensions of the RoIs to meet the input requirements of the FC (a fixed-size feature map) [13].
For a deeper discussion of the Mask R-CNN algorithm please refer to [29].
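To make the RoIAlign refinement mentioned above more concrete, the following sketch shows the bilinear interpolation it relies on: feature values are sampled at fractional coordinates instead of being snapped to the nearest cell, which is what avoids the quantization of plain RoI pooling. This is a conceptual illustration only, not part of the implementation described in this paper.

```python
import numpy as np


def bilinear_sample(feature_map, y, x):
    """Sample a 2D feature map at fractional coordinates (y, x) by bilinear
    interpolation, as RoIAlign does for each sampling point inside an RoI bin."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom


# Example: the value at (2.3, 5.7) blends the four surrounding feature-map cells.
fm = np.arange(64, dtype=float).reshape(8, 8)
print(bilinear_sample(fm, 2.3, 5.7))  # 24.1: weighted mix of cells (2,5), (2,6), (3,5), (3,6)
```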

2.4. Evaluation Metrics

In machine learning, a confusion matrix [38] classifies the object detection output (bounding boxes, masks, among others) as one of four possibilities [39]: true positive (TP), false positive (FP), true negative (TN), or false negative (FN).
Three analytical criteria are derived from these four possibilities, as defined by (1)–(3): accuracy, precision, and recall. A good balance between precision and recall usually indicates a valid prediction algorithm.
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]
As defined by Equation (4), the F1-score combines precision and recall in a single metric taking its harmonic mean, which we can use as an indicator of general precision [40]. The F1-score can also be used to show the stability of the model. The higher the value, the greater the stability of the model.
\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]
The mean value of the average precision of all classes is the mAP, which is a comprehensive metric based on precision and recall. The mAP is calculated as shown in Equation (5):
\[ \text{mAP} = \frac{1}{n} \sum_{i=1}^{n} \int_{0}^{1} P_i(R_i)\, dR_i \tag{5} \]
where P and R represent precision and recall, respectively; n represents the number of total classes, and i represents the current class.
The precision rate (2), recall rate (3), F1-score (4), and mean average precision (mAP), as defined by Equation (5), were used to evaluate the performance of the trained Mask R-CNN model. These metrics are calculated by comparing the ground truth of the annotated segmentation mask against the segmentation mask predicted by the model, as opposed to the more common use of bounding boxes.
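For reference, the short sketch below computes the precision, recall, and F1-score of Equations (2)-(4) from per-class confusion counts (the counts shown are those reported later in Table 2) and approximates the per-class average precision integral of Equation (5) with a trapezoidal rule over a toy precision-recall curve. For mask-level evaluation at a fixed IoU threshold, the Matterport package also provides a compute_ap utility in mrcnn.utils, which is the kind of routine such an evaluation would typically rely on.

```python
import numpy as np


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def average_precision(precisions, recalls):
    """Approximate AP = integral of P(R) dR with the trapezoidal rule,
    given precision values sampled at increasing recall values."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))


# Per-class confusion counts taken from Table 2 (TP, FP, FN).
for name, (tp, fp, fn) in {"fruit": (907, 166, 241), "peduncle": (145, 57, 120)}.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: precision={p:.4f}, recall={r:.4f}, F1={f1:.4f}")

# Toy precision-recall curve for a single class (illustrative values only).
print(f"AP ~ {average_precision([1.0, 0.9, 0.8, 0.7], [0.1, 0.4, 0.6, 0.8]):.3f}")
```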

2.5. Mask R-CNN Model Implementation

We implemented the Mask R-CNN model using an open-source package built on Keras 2.1.2 and TensorFlow 1.3.0, developed by the Matterport team [41], in an Anaconda environment on Python 3.6. This deep learning framework ran on an ASUS ROG Strix G15 G513 computer with an AMD Ryzen 9 5900HX CPU, an NVIDIA GeForce RTX 3060 Laptop GPU, and 16 GB of RAM at 3200 MHz, on a Windows 10 64-bit operating system.
In the training process, a graphics processing unit (GPU) was used to train the Mask R-CNN model with a ResNet-101 backbone (101 layers), a batch size of 1 image, 450 steps per epoch, learning rates of 0.001 and 0.0001, a learning momentum of 0.9, and a weight decay of 0.0001. For more details, we refer readers to the original repository of this implementation of the Mask R-CNN model on GitHub (https://github.com/matterport/Mask_RCNN, accessed on 1 November 2022). Due to the limited size of our dataset, we decided to use transfer learning and data augmentation techniques. Regarding transfer learning, we trained our model using the pre-trained Mask R-CNN weights for COCO (Common Objects in Context) as a base, because COCO provides a huge amount of training data from which the Mask R-CNN model can learn discriminative and common features. Regarding data augmentation, random horizontal flips 50% of the time were used to introduce variability into the training dataset. Within the inference process, we set a minimum detection confidence threshold of 85% (i.e., detections with a confidence level of less than 85% were ignored).
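The following sketch shows how the training setup described above could be expressed with the Matterport package. It is a hedged sketch rather than the exact script used in this work: the configuration class name, file paths, and the imported sweet_pepper_dataset module (the loader sketched in Section 2.2) are assumptions, while the hyperparameter values mirror those listed in this subsection.

```python
import imgaug.augmenters as iaa
from mrcnn.config import Config
from mrcnn import model as modellib

from sweet_pepper_dataset import SweetPepperDataset  # hypothetical module holding the loader from Section 2.2


class SweetPepperConfig(Config):
    NAME = "sweet_pepper"
    BACKBONE = "resnet101"             # ResNet-101 + FPN backbone
    NUM_CLASSES = 1 + 2                # background + fruit + peduncle
    IMAGES_PER_GPU = 1                 # batch size of 1 image
    STEPS_PER_EPOCH = 450
    LEARNING_RATE = 0.001
    LEARNING_MOMENTUM = 0.9
    WEIGHT_DECAY = 0.0001
    IMAGE_MIN_DIM = 1024
    IMAGE_MAX_DIM = 1024
    DETECTION_MIN_CONFIDENCE = 0.85    # ignore detections below 85% confidence at inference


dataset_train = SweetPepperDataset()
dataset_train.load_sweet_pepper("dataset", "train")
dataset_train.prepare()
dataset_val = SweetPepperDataset()
dataset_val.load_sweet_pepper("dataset", "val")
dataset_val.prepare()

config = SweetPepperConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")

# Transfer learning: start from the pre-trained COCO weights, skipping the class-specific heads.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc", "mrcnn_bbox", "mrcnn_mask"])

# Data augmentation: random horizontal flips 50% of the time; train only the head layers.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=100, layers="heads",
            augmentation=iaa.Fliplr(0.5))
```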

3. Results and Discussion

During the experiment, 1127 images of sweet peppers, in their most common varieties (green, red, yellow, and orange), focusing on green sweet peppers (more than 85% of the objects within the dataset are part of green sweet peppers), were selected for training (80% for the training dataset and 20% for the validation dataset). To verify the performance, stability, and reliability of the trained model, 100 images of sweet peppers (1148 fruits and 265 peduncles) were selected for the model evaluation. All sweet pepper fruit and peduncle targets within the images were intended to be detected, classified, and marked with category scores, bounding boxes, and instance segmentation masks.

3.1. Model Training and Loss Functions

We trained the Mask R-CNN model for 301 epochs only on its heads, in order to store the complete training and validation loss curves. At the end of each epoch, a file with the network weights up to that epoch is generated, which allows us to retrieve the weights of the model at any epoch. Training and validation loss values are stored each epoch via TensorBoard.
As can be seen in Figure 6, the validation loss reached its minimum at epoch 30 and then rebounded, whereas the training loss continued to decrease, which means that from epoch 30 the model tended to memorize the training data rather than learn to generalize the features of the fruits and peduncles of sweet peppers (also see Figure 7, Figure 8 and Figure 9). Figure 6 also shows an anomaly in the training loss curve: the abrupt changes of slope at epochs 25 and 45. This behavior is explained by a manual variation of the learning rate that we performed as a test: we changed the learning rate from 0.001 to 0.0001 at epoch 25, observed a slower decrease in the training loss values, and returned the learning rate to 0.001 at epoch 45. In general, all this shows that this implementation of the Mask R-CNN model, with the parameters and dataset provided, tends to overfit after epoch 30. The model reached a training loss value of 0.34, and it took about 3 days to complete training for all 301 epochs, approximately 14.4 min per epoch.
In this specific case, we selected the Mask R-CNN model trained for 100 epochs. This decision was made by evaluating the metrics on the 100 test images, since the model at this epoch showed the best results among the epochs considered (30, 49, 100, 150, 200, 250, and 301). As expected from the loss graphs, the metric values at the highest epoch were the worst, and the best values were found between epochs 49 and 150.
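For completeness, a hedged sketch of how a selected checkpoint can be reloaded for inference and timed with the Matterport API is shown below; the weight file path and module name are illustrative, and SweetPepperConfig refers to the training configuration sketched in Section 2.5.

```python
import time

import skimage.io
from mrcnn import model as modellib

from sweet_pepper_config import SweetPepperConfig  # hypothetical module holding the config from Section 2.5


class InferenceConfig(SweetPepperConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1  # one image at a time during inference


model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="logs")
# Load the weights saved at the selected epoch (epoch 100 in our case); the path is illustrative.
model.load_weights("logs/sweet_pepper/mask_rcnn_sweet_pepper_0100.h5", by_name=True)

image = skimage.io.imread("test/example.png")
start = time.time()
results = model.detect([image], verbose=0)
print(f"Inference time: {time.time() - start:.2f} s")

r = results[0]  # dict with per-instance 'rois', 'class_ids', 'scores', and 'masks'
print(r["class_ids"], r["scores"], r["masks"].shape)
```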

3.2. Experiment and Evaluation of Sweet Pepper Fruit and Peduncle Detection

The detection performance evaluation of the Mask R-CNN model for sweet pepper fruits and peduncles is shown in Figure 10, and the confusion matrix of the detection results for the 100 sample images is listed in Table 2. The results of the 100 test images showed that the overall precision, recall, and F1-score rates were 78.16%, 66.86%, and 71.89%, respectively. In addition, it achieved an average inference time of 1.18 s per image. In Table 3, we see the detailed results for each class of the trained Mask R-CNN model.
From Table 2, it is concluded that the detection of the fruit performs better than that of the peduncle. If we evaluate the classes as individual models, the sweet pepper fruit detection metrics are good and stable: a precision rate of 84.53%, a recall rate of 79.01%, and an F1-score of 81.67%, which are considered good values; in short, the detection of sweet pepper fruit is good in both quality and quantity.
Regarding the sweet pepper peduncles, if we again consider them as an isolated model, with a precision rate of 71.78%, a recall rate of 54.72%, and an F1-score of 62.10%, and compare these values with the results of previously reported sweet pepper peduncle detection models, ref. [15] is the only one that reports better values, but with a smaller dataset, which leads to less reliable results, and it also uses another type of technology. The models previously reported for the detection of sweet pepper peduncles use 3D rather than 2D information, which categorically separates the results obtained with each technology. Furthermore, even if the results of previously reported 3D detection methods for sweet pepper peduncles were reliable enough, their detection times do not come close to those of 2D technologies. A clear example is the time needed just to process the input data of the classification model of [14], around 10 s, compared to the 1.18 s required for the entire process in our implementation. This is why 2D detection implementations are more common, with the objective of bringing the processing time close to one second or less (real-time). Furthermore, previously reported models do not focus on simultaneous fruit and peduncle recognition, whereas our implementation does. In addition, our implementation focuses on the recognition of the peduncle and fruit of the green sweet pepper, the most complicated color variety, as confirmed by the results of [14,15]. Our dataset includes more than 14,600 objects belonging to the fruit and peduncle of the green sweet pepper, which corresponds to about 85% of the objects in our complete dataset.
The general results and main errors can be explained by analyzing the results by class, the dataset, and the complicated environments where the objects of interest are found. The sweet pepper fruit class yields positive, stable metrics within a good range, owing to the considerable number of objects and the large area per object within the dataset (see Table 1). In contrast, the results of the peduncle class are worse, owing to the smaller number of objects and the smaller area per object (see again Table 1). Peduncle-class objects number only about 21% of the fruit-class objects, and an individual peduncle object covers no more than 15% of the area of an individual fruit object. Object size and quantity play a significant role in the model's performance. Likewise, although it is well known that a sweet pepper must be harvested by cutting the peduncle, detecting the peduncle is very difficult in such complicated environments, with a background of the same color as the target, occlusion, and/or variable lighting, where even a human being has trouble detecting the fruit, which is approximately 6.5 times larger than the peduncle.
Overall, the model’s average performance on detection is in the mid-high range. A nearly good quality (detection about 80%), with decent quantity (recall about 70%), and stable (F1-score over 70%).

3.3. Evaluation of Instance Segmentation

Regarding the segmentation results of the 100 test images, the mean average precision at an intersection over union threshold of 50 percent (mAP@IoU=50) for fruits and peduncles reaches 72.64% (0.726), which surpasses the results of previous 2D implementations that simultaneously consider sweet pepper peduncles and fruits with a focus on the green variety. This metric reflects the fact that we use Mask R-CNN, which allows us to obtain a refined mask of the targets, that is, to delineate their outline in the image. The reason for using Mask R-CNN is that we intend to use this detection model in the next phase of autonomous sweet pepper harvesting, localization. The binary mask provided by Mask R-CNN tells us more precisely where the targets are. Common detection methods do not integrate instance segmentation, so they can only obtain a bounding box of the target; this is somewhat faster than obtaining an irregular binary mask but less accurate, since all they provide is the object enclosed by a rectangle, and there is no certainty that all the information contained in that bounding box belongs to the target. In a few words, the binary mask generated by Mask R-CNN gives more accurate information about where the object is, since it identifies solely and exclusively the pixels that belong to it, regardless of whether its shape is regular or irregular. In this way, with better detection of the sweet pepper fruit and peduncle, we will be able to generate better results for the localization of the sweet pepper in three-dimensional space, which leads to better results in the complete process of autonomous harvesting of green sweet pepper. More example images of sweet pepper fruit and peduncle instance segmentation are shown in Figure 11.
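As a reminder of why the mask-level comparison behind mAP@IoU=50 is stricter and more informative than a box-level one, the toy sketch below computes IoU directly on binary masks, counting only the pixels that actually belong to each object; the masks used here are synthetic examples, not data from this work.

```python
import numpy as np


def mask_iou(mask_a, mask_b):
    """Intersection over union between two boolean masks of the same shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union else 0.0


# Synthetic example: an irregular ground-truth mask vs. its own tight bounding box.
gt = np.zeros((100, 100), dtype=bool)
gt[20:80, 30:60] = True
gt[20:40, 30:45] = False              # carve a notch to make the shape irregular
box = np.zeros((100, 100), dtype=bool)
box[20:80, 30:60] = True              # the tight bounding box around the ground truth

print(f"IoU of bounding box vs. mask: {mask_iou(box, gt):.2f}")  # 0.83: the box includes background pixels
```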

3.4. Limitations

Regarding the limitations of our implementation, we highlight the following:
  • The proposed method is limited to performing simultaneous 2D detection of sweet pepper peduncles and fruits in their most common color varieties (green, yellow, orange, and red), focusing on green sweet pepper, the most complex color variety due to its resemblance to the environment, and in a greenhouse environment;
  • Dataset size (own dataset complemented to more than 14k objects): even though we consider more than 1k images and more than 14k annotated objects in our implementation, public datasets from other fields of research already have more samples and more annotated objects for training and evaluation;
  • Imbalanced dataset: we considered more than 14k fruit-type objects, but only 3k peduncle-type objects; a balanced dataset would help to further improve the results.

4. Conclusions

In this paper, a method based on Mask R-CNN for the detection of the fruit and peduncle of sweet pepper was obtained, focused on green sweet pepper, the most complicated color variety due to its resemblance to the background. The method is capable of working in real-time in a real production environment, with good metrics for the fruit, average metrics for the peduncle, and a segmentation metric value not previously reported for 2D simultaneous detection models of the fruit and peduncle of the green sweet pepper. The results of this work are summarized as follows:
(1)
A Mask R-CNN model which can automatically detect sweet pepper peduncles and fruits was trained in this paper, capable of obtaining bounding boxes and instance segmentation masks in an unmodified production environment under occlusion, overlap, and light variation conditions with performance metrics within the medium-high range. The results of the detection of the fruit and peduncle of 100 test images show, in general, that the precision, recall, and F1-score rates were 78.16%, 66.86%, and 71.89%, respectively.
(2)
The proposed implementation manages to obtain a mean average precision (mAP) rate of 0.726, not previously reported for simultaneous 2D detection models of peduncles and fruits of green sweet pepper.
(3)
More specifically, the results for the fruit class were the following: 84.53%, 79.01%, and 81.67% for precision, recall, and F1-score rates. The trained model is particularly effective for detecting green sweet pepper fruits, the most difficult color variety due to background resemblance, in occluded, overlapping scenarios, and a real production environment.
(4)
The results for the peduncle class were as follows: 71.78%, 54.72%, and 62.10% for precision, recall, and F1-score rates. These metrics fall in the average range. In addition, the values obtained can be explained due to the number of peduncle-type objects, and individual areas existing in the dataset, in contrast to those of the fruit-type objects.
(5)
The model achieves an average inference time of 1.18 s per image. The model manages to approach the real-time requirement for autonomous sweet pepper harvesting, drastically surpassing the times obtained by previously reported green sweet pepper peduncle and fruit detection models.
At present, the method used in this paper can perform real-time fruit and peduncle detection of the most common color varieties of sweet pepper (red, green, yellow, and orange), focusing on green sweet pepper, with overall performance in the medium-high range: good for the fruit and average for the peduncle, owing to the existing imbalance in the number of objects per class and the unchangeable difference in area per class object.
In future research, due to the results obtained, the following points will be sought to address:
(1)
Improvement of model hyperparameters: time is key in model improvement; experimentation with different backbones, numbers of epochs, learning rates, and optimizers is crucial to improving overall results.
(2)
Integration into a complete sweet pepper harvesting framework: the generation of this sweet pepper detection method always aimed to integrate it into a complete functional implementation.
(3)
Dataset growth: it is widely known that the more examples of an object we have, the easier it will be to recognize the characteristics that distinguish it from its environment.
(4)
Improvement of image quality: although we use high-resolution images, the complexity of environments where specific features must be captured to differentiate the targets from the background (overlapping of fruits, peduncles, leaves, and stems of the main crop, light intensity variation, shadows, among others) typically benefits from a greater focus on image quality, so various methods for improving the quality of the input image [42,43,44,45] could be of great help in obtaining better accuracy in future implementations.
(5)
Improvement of the model structure: in recent years, visual attention modules and layers have been used to improve the results of computer vision and deep learning implementations. In the case of target detection, they have allowed feature maps to respond much more strongly to target features; ergo, such layers and modules could significantly help to improve the detection metrics of future implementations of the proposed method.

Author Contributions

Conceptualization, J.D.L.-B., J.A.E.C. and L.-E.M.-C.; methodology, J.D.L.-B., J.A.E.C., A.G.-E. and L.-E.M.-C.; software, J.D.L.-B.; validation, J.D.L.-B.; formal analysis, J.D.L.-B., J.A.E.C. and A.G.-E.; investigation, J.D.L.-B.; resources, J.A.E.C.; data curation, J.D.L.-B.; writing—original draft preparation, J.D.L.-B.; writing—review and editing, J.D.L.-B., J.A.E.C., A.G.-E. and L.-E.M.-C.; visualization, J.D.L.-B.; supervision, J.A.E.C., A.G.-E. and L.-E.M.-C.; project administration, J.A.E.C. and A.G.-E.; funding acquisition, J.A.E.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available on GitHub at [46].

Acknowledgments

The authors would like to acknowledge Zhao Keyuan for his support in labeling some of the masks in the images, and CAET Truper (Centro de Apoyo Educativo Truper) for the grant awarded to support the first author's studies.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tang, Y.; Chen, M.; Wang, C.; Luo, L.; Li, J.; Lian, G.; Zou, X. Recognition and Localization Methods for Vision-Based Fruit Picking Robots: A Review. Front. Plant Sci. 2020, 11, 510. [Google Scholar] [CrossRef]
  2. Zhou, X.; Ma, Y.; Dai, X.; Li, X.; Yang, S. Spread and Industry Development of Pepper in China. Acta Hortic. Sinica 2020, 47, 1715–1726. Available online: http://www.ahs.ac.cn/EN/10.16420/j.issn.0513-353x.2020-0103 (accessed on 6 October 2022).
  3. Montoya-Cavero, L.-E.; Díaz de León Torres, R.; Gómez-Espinosa, A.; Escobedo Cabello, J.A. Vision Systems for Harvesting Robots: Produce Detection and Localization. Comput. Electron. Agric. 2022, 192, 106562. [Google Scholar] [CrossRef]
  4. Zitter, L. Berry Picking at Its Best with AGROBOT Technology. 2019. Available online: https://www.farmingtechnologytoday.com/news/autonomous-robots/berry-picking-at-its-best-with-agrobot-technology.html (accessed on 6 October 2022).
  5. Ji, C.; Feng, Q.C.; Yuan, T.; Tan, Y.Z.; Li, W. Development and performance analysis on cucumber harvesting robot system in greenhouse. Robot 2011, 33, 726–730. [Google Scholar]
  6. Thorne, J. Apple-Picking Robots Gear Up for U.S. Debut in Washington State. 2019. Available online: https://www.geekwire.com/2019/apple-picking-robots-gear-u-s-debut-washington-state/ (accessed on 6 October 2022).
  7. Saunders, S. The Robots That Can Pick Kiwi-Fruit. 2022. Available online: https://www.bbc.com/future/bespoke/follow-the-food/the-robots-that-can-pick-kiwifruit.html (accessed on 6 October 2022).
  8. Arad, B.; Balendonck, J.; Barth, R.; Ben-Shahar, O.; Edan, Y.; Hellström, T.; Hemming, J.; Kurtser, P.; Ringdahl, O.; Tielen, T.; et al. Development of a sweet pepper harvesting robot. J. Field Robot. 2020, 37, 1027–1039. [Google Scholar] [CrossRef]
  9. Barnea, E.; Mairon, R.; Ben-Shahar, O. Colour-agnostic shape-based 3D fruit detection for crop harvesting robots. Biosyst. Eng. 2016, 146, 57–70. [Google Scholar] [CrossRef]
  10. Vitzrabin, E.; Edan, Y. Adaptive thresholding with fusion using a RGBD sensor for red sweet-pepper detection. Biosyst. Eng. 2016, 146, 45–56. [Google Scholar] [CrossRef]
  11. Song, Y.; Glasbey, C.; Horgan, G.; Polder, G.; Dieleman, J.; van der Heijden, G. Automatic fruit recognition and counting from multiple images. Biosyst. Eng. 2014, 118, 203–215. [Google Scholar] [CrossRef]
  12. Lee, B.; Kam, D.; Min, B.; Hwa, J.; Oh, S. A Vision Servo System for Automated Harvest of Sweet Pepper in Korean Greenhouse Environment. Appl. Sci. 2019, 9, 2395. [Google Scholar] [CrossRef]
  13. Yu, Y.; Zhang, K.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN. Comput. Electron. Agric. 2019, 163, 104846. [Google Scholar] [CrossRef]
  14. Sa, I.; Lehnert, C.; English, A.; McCool, C.; Dayoub, F.; Upcroft, B.; Perez, T. Peduncle detection of sweet pepper for autonomous crop harvesting—Combined color and 3-D information. IEEE Robot. Autom. Lett. 2017, 2, 765–772. [Google Scholar] [CrossRef]
  15. Li, H.; Huang, M.; Zhu, Q.; Guo, Y. Peduncle Detection of Sweet Pepper Based on Color and 3D Feature; ASABE: St. Joseph, MI, USA, 2018; p. 1. [Google Scholar] [CrossRef]
  16. Zhang, F.; Gao, J.; Zhou, H.; Zhang, J.; Zou, K.; Yuan, T. Three-Dimensional Pose Detection method Based on Keypoints Detection Network for Tomato Bunch. Comput. Electron. Agric. 2022, 195, 106824. [Google Scholar] [CrossRef]
  17. Lufeng, L.; Wei, Y.; Zhengtong, N.; Jinhai, W.; Huiling, W.; Weilin, C.; Qinghua, L. In-field pose estimation of grape clusters with combined point cloud segmentation and geometric analysis. Comput. Electron. Agric. 2022, 200, 107197. [Google Scholar] [CrossRef]
  18. Wang, J.; Zhang, Z.; Luo, L.; Wei, H.; Wang, W.; Chen, M.; Luo, S. DualSeg: Fusing Transformer and CNN Structure for Image Segmentation in Complex Vineyard Environment. Comput. Electron. Agric. 2023, 206, 107682. [Google Scholar] [CrossRef]
  19. Benavides, M.; Cantón-Garbín, M.; Sánchez-Molina, J.A.; Rodríguez, F. Automatic Tomato and Peduncle Location System Based on Computer Vision for Use in Robotized Harvesting. Appl. Sci. 2020, 10, 5887. [Google Scholar] [CrossRef]
  20. Zu, L.; Zhao, Y.; Liu, J.; Su, F.; Zhang, Y.; Liu, P. Detection and Segmentation of Mature Green Tomatoes Based on Mask R-CNN with Automatic Image Acquisition Approach. Sensors 2021, 21, 7842. [Google Scholar] [CrossRef] [PubMed]
  21. Rong, J.; Dai, G.; Wang, P. A peduncle detection method of tomato for autonomous harvesting. Complex Intell. Syst. 2021, 8, 2955–2969. [Google Scholar] [CrossRef]
  22. Koirala, A.; Walsh, K.B.; Wang, Z.; McCarthy, C. Deep learning for real-time fruit detection and orchard fruit load estimation: Benchmarking of ‘MangoYOLO’. Precis. Agric. 2019, 20, 1107–1135. [Google Scholar] [CrossRef]
  23. Kamilaris, A.; Prenafeta-Boldu, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
  24. Dias, P.A.; Tabb, A.; Medeiros, H. Apple flower detection using deep convolutional networks. Comput. Ind. 2018, 99, 17–28. [Google Scholar] [CrossRef]
  25. Yuncheng, Z.; Tongyu, X.; Wei, Z.; Hanbing, D. Classification and recognition approaches of tomato main organs based on DCNN. Trans. Chin. Soc. Agric. Eng. 2017, 33, 219–226. [Google Scholar] [CrossRef]
  26. Fu, L.; Feng, Y.; Tola, E.; Liu, Z.; Li, R.; Cui, Y. Image recognition method of multi-cluster kiwifruit in field based on convolutional neural networks. Trans. Chin. Soc. Agric. Eng. 2018, 34, 205–211. [Google Scholar] [CrossRef]
  27. Sa, I.; Ge, Z.; Dayoub, F.; Upcroft, B.; Perez, T.; McCool, C. DeepFruits: A Fruit Detection System Using Deep Neural Networks. Sensors 2016, 16, 1222. [Google Scholar] [CrossRef] [PubMed]
  28. Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 2019, 157, 417–426. [Google Scholar] [CrossRef]
  29. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  30. Montoya Cavero, L.E. Sweet Pepper Recognition and Peduncle Pose Estimation. Master’s Thesis, Instituto Tecnológico y de Estudios Superiores de Monterrey, Monterrey, Nuevo León, México, 3 December 2021. Available online: https://hdl.handle.net/11285/648430 (accessed on 6 October 2022).
  31. Dutta, A.; Zisserman, A. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2276–2279. [Google Scholar] [CrossRef]
  32. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
  33. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 91–99. [Google Scholar] [CrossRef]
  34. He, K.; Zhang, X.; Ren, S. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar] [CrossRef]
  35. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar] [CrossRef]
  36. Zhang, W.; Witharana, C.; Liljedahl, A.K.; Kanevskiy, M. Deep Convolutional Neural Networks for Automated Characterization of Arctic Ice-Wedge Polygons in Very High Spatial Resolution Aerial Imagery. Remote Sens. 2018, 10, 1487. [Google Scholar] [CrossRef]
  37. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  38. Ting, K.M. Confusion Matrix. In Encyclopedia of Machine Learning and Data Mining; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2017; p. 260. [Google Scholar] [CrossRef]
  39. Yang, Q.; Xiao, D.; Lin, S. Feeding behavior recognition for group-housed pigs with the Faster R-CNN. Comput. Electron. Agric. 2018, 155, 453–460. [Google Scholar] [CrossRef]
  40. Chinchor, N. MUC-4 Evaluation Metrics. In Proceedings of the MUC4 92: Conference on Message Understanding, Stroudsburg, PA, USA, 16–18 June 1992; pp. 22–29. [Google Scholar] [CrossRef]
  41. Abdulla, W. Mask R-CNN for Object Detection and Instance Segmentation on Keras and TensorFlow. GitHub Repos. 2017. Available online: https://github.com/matterport/Mask_RCNN (accessed on 1 November 2022).
  42. Min, X.; Gu, K.; Zhai, G.; Liu, J.; Yang, X.; Chen, C.W. Blind Quality Assessment Based on Pseudo Reference Image. IEEE Trans. Multimed. 2017, 20, 2049–2062. [Google Scholar] [CrossRef]
  43. Min, X.; Zhai, G.; Gu, K.; Liu, Y.; Yang, X. Blind Image Quality Estimation via Distortion Aggravation. IEEE Trans. Broadcast. 2018, 64, 508–517. [Google Scholar] [CrossRef]
  44. Zhai, G.; Min, X. Perceptual Image Quality Assessment: A Survey. Sci. China Inf. Sci. 2020, 63, 211301. [Google Scholar] [CrossRef]
  45. Min, X.; Zhai, G.; Zhou, J.; Farias, M.C.; Bovik, A.C. Study of Subjective and Objective Quality Assessment of Audio-Visual Signals. IEEE Trans. Image Process. 2020, 29, 6054–6068. [Google Scholar] [CrossRef]
  46. López-Barrios, J.D. Green Sweet Pepper Detection Using Mask R-CNN in Greenhouses Documentation. GitHub Repos. 2022. Available online: https://github.com/dassdinho/green_sweet_pepper_detection_using_mask_rcnn (accessed on 31 December 2022).
Figure 1. Fruits and peduncles of green sweet peppers in a greenhouse with variable light conditions and the presence of occlusion by other crops and leaves.
Figure 2. View from inside the sweet pepper greenhouses of the Experimental Agricultural Field of the Tec de Monterrey campus Querétaro (CAETEC).
Figure 3. Image acquisition diagram.
Figure 4. Example of instance segmentation sweet pepper dataset with computed bounding boxes (ground truth): (a) original image; (b) visualization of mask image with computed bounding boxes.
Figure 5. Mask R-CNN architecture for sweet pepper detector training.
Figure 6. Curves of the training and validation loss values per epoch for the Mask R-CNN model. Validation loss value at its minimum in epoch 30.
Figure 7. Loss graph for Mask R-CNN model. Training and validation Mask R-CNN mask loss values per epoch.
Figure 8. Loss graphs for Mask R-CNN model: (a) Mask R-CNN bounding box refinement loss; (b) Mask R-CNN classifier loss.
Figure 9. Loss graphs for Mask R-CNN model: (a) RPN bounding box loss; (b) RPN anchor classifier loss.
Figure 10. Sweet pepper fruits and peduncles detection and instance segmentation masks display: (a,c,e) Original images; (b,d,f) Images with resulting detections.
Figure 11. More examples of visualization of sweet pepper fruit and peduncle instance segmentation: (a,c,e) ground truth; (b,d,f) results.
Table 1. Comparison of the number of class objects per dataset.

Class Object Types | Base Dataset | Final Complemented Dataset
Total fruits | 5324 | 14,194
Train fruit | 4294 | 11,437
Val fruit | 1030 | 2757
Total peduncles | 1097 | 2993
Train peduncle | 899 | 2385
Val peduncle | 198 | 608
Table 2. Confusion matrix of the Mask R-CNN sweet pepper fruit and peduncle model.

Ground Truth | Predicted: Fruit | Predicted: Peduncle | Predicted: Background
Fruit | 907 | 0 | 241
Peduncle | 0 | 145 | 120
Background | 166 | 57 | /
Table 3. Precision rate, recall rate, and F1-score of the trained Mask R-CNN sweet pepper fruit and peduncle model.

Evaluation Parameter | Fruit | Peduncle | Overall
Precision rate (%) | 84.53 | 71.78 | 78.16
Recall rate (%) | 79.01 | 54.72 | 66.86
F1-score (%) | 81.67 | 62.10 | 71.89

