Review

An Overview of the Application of Machine Vision in Recognition and Localization of Fruit and Vegetable Harvesting Robots

1 Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
2 Science Island Branch, University of Science and Technology of China, Hefei 230026, China
3 Institute of Computer Science, Chinese Academy of Sciences, Beijing 100190, China
* Authors to whom correspondence should be addressed.
Agriculture 2023, 13(9), 1814; https://doi.org/10.3390/agriculture13091814
Submission received: 4 August 2023 / Revised: 30 August 2023 / Accepted: 12 September 2023 / Published: 14 September 2023
(This article belongs to the Section Digital Agriculture)

Abstract:
Intelligent agriculture imposes higher requirements on the recognition and localization capabilities of fruit and vegetable picking robots. Because it provides rich visual information at relatively low hardware cost, machine vision is widely applied to the recognition and localization of fruit and vegetable picking robots, and this article provides an overview of that application. First, the advantages, disadvantages, and roles of different visual sensors and machine vision algorithms in the recognition and localization of fruit and vegetable picking robots are introduced, including monocular cameras, stereo cameras, structured light cameras, multispectral cameras, image segmentation algorithms, object detection algorithms, and 3D reconstruction algorithms. Then, the current status of machine vision in this task and the challenges it faces are summarized. These challenges include the stability of fast recognition under complex background interference, the stability of recognition of the same crop under different lighting environments, the reliance of recognition and localization on prior information when fruits overlap or are occluded by leaves and branches, and the uncertainty of picking caused by complex working environments. Current algorithms for handling complex background interference and various occlusions have achieved good results, whereas different lighting environments still have a significant impact on the recognition and positioning of fruits and vegetables, with reported accuracies falling as low as 59.2%. Finally, this article outlines future research directions to address these challenges.

1. Introduction

As crop yields increase year by year, the labor cost for complex harvesting tasks has significantly risen. To cope with this trend, research on agricultural equipment is increasingly inclined toward unmanned solutions. Both domestic and international studies have been conducted on robotic harvesting for a variety of crops, including apples, oranges, kiwis, lychees, strawberries, grapes, mangoes, tomatoes, cucumbers, and more. This field of research is attracting more attention because robotic harvesting of fruits and vegetables can reduce harvesting costs and provide a new pathway for mechanized picking.
However, robotic harvesting of fruits and vegetables faces complex work environments and numerous uncertainties, making it difficult to meet the requirements for accuracy, speed, and stability in practical applications, thereby increasing the difficulty of harvesting. To achieve efficient and rapid fruit and vegetable picking operations, precise and stable identification and positioning systems are required [1]. Such systems are mainly based on machine vision, which refers to the combination of computational algorithms and image information. This includes the application of visual sensors and machine vision algorithms [2]. Visual sensors are used to acquire image information and the three-dimensional positioning of targets. Common visual sensors include monocular cameras [3], stereo cameras [4], structured cameras [5], and multispectral cameras [6]. Machine vision algorithms are used for target recognition. Common algorithms include image segmentation, object detection, and three-dimensional reconstruction. Image segmentation algorithms can segment target objects in complex scenes [7], object detection algorithms can detect target objects and other sources of interference in images in a timely manner [8], while three-dimensional reconstruction algorithms can convert the two-dimensional image information obtained by the camera into three-dimensional spatial information of the target [9].
The accuracy and stability of the identification and positioning system directly affect the operational efficiency of harvesting robots, determine whether crops are mispicked or damaged, and influence mechanical vibration and collision damage to the robots themselves. In practical harvesting operations, the precision of target recognition and localization is influenced by various factors, and the problem needs to be considered from multiple perspectives. (1) Interference from complex work environments, including complex backgrounds [10], variations in natural lighting [11], and fruit overlap and obstruction by branches and leaves [12]. (2) The real-time performance of target detection and localization. (3) Harvesting robots can only recognize a limited range of targets based on pre-trained models; the algorithms therefore need stronger generalization abilities to detect multiple types of fruit. (4) The limited recognition accuracy of visual sensors, which lowers the precision of subsequent three-dimensional matching and target positioning [13]. (5) Complex and variable working conditions, which require greater robustness of the robot control system. Machine vision thus faces significant challenges in the recognition and positioning of fruit and vegetable harvesting robots, and accurate, applicable fruit and vegetable identification and positioning are crucial.
Although there have been reviews of the target recognition and localization technology of harvesting robots [14,15,16,17,18], the main contributions of this paper are as follows: (1) It focuses on the problems faced by fruit and vegetable harvesting robots in recognition and positioning, and systematically summarizes and compares the various solution methods, evaluating their applicability. (2) It provides a more comprehensive overview, introducing and analyzing a wider range of methods, including the application of three-dimensional reconstruction algorithms to target recognition and positioning, and highlights the improvement strategies, advantages, and disadvantages of the YOLO model, a major object detection algorithm. (3) It further summarizes target recognition and positioning technology for fruit and vegetable harvesting robots and provides an outlook, extending the discussion beyond the initial integration of recognition and positioning techniques with individual sensors.
This paper introduces the progress and representative research achievements in the field of target recognition and localization for fruit picking robots in recent years. It is elaborated in the following five aspects: (1) The introduction of various visual sensors, their advantages and disadvantages, and their roles in recognition and localization, including when combined with other sensors. (2) The introduction of the advantages and disadvantages of image segmentation algorithms, object detection algorithms, and target three-dimensional reconstruction algorithms in the recognition and localization of fruit picking robots. (3) An overview of the current situation of machine vision in indoor and outdoor greenhouse environments for recognition and localization purposes. (4) A summary of the challenges faced by machine vision in the recognition and localization of fruit picking robots, including stability in fast recognition under complex background interference, stability in recognition under different lighting conditions for the same crop, reliance on prior information for recognition and localization when fruits are overlapping or obstructed by leaves and branches, and uncertainty in harvesting due to complex working environments. (5) Concluding remarks and future research directions are provided.

2. Vision Recognition and Positioning System for Fruit and Vegetable Harvesting Robots

2.1. Visual Sensors

As shown in Figure 1A, visual sensors can be divided into 2D image sensors and 3D image sensors according to whether they also acquire depth information. The information obtained by 2D image sensors includes color, shape, texture, and other morphological characteristics. In contrast, 3D image sensors can extract the three-dimensional information of complex target fruits and vegetables, obtaining richer data such as the spatial coordinates of the target fruits and vegetables. Common visual sensors include monocular cameras, stereo cameras, structured light cameras, and multispectral cameras. Among the 49 publications selected for this review, structured light cameras are the most commonly used visual sensors, as shown in Figure 1B.
Figure 1. (A) Classification of visual sensors. (B) Number of publications selected for each type of visual sensor. (C) Census algorithm flow [4]. (D) Three-dimensional information acquisition model [4]. (E) Basic principles of depth information acquisition by structured light cameras [5]. (F) FPGA implementation of stereo matching [19].
The selection of visual sensors for fruit and vegetable harvesting robots therefore varies with the growing environment. Table 1 summarizes the application principles, advantages, and disadvantages of the visual sensing technologies commonly used in these robots.

2.1.1. Monocular Camera

In the field of agriculture, monocular cameras have been widely used for target localization in harvesting robots. The main challenge in monocular 3D object detection is the lack of depth information to infer the distance of objects [3]; combining the camera with suitable vision algorithms can compensate for this weakness. The Beijing Institute of Intelligent Agriculture Equipment utilizes a monocular camera to calculate the accurate coordinates of peripheral fruits by measuring the deviation between the top fruit center and the image center [20] (as shown in Figure 2A). KU Leuven in Belgium has designed an apple harvesting robot [21] that uses a monocular camera mounted at the center of a flexible gripper, ensuring the alignment between the gripper and the camera, simplifying the image-to-robot-end-effector coordinate transformation and reducing image distortion (as shown in Figure 2B). When approaching an apple, the remaining distance to the apple is calculated through triangulation. The Graduate School of Agricultural Science at Iwate University [22] uses a monocular camera to locate the center of the apple fruit and the detachment layer of the stem under natural lighting conditions, establishing their geometric relationship with a success rate of over 90%. Zhao et al. [23] identify apples using a monocular camera based on color and texture features, achieving an accuracy rate of 90%. The Industrial and Systems Engineering Department at the University of Florida employs a single-camera system to obtain the three-dimensional position of citrus fruits for localization [24]. Its characteristics are shown in Table 1. In the future, the application of monocular cameras in agriculture will need to be further improved by combining them with other technologies.
Table 1. Common visual sensors used in picking robots and their application principles and advantages and disadvantages.
Monocular camera. Applications and principles: color, shape, texture, and other features. Advantages: simple system structure and low cost; several monocular cameras can be combined into a multi-camera system. Disadvantages: captures only two-dimensional image information, has poor stability, and cannot be used in dark or low-light conditions [25].
Stereo camera. Applications and principles: texture, color, and other features; the spatial coordinates of the target are obtained through triangulation. Advantages: matching efficiency can be improved by combining algorithms, and three-dimensional coordinate information can be obtained. Disadvantages: requires high sensor calibration accuracy, stereo matching computation takes a long time, and determining the three-dimensional position of edge points is challenging.
Structured light camera. Applications and principles: three-dimensional features obtained from the reflection of structured light by the measured object. Advantages: the three-dimensional features are not easily affected by background interference, giving better positioning accuracy. Disadvantages: sunlight can cancel out most of the projected infrared pattern, and the cost is high.
Multispectral camera. Applications and principles: targets identified from differences in radiation characteristics across wavelength bands. Advantages: not easily affected by environmental interference. Disadvantages: requires heavy computational processing, making it unsuitable for real-time picking operations.
Monocular 3D object detection lacks depth information, so some researchers combine monocular vision with LiDAR for recognition and localization tasks. Cao [26] uses two subsystems, LiDAR and monocular visual SLAM, for pose estimation; by detecting entire line segments, the visual subsystem overcomes the limitation of the LiDAR subsystem, which can only perform local calculations on geometric features, and adjusting the direction of linear feature points yields a more accurate odometry system and thus more accurate pose estimation. Shu [27] proposes a recognition and localization method that includes LiDAR feature extraction, visual feature extraction and tracking, and visual feature depth recovery; by fusing inter-frame visual feature tracking with LiDAR feature matching, a frame-to-frame odometry module is established for rough pose estimation. Zhang [28] designs a visual system trained on a database obtained by coupling LiDAR measurements with a complementary 360-degree camera. Cheng [29] divides the depth map obtained from LiDAR and monocular images into sub-images according to distance and treats each sub-image as an individual image for feature extraction; each sub-image contains only pixels within a learned depth interval, yielding rich pixel depth information. Ma [30] uses the original image to generate a pseudo-LiDAR point cloud and a bird's-eye view, and then feeds the fused original image and pseudo-LiDAR data into a keypoint-based network for initial 3D box estimation.
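As a minimal illustration of how monocular images and LiDAR can be fused, the sketch below projects LiDAR points into the image plane so that sparse depth can be associated with pixels. It assumes a known camera intrinsic matrix K and a LiDAR-to-camera extrinsic transform (R, t); all names are illustrative and not taken from the cited systems.

```python
import numpy as np

def project_lidar_to_image(points_lidar, K, R, t, image_shape):
    """Project LiDAR points (N, 3) into the image to obtain a sparse depth map.

    points_lidar: XYZ points in the LiDAR frame.
    K: (3, 3) camera intrinsic matrix.
    R, t: rotation (3, 3) and translation (3,) from the LiDAR to the camera frame.
    """
    # Transform points into the camera frame and keep those in front of the camera.
    pts_cam = points_lidar @ R.T + t
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Perspective projection: u = fx*X/Z + cx, v = fy*Y/Z + cy.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    h, w = image_shape
    depth = np.zeros((h, w), dtype=np.float32)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[valid], u[valid]] = pts_cam[valid, 2]  # sparse depth in metres
    return depth
```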

2.1.2. Stereo Camera

Stereo cameras capture depth information by mimicking the parallax of human eyes. They use stereo matching algorithms to determine the spatial positions of picking points (as shown in Figure 1C,D). The stereo matching algorithm computes the disparity as the difference between the horizontal coordinates of matching points in the left and right images; triangulation then combines the disparity with the baseline (the distance between the two cameras) and the focal length to compute the distance between feature points, such as branch skeleton points, and the cameras [4].
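The triangulation step can be expressed compactly: for a rectified stereo pair with focal length f (in pixels), baseline B (in metres), and disparity d (in pixels), the depth is Z = f·B/d. A minimal sketch follows; the variable names and numbers are illustrative only.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Depth of a matched point from a rectified stereo pair.

    x_left, x_right: horizontal pixel coordinates of the same point in the
    left and right images; focal_px: focal length in pixels; baseline_m:
    distance between the two camera centres in metres.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("matched point must have positive disparity")
    return focal_px * baseline_m / disparity

# Example: f = 700 px, B = 0.12 m, disparity = 14 px -> Z = 6.0 m
print(depth_from_disparity(520.0, 506.0, 700.0, 0.12))
```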
Based on the stereo vision approach, the depth can be determined by utilizing the disparity between two monocular cameras [13]. The University of Florida proposed the use of multiple monocular cameras for fruit localization in the context of robotic harvesting [31]. Edan [32] developed a stereo vision system using two monocular cameras for the detection and positioning of watermelons. The Queensland University of Technology in Australia [33] captured images of dynamic clusters of lychee using two monocular cameras and employed stereo vision matching to calculate the spatial positions of interfered lychee clusters for picking (as shown in Figure 2C,F). The depth accuracy range for the visual localization of lychee picking points was 0.4 cm to 5.8 cm. As shown in Figure 2D, the Chinese Academy of Agricultural Mechanization Sciences [34] used a mono-stereo vision sensor system to obtain the three-dimensional coordinates of apples. Mrovlje [35] replaced two monocular cameras with a binocular stereo camera system to calculate the position of objects using stereo vision matching.
Pal [36] uses the continuous triangulation method to generate a three-dimensional point cloud of the stereo camera. Wei [37] obtains the three-dimensional information of apple tree obstacles through a binocular stereo camera. The binocular stereo system used by Wuhan University of Technology [19] is implemented with the census stereo matching algorithm and FPGA, as shown in Figure 1F. Guo [38] captures images of lychee clusters using a binocular stereo vision system to obtain feature information such as the centroid and minimum bounding rectangle of lychee fruits, thereby determining the picking points. Jiang [39] conducts recognition and localization research on tomatoes and oranges in greenhouse environments using a binocular stereo vision system. The Dutch Greenhouse Technology Company [40] calculates the three-dimensional coordinates of cucumber stems using a stereo vision system. The Institute of Agricultural and Environmental Engineering [41] utilizes a binocular stereo vision system to capture images of cucumber scenes and, thus, calculate the three-dimensional positions of cucumbers. Shown in Figure 2G,H is a cucumber harvesting robot.
The advantages and disadvantages of stereo vision methods are shown in Table 1. When measuring targets, the depth of target edges may be lost, and this phenomenon worsens as the distance decreases. Therefore, it is difficult to accurately determine the three-dimensional positions of key points placed on target edges.
Figure 2. (A) Tomato harvesting robot [20]. (B) Apple picking robot [21]. (C) Image processing of Lychee clusters in the orchard. (a,c) Lychee image, (b,d) calculation of picking points [33]. (D) Recognition and positioning system for harvesting robots [34]. (E) Components of harvesting robots [42]. (F) Lychee harvesting robot [33]. (G) Cucumber harvesting robot in the IMAG greenhouse [41]. (H) Cucumber harvesting robot in the IMAG greenhouse [41]. (I) Components of citrus harvesting robot [43]. (J) Hyperspectral imaging system [44]. (K) Region image acquisition [45]. (L) Tomato cluster harvesting robot system [46]. (M) NIR image [47]. (N) Visible spectrum image [47]. (O) Camera capturing pepper images with different levels of occlusion [47]. (P) Recognition and positioning in the orchard [48]. (Q) Illumination system layout with three lamps [49]. (R) Illumination system layout with three lamps [49]. (S) Strawberry end effector [50].

2.1.3. Structured Camera

A structured light camera is a type of camera device that can capture surface details and geometric shapes of objects. It consists of one or more monocular cameras and a projector. Figure 1E illustrates the basic principle of obtaining depth information using a structured light camera [5]. It projects a series of known patterns onto the scene and establishes correspondences between the projected and captured patterns, obtaining depth information from the degree of pattern deformation [51,52]. Structured light camera systems are widely used in the field of robotics [53]. They have the ability to measure depth with high precision and accuracy, as well as high speed [54], and can obtain the three-dimensional coordinates of objects from the depth information in images [55].
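This last step reduces to back-projecting a pixel and its depth through the camera intrinsics. A minimal sketch is given below; the intrinsic values fx, fy, cx, cy are placeholders that would normally come from calibration, not from any cited system.

```python
def pixel_to_camera(u, v, depth_m, fx, fy, cx, cy):
    """Convert a pixel (u, v) with depth (metres) into camera-frame XYZ."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m

# Example with hypothetical intrinsics of a depth camera.
print(pixel_to_camera(700, 300, 0.85, 615.0, 615.0, 640.0, 360.0))
```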
The National Center of Research and Development for Harvesting Robots in Japan [42] utilizes a structured light camera to determine the three-dimensional positions of picking targets (as shown in Figure 2E). The structured light camera is equipped with two infrared cameras, an infrared projector, and an RGB camera. It uses the two infrared cameras as a stereo camera to provide depth information and overlays color information onto the depth information using the RGB camera. The infrared projector projects a pattern to improve the accuracy of the depth information. However, sunlight interference in outdoor daytime environments washes out most of the projected infrared pattern.
Sa [1] combined detection results from color and near-infrared images to build a deep-learning-based fruit detection and localization system. Shimane University [56] constructed a cherry tomato harvesting robot equipped with a structured light camera that uses three position-sensitive devices to detect the beams reflected by the crops; by scanning laser beams, the shape and position of the crops are determined. Wang Z. [57] used a structured camera to measure the bounding-box dimensions of mangoes. Rong [58] employed a structured camera to detect the pedicels of tomatoes. South China University of Technology [46] utilized a structured camera for tomato target recognition and localization, as shown in Figure 2L; they obtained the optimal depth value of the harvesting point by comparing the difference between the average depth and the original depth value of the picking point, and the success rate of tomato harvesting point localization reached 93.83%. Nankai University [59] used a structured camera to acquire depth information on vegetables. The Henan University of Science and Technology [60] performed edge recognition of leafy vegetables using a RealSense D415 depth camera; they extracted coordinate values in pixel coordinates and converted the pixel coordinates of the seedling edge to extremum coordinates. The researchers noted that the RealSense D415 depth camera had calibration errors and that these errors increased with the distance between the measured object and the camera. A ZED depth camera consists of left and right lenses with parallel optical axes [61]. The Graduate School of Science and Technology at the University of Tsukuba [62] used the ZED depth camera to obtain image information of pear orchards; the right lens produced depth images, while the left lens produced 4-channel RGB original images. When the distance to the pears was less than 50 cm, the ZED depth camera could obtain the distance and position of the pears through its depth function. Yang [43] used the Kinect V2 depth camera to obtain the three-dimensional coordinates of citrus picking targets. As shown in Figure 2I, they established the conversion from the pixel coordinate system to the camera coordinate system by calibrating the camera to obtain the intrinsic and extrinsic parameter matrices. By setting the top-left and bottom-right points of the bounding box as the target points, they determined the three-dimensional coordinates of the target points; their harvesting success rate reached 80.51%.
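For reference, commodity depth cameras such as the RealSense D415 mentioned above expose depth readout and pixel deprojection directly through their SDK. The following is a minimal sketch using the pyrealsense2 Python wrapper; the stream settings and the queried pixel are arbitrary examples, not values from the cited works.

```python
import pyrealsense2 as rs

# Minimal depth acquisition from a RealSense-style depth camera.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()
    depth_frame = frames.get_depth_frame()
    # Distance in metres at the image centre pixel.
    distance_m = depth_frame.get_distance(320, 240)
    # Deproject that pixel to a 3D point in the camera frame using the factory intrinsics.
    intrinsics = depth_frame.profile.as_video_stream_profile().intrinsics
    point_xyz = rs.rs2_deproject_pixel_to_point(intrinsics, [320, 240], distance_m)
    print(distance_m, point_xyz)
finally:
    pipeline.stop()
```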

2.1.4. Multispectral Camera

With the development of spectral imaging technology, multispectral cameras have been used for fruit identification. By integrating spectral and imaging technologies in one system, multispectral cameras can obtain monochrome images at continuous wavelengths [6]. Safren [44] used a hyperspectral camera to detect green apples in the visible and near-infrared regions, as shown in Figure 2J. Okamoto et al. [45] proposed a method for green citrus identification using a hyperspectral camera covering the spectral range from 369 to 1042 nm, as shown in Figure 2K. They achieved the correct identification of 80–89% of citrus fruits in the foreground of the validation set. Queensland University of Technology in Australia [47] proposed a multispectral detection system for field pepper detection. A set of LEDs was installed behind the multispectral camera to reduce interference from natural light. By using a near-infrared (>900 nm) threshold to separate vegetation from the background and a blue-wavelength (447 nm) threshold to remove non-vegetation objects in the scene, 69.2% of field peppers were detected at three locations, as shown in Figure 2M–O. Bac [63] developed a robotic harvesting system based on six-band multispectral data with band widths between 40 and 60 nm. The method produced fairly accurate segmentation results in a greenhouse environment, but the constructed obstacle map was not accurate. References [64,65] designed a greenhouse cucumber harvesting robot based on spectral characteristics. Cucumber recognition was conducted by utilizing the spectral differences between cucumbers and leaves, achieving an accuracy rate of 83.3%.
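To make the band-thresholding idea concrete, the sketch below assumes the multispectral frame is available as a NumPy array with one channel per band; the band indices and threshold values are illustrative placeholders, not the published parameters of [47].

```python
import numpy as np

def segment_vegetation(cube, nir_band, blue_band, nir_thresh=0.4, blue_thresh=0.2):
    """Rough fruit/vegetation mask from a multispectral cube of shape (H, W, bands).

    nir_band / blue_band: channel indices of the ~900 nm and ~447 nm bands.
    Reflectance is assumed scaled to [0, 1]; thresholds are illustrative.
    """
    nir = cube[..., nir_band]
    blue = cube[..., blue_band]
    vegetation = nir > nir_thresh        # NIR threshold separates vegetation from background
    not_artificial = blue < blue_thresh  # blue threshold removes non-vegetation objects
    return vegetation & not_artificial
```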

2.2. Machine Vision Algorithms

Image segmentation, object detection, and 3D reconstruction are key technologies for the recognition and localization of fruit and vegetable harvesting robots. In Figure 3, different algorithms are shown, including branch and leaf segmentation, fruit and vegetable image segmentation, fruit and vegetable image detection, branch and leaf detection, and fruit and vegetable detection. These algorithms provide effective methods for image recognition and localization of fruit and vegetable harvesting robots.

2.2.1. Image Segmentation Algorithms

In fruit and vegetable harvesting, segmentation algorithms are used to distinguish fruits and vegetables from the background and separate them from it. Traditional feature-based algorithms for segmentation include the following methods: depth threshold segmentation, shape-based segmentation, similarity measurement, and image binarization segmentation. Machine learning algorithms can utilize a large amount of training data for model learning and training, thereby improving the accuracy and robustness of segmentation. As shown in Table 2, common machine learning segmentation algorithms include semantic segmentation algorithms and instance segmentation algorithms. Figure 4 shows the application proportion of image segmentation algorithms in the context of harvest recognition and localization. The choice of segmentation algorithm depends on the specific application scenario and requirements, and algorithms can be selected and combined according to the actual situation of the fruit and vegetable harvesting robot.
(1) Depth thresholding segmentation. Depth threshold segmentation is a method of dividing an image into different blocks by extracting key features. When the value of a specific pixel in the image is greater than the set threshold, it is assigned to the corresponding block for segmentation. The Beijing Intelligent Agriculture Equipment Center highlights the differences between fruit clusters and the background using the R-G color difference model and performs depth threshold segmentation on the fruit cluster region [71], achieving an accuracy rate of 83.5%. Zhao successfully segmented tomato fruits using an adaptive threshold segmentation algorithm, achieving a recognition rate of 93% [72]. Chongqing University of Technology used the HSV threshold segmentation algorithm to identify citrus targets [73]. Combining the depth threshold segmentation algorithm with other algorithms can improve the results of recognition and localization. Suzhou University proposed an algorithm that combines Mask R-CNN and depth threshold segmentation for the recognition and localization of row–frame tomato picking points, and the results showed that the depth threshold algorithm helps filter the background, with a successful localization rate of 87.3% and an average detection time of 2.34 s [74]. Zhejiang University of Technology performed initial segmentation on an image based on the depth threshold algorithm [75]. A minimal code sketch of threshold-based segmentation, combined with the binarization and circle-fitting ideas of items (3) and (4) below, is given at the end of this subsection.
(2) Similarity measure segmentation. Image segmentation methods based on similarity measurement involve the process of classifying or clustering data objects by computing the degree of similarity between two objects. As shown in Figure 3a, Zhejiang University of Technology achieves cucumber image segmentation and detection by individually calculating the normalized cross-correlation (NCC) matrix with the target image [66], with an accuracy rate of 98%. The Suzhou Industrial Park Sub-Bureau proposed a fruit target segmentation algorithm based on a color space reference table, which does not require complex operations on the background and can effectively remove the background, thereby improving the real-time operability of the picking robot [76].
The clustering segmentation algorithm is a type of image segmentation method based on similarity measurement, which divides pixels in the image into different clusters, each representing a particular image region or object. When combined with thresholding and other algorithms, clustering segmentation algorithms can achieve better segmentation results. South China University of Technology combines K-means clustering segmentation algorithm, depth threshold segmentation algorithm, and morphological operations to obtain the image coordinates of tomato cluster picking points [46]. The detection time is 0.054 s with an accuracy rate of 93.83%. The experimental results showed a significant reduction in noise within the region of the stem of interest after removing part of the background using the K-means clustering segmentation algorithm. In the literature, K-means clustering segmentation and Hough circle fitting were applied to achieve image segmentation of citrus fruits [77]. The detection time is 7.58 s with an accuracy rate of 85%. South China Agricultural University used an improved fuzzy C-means clustering method to segment images of lychee fruits and their stems and calculated the picking points for lychee clusters to determine their spatial positions [33,78]. The detection time is 0.46 s with an accuracy rate of 95.3%.
(3) Image binarization segmentation. The Otsu algorithm is a classic image binarization method that uses the grayscale histogram to determine the optimal threshold and convert the image from grayscale space to binary space. Shandong Agricultural University utilizes the Otsu algorithm and maximum connected domain to quickly identify and segment target grape images in two-dimensional space [79]. This method achieves fast and effective separation of grapes and backgrounds based on Otsu, with a recognition success rate of 90.0% and a detection time of 0.061 s.
(4) Shape segmentation algorithm. Shape segmentation refers to the use of specific shapes to segment meaningful regions and extract features of the targets. South China Agricultural University successfully segmented the contour of banana stalks through edge detection algorithms [80]. The detection time is 0.009 s with an accuracy rate of 93.2%. The Zhejiang University of Technology and Washington State University use the Hough circle transform method to identify apples, as it can better recognize all visible apples [75]. The detection time ranges from 0.025 to 0.13 s, with an error rate of less than 4%. However, traditional image segmentation algorithms are sensitive to lighting changes in natural environments, making it difficult to accurately extract information about obstructed fruits and branches.
(5) Semantic segmentation algorithms. Semantic segmentation algorithms are machine learning algorithms that associate each pixel in the image with labels or categories to identify sets of pixels of different categories. Jilin University employs the PSP-Net semantic segmentation model to perform semantic segmentation of the main trunk ROI image, extract the centerline of the lychee trunk, and obtain the pixel coordinates of the picking points in the ROI image to determine the global coordinates of the picking points in the camera coordinate system [81]. The accuracy rate of this method is 98%. Combining semantic segmentation algorithms with other algorithms can achieve better segmentation results. Nanjing Agricultural University proposes a mature cucumber automatic segmentation and recognition method combining YOLOv3 and U-Net semantic segmentation networks. This algorithm uses an improved U-Net model for pixel-level segmentation of cucumbers and performs secondary detection using YOLOv3, yielding accurate localization results [82]. The accuracy rate of this method is 92.5%.
(6) Instance segmentation algorithms. Instance segmentation not only classifies pixels but also classifies each object, thus enabling the simultaneous recognition of multiple objects. In the identification and positioning of fruit and vegetable picking, the main instance segmentation algorithms used are Mask R-CNN and YOLACT. Northwest A&F University proposed an improved Mask R-CNN algorithm for tomato recognition [7], with a processing time of 0.04 s per image. The accuracy rate of this method is 95.67%. As shown in Figure 3b, detection is performed using YOLACT, Mask R-CNN, and the improved Mask R-CNN, with the improved Mask R-CNN producing more accurate detection results.
Beijing Intelligent Agricultural Equipment Center proposed a tomato plant main stem segmentation model based on Mask R-CNN [83,84]. The accuracy rate of this method is 93%. It locates the centerline of the main stem based on the recognized mask features and then calculates the image coordinates of the harvesting points by offsetting along the lateral branches from the intersection points of the centerline. The University of Lincoln in the UK proposed a Mask R-CNN segmentation detection algorithm with critical point detection capability, which is used to estimate the key harvesting points of strawberries [85]. Yu [86] achieved three-dimensional positioning of strawberry harvesting points by extracting the shape and edge features from the mask images generated by Mask R-CNN. The accuracy rate of this method is 95.78%. South China Agricultural University proposed a method for main fruit branch detection and harvesting point localization of lychees based on YOLACT [69]. The detection time is 0.154 s with an accuracy rate of 89.7%. As shown in Figure 3f, this method utilizes the instance segmentation model YOLACT to obtain clusters and masks of the main fruit branches in the lychee images and selects the midpoint as the harvesting point. At the same time, it uses skeleton extraction and least squares fitting to obtain the main axis of the lychee’s main fruit branch mask, thereby obtaining the angle of the lychee’s main fruit branch as a reference for the robot’s harvesting posture. The use of machine-learning-based image segmentation algorithms allows for the separation of targets and backgrounds, enabling robots to more accurately recognize and locate crops. By automatically learning rules from data, the crop recognition and localization capabilities of robots are improved. A minimal inference sketch with an off-the-shelf Mask R-CNN model is given below.
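The classical pipeline referenced in items (1), (3), and (4) above can be sketched in a few lines. The following is an illustrative combination of an R-G colour difference, Otsu thresholding, morphology, and Hough-circle fitting, not the exact pipeline of any single cited work; all parameter values are placeholders that would need tuning for real images.

```python
import cv2
import numpy as np

def classical_fruit_segmentation(bgr_image, min_radius=15, max_radius=80):
    """Threshold-based segmentation followed by Hough-circle fitting for round fruit."""
    # 1) R-G colour difference highlights reddish fruit against green foliage.
    b, g, r = cv2.split(bgr_image.astype(np.int16))
    diff = np.clip(r - g, 0, 255).astype(np.uint8)

    # 2) Otsu picks the binarization threshold automatically from the histogram.
    _, mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 3) Morphological opening removes isolated noise pixels.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # 4) Hough-circle fitting on the masked grayscale image finds round fruit candidates.
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    gray = cv2.bitwise_and(gray, gray, mask=mask)
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=30,
                               param1=120, param2=40,
                               minRadius=min_radius, maxRadius=max_radius)
    return mask, ([] if circles is None else circles[0])  # each circle: (x, y, radius)
```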
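For the instance segmentation approaches in item (6), a fruit segmentation prototype can be built on the off-the-shelf Mask R-CNN in torchvision. The sketch below uses a COCO-pretrained model purely as a stand-in for the fine-tuned models cited above; the score threshold is an arbitrary example.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# COCO-pretrained placeholder; the cited works fine-tune on their own fruit datasets.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def segment_instances(pil_image, score_threshold=0.7):
    """Return boxes, masks, and scores for detected object instances."""
    with torch.no_grad():
        prediction = model([to_tensor(pil_image)])[0]
    keep = prediction["scores"] > score_threshold
    return prediction["boxes"][keep], prediction["masks"][keep], prediction["scores"][keep]
```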

2.2.2. Object Detection Algorithm

The object detection algorithm is a key algorithm in the recognition and localization work of fruit and vegetable harvesting robots. Researchers have conducted extensive research on the stability, real-time performance, and accuracy of object detection algorithms. Among them, the YOLO object detection model has the characteristics of fast detection speed, small model size, and easy deployment. Figure 5 shows the commonly used YOLOv7 model structure, while the proportion of YOLO model optimization methods applied in harvest recognition and localization is depicted in Figure 6. In order to improve the performance of the algorithm, researchers have taken some optimization measures, as shown in Table 3, such as adding residual modules (ResNet), changing or replacing the backbone feature extraction network, combining the K-means clustering algorithm, adding attention mechanism modules, improving activation functions, etc.
(1) Introducing residual modules ResNet. The Residual Network (ResNet) module allows the YOLO object detection model to perform deeper processing. The idea behind ResNet is to establish direct connections between preceding and subsequent layers to aid gradient backpropagation [87]. Chongqing University of Posts and Telecommunications introduced residual modules based on YOLOv4 and constructed a new network that enhances small object detection in natural environments [67]. The accuracy rate of this method is 94.44%, with a detection time of 0.093 s. South China Agricultural University incorporated the residual concept into the YOLOv3 model, addressing the issue of decreased detection accuracy due to increased network layer depth [98]. The accuracy rate of this method is 97.07%, with a detection time of 0.017 s.
(2) Modifying or replacing the backbone feature extraction network. The role of the backbone feature extraction network is to extract more feature information from the image for subsequent network usage. Several studies have proposed improved backbone feature extraction network schemes. Hunan Agricultural University [8] designed a new backbone feature extraction network based on YOLOv3 for fast detection of citrus fruits in natural environments. The accuracy rate of this method is 94.3%, with a detection time of 0.01 s. Qingdao Agricultural University [68] combined the fast detection capability of YOLOv3 with the high-precision classification ability of DenseNet201, enabling precise detection of tea shoots. The accuracy rate of this method is 95.71%. DenseNet201 utilizes multiple convolutional kernels on smaller-sized feature maps to extract rich features and mitigate gradient vanishing problems. Shandong Agricultural University [88] replaced modules in the YOLOv5 network with multiple LC3, DWConv, and Conv modules, reducing network parameters while enhancing the fusion of shallow-level features, facilitating the extraction of features from small objects. DWConv divides standard convolution into depthwise convolution and pointwise convolution, promoting information flow between channels and improving operation speed. The accuracy rate of this method is 94.7%, with a detection time of 0.467 s. Dalian University [89] replaced CSPDarknet53 in YOLOv4 with DenseNet to decrease gradient vanishing, strengthen feature transmission, reduce parameter count, and achieve the detection of cherry fruits in natural environments. Northwest A&F University [91] replaced BottleneckCSP with BottleneckCSP-2 in YOLOv5s for accurate detection of apples in natural environments. The accuracy rate of this method is 86.57%, with a detection time of 0.015 s. The Shandong University of Science and Technology [90] designed a lightweight backbone feature extraction detection network, LeanNet, that uses convolutional kernels with different receptive fields to extract distinct perceptual information from the feature map. It generates local saliency maps based on green peach features, effectively suppressing irrelevant regions in the background of branches and leaves for the detection of green peaches in natural environments. The accuracy rate of this method is 97.3%.
(3) Applying the K-means clustering algorithm to predicted candidate boxes. Based on YOLOv3, the feature extraction capability of the model can be enhanced by performing K-means clustering analysis on the predicted bounding boxes of leafy objects [43,92]. South China Agricultural University [93] proposed a small-scale lychee fruit detection method based on YOLOv4. This method utilizes the K-means++ algorithm to cluster the labeled bounding boxes to determine anchor sizes suitable for lychees. The K-means++ algorithm mitigates the sensitivity of the K-means algorithm to its initial values. The accuracy rate of this method is 79%. Guangxi University [94] proposed an improved YOLOv3 algorithm for cherry tomato detection. This algorithm uses an improved K-means++ clustering algorithm to calculate the scale of the anchor box, thereby extracting more abundant semantic features of small targets and reducing the problems of information loss and insufficient semantic feature extraction of small targets during network transmission in YOLOv3. The accuracy rate of this method is 94.29%, with a detection time of 0.058 s. A minimal sketch of anchor-size clustering is given at the end of this subsection.
(4) Incorporating attention mechanism modules. The attention mechanism module can focus more on the pixel regions in the image that play a decisive role in classification while suppressing irrelevant regions. Northwest A&F University [91] added the SE module to YOLOv5s for fast apple recognition on trees. The accuracy rate of this method is 86.57%, with a detection time of 0.015 s. Jiangsu University [95] integrated the CBAM module into YOLOX-Tiny to achieve fast apple detection in natural environments. The accuracy rate of this method is 96.76%, with a detection time of 0.015 s. Shandong Agricultural University [92] proposed the SE-YOLOv3-MobileNetV1 algorithm for tomato ripeness detection, which achieved significant improvements in terms of speed and accuracy by introducing the SE module. The accuracy rate of this method is 97.5%, with a detection time of 0.227 s. A minimal sketch of a squeeze-and-excitation (SE) block is given at the end of this subsection.
(5) Enhancing the activation function. Activation functions provide nonlinear functionality in the network structure, playing an important role in network performance and model convergence speed. Improved activation functions can enhance the detection performance of the model. For example, Dalian University [89] replaced the activation function with ReLU in their research on the YOLOv4 object detection model. This improvement increased inter-layer density, improved feature propagation, and promoted feature reuse and fusion, thereby improving detection speed. The accuracy rate of this method is 94.7%, with a detection time of 0.467 s. In addition, other studies [91,96,97] have achieved fruit target detection by deepening the network model and processing the dataset to identify regions of interest.
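To make the anchor-clustering idea in item (3) concrete, the sketch below clusters the width-height pairs of labeled boxes to obtain anchor sizes. It uses scikit-learn's k-means++ initialization as a stand-in for the cited implementations; the IoU-based distance used in the original YOLO papers is a further refinement not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_sizes_from_labels(box_whs, n_anchors=9):
    """Cluster labeled box (width, height) pairs into anchor sizes.

    box_whs: array of shape (N, 2) with box widths and heights in pixels.
    Returns anchors sorted by area, as used to configure YOLO-style detectors.
    """
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10, random_state=0)
    km.fit(np.asarray(box_whs, dtype=np.float32))
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```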
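For item (4), a squeeze-and-excitation (SE) channel-attention block of the kind added to YOLOv5s and YOLOv3 in the works above can be written compactly in PyTorch. This is a generic sketch of the module, not the exact code of the cited models; the channel count in the example is arbitrary.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: re-weights feature channels."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.fc = nn.Sequential(                     # excitation: per-channel gate
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                           # scale each channel by its weight

# Example: insert after a backbone stage with 256 channels.
feat = torch.randn(2, 256, 40, 40)
print(SEBlock(256)(feat).shape)  # torch.Size([2, 256, 40, 40])
```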

2.2.3. 3D Reconstruction Algorithms for Object Models

The goal of a 3D reconstruction algorithm is to convert 2D images or point cloud data into 3D object models. The target three-dimensional reconstruction algorithms in crop harvesting recognition and localization are shown in Table 4.
The Norwegian University of Life Sciences [9] identified harvestable strawberries by fitting the 3D point cloud to a plane. As shown in Figure 3e, they used coordinate transformation, density-based clustering of base points, and position approximation methods to locate segmented strawberries. This algorithm can accurately determine whether the strawberries are within the safe harvest area, achieving a recognition accuracy of 74.1%.
Nanjing Agricultural University [99] proposed a fast reconstruction method for greenhouse tomato plants. They used the iterative closest point (ICP) algorithm for point cloud registration from multiple viewpoints, which has high accuracy and stability. The accuracy rate of this method is 85.49%. Shanghai Jiao Tong University [70] utilized a multispectral camera and a structured camera to reconstruct the upper surface of apple targets in 3D. Figure 3g–j shows the schematic diagram of stem and calyx recognition results. The height information of each pixel can be calculated using triangulation, providing a reference for identifying the picking points of apple stems, with a recognition accuracy of 97.5%. The Norwegian University of Science and Technology, Trondheim [100], presented a faster and more accurate ICP-based 3D reconstruction algorithm. This method can generate 3D point clouds with a resolution of 1 mm and achieves optimal results through 3D registration and reference trajectory optimization with a linear least-squares optimizer. The GPU implementation of this method is 33 times faster than the CPU implementation, providing a reference for the recognition and localization of fruit and vegetable harvesting robots.
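As an illustration of the point-cloud registration step underlying these methods, the sketch below aligns two partial scans with point-to-point ICP using the Open3D library; the file names, voxel size, and correspondence distance are placeholders rather than values from the cited works.

```python
import numpy as np
import open3d as o3d

# Load two partial scans of the scene (placeholder file names).
source = o3d.io.read_point_cloud("scan_view1.pcd")
target = o3d.io.read_point_cloud("scan_view2.pcd")

# Downsample to speed up registration; the voxel size is scene dependent.
source_down = source.voxel_down_sample(voxel_size=0.005)
target_down = target.voxel_down_sample(voxel_size=0.005)

# Point-to-point ICP: (source, target, max correspondence distance, initial guess, estimator).
result = o3d.pipelines.registration.registration_icp(
    source_down, target_down, 0.02, np.eye(4),
    o3d.pipelines.registration.TransformationEstimationPointToPoint())

print("fitness:", result.fitness)                   # fraction of inlier correspondences
print("transformation:\n", result.transformation)   # 4x4 alignment matrix
```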

3. The Challenges of Machine Vision in Recognition and Localization for Fruit and Vegetable Harvesting Robots

Agricultural environments have distinct differences compared to industrial and urban environments. Intelligent agriculture requires a greater number of fruit and vegetable harvesting robots to accomplish tasks, which places higher demands on robots with autonomous capabilities. Machine vision has become an indispensable component of fruit and vegetable harvesting robots and has been a recent research focus [101,102]. However, existing studies indicate that the recognition and localization systems of fruit and vegetable harvesting robots have not achieved optimal results and are influenced by various factors. Robots working in complex environments face situations where different fruits exhibit varying growth characteristics, overlapping growth with branches and leaves, or growing in clusters or independently. Additionally, adverse weather conditions such as strong winds and fluctuations in lighting can also affect the accuracy of localization [33,75]. Therefore, it is not feasible to directly apply recognition and localization solutions from industrial environments to fruit and vegetable harvesting robots, as higher precision is required for recognition and localization. This section will briefly introduce the current status of machine vision in recognition and localization for fruit and vegetable harvesting robots, and emphasize the significant challenges faced.

3.1. The Current Status of Machine Vision in Recognition, Localization, and Harvesting for Fruit and Vegetable Harvesting Robots

3.1.1. Recognition and Localization of Machine Vision in Greenhouse Environments

Compared to other agricultural applications, greenhouses have the highest level of structure, and the growth patterns of fruits are more regular. Fruit and vegetable harvesting robots in greenhouses primarily deal with fruits such as cherries, tomatoes, strawberries, bell peppers, cucumbers, and cherry tomatoes, as shown in Figure 3f. Even within the same variety, there may be subtle morphological differences among the fruits. Some fruits have overlapping shapes, prefer to grow in clusters, and exhibit irregularities in growth conditions.
For the interconnected growth environment of fruits and vegetables in greenhouses, factors such as inconsistent lighting conditions, irregular backgrounds, interference from fruits and leaves, and fruit overlapping can affect the recognition and localization performance of fruit and vegetable harvesting robots. In highly structured greenhouses, it is feasible to achieve the recognition and localization function of the robots by constructing real-time or pre-built 3D maps. However, this requires a high level of understanding of the greenhouse’s structure and crop growth status.

3.1.2. Recognition and Localization of Machine Vision in Outdoor Greenhouse Environments

The harvesting operations in outdoor greenhouses are greatly influenced by weather conditions. In windy environments, leaves and fruits sway with the wind, making recognition and localization challenging. To address this issue, methods for handling swaying fruit clusters have been introduced [33]. Additionally, variations in lighting conditions also affect fruit and vegetable harvesting. Under different light intensities and angles, the texture features of fruits and vegetables may not be distinct, their shapes are irregular, and they are spatially scattered within the orchard. Visual sensors are typically used in top-down or bottom-up photography modes, which may result in incomplete coverage of the fruit in the field of view. The targets of fruit and vegetable harvesting robots include apples, pears, peaches, hawthorns, kiwis, grapes, lychees, citrus fruits, bananas, lemons, bamboo shoots, and tea leaves.
Figure 2P shows the actual harvest situation of an apple picking robot in an orchard [48]. The harvesting robot in the orchard needs to identify apples that are obscured by tree branches or other fruits. In the presence of such obstructions, recognition and localization can easily fail, and failures can also damage the end effector and the mechanical harvesting arm, resulting in failed harvesting operations. Therefore, in the orchard, the apple picking robot needs to automatically distinguish apples that can be grasped from those that cannot. Wang [103] developed a shaking-type apple harvesting robot that moves around the base of the apple tree trunk and uses high-frequency vibration to make apples fall onto a collection device. However, as the apples drop, collisions may occur between apples and between apples and the trunk, causing severe damage to the fruit.

3.2. The Significant Challenges Faced by Machine Vision in Recognition and Localization for Fruit and Vegetable Harvesting Robots

Fruit and vegetable harvesting robots face challenges in terms of accuracy, speed, and stability in complex working environments. These challenges include the precision and stability of real-time recognition under complex background interference, the robustness of crop recognition in different environments, the impact of fruit overlapping and obstruction from leaves and branches on recognition and localization, as well as the reliance on prior information. Additionally, the complex changes in the work environment bring about uncertainties. As a result, more and more researchers have shown interest in this field. Figure 7 illustrates the characteristics of complex agricultural environments. Table 5 introduces the challenges and solutions in recognition and localization for fruit and vegetable harvesting robots. Figure 8E shows the research proportion of the main challenges in fruit and vegetable picking robot recognition and localization.

3.2.1. The Stability of Fast Recognition under Complex Background Interference

There are multiple influencing factors in the same agricultural environment. Even for the human eye, it is difficult to distinguish fruits hidden in a green background. So, how can machines perform recognition and localization [10,177]? When the color of the fruits is very similar to the leaves and branches, machine-vision-based fruit and vegetable harvesting robots are prone to missing detections or confusing fruits with leaves. This often leads to a decrease in recognition accuracy. Figure 8C shows the research quantity of various solution methods obtained in this study under the condition of complex background interference.
(1) Deep learning technology. Deep learning techniques can achieve stable recognition in complex backgrounds [104,105]. These studies propose methods based on deep learning techniques to detect apples and coconuts at different growth stages and adapt to complex environments in orchards. Wen [106] proposed an improved YOLOv4 model that achieved citrus detection in complex backgrounds by pruning unimportant channels or layers in the network. The accuracy rate of this method is 89.23%, with a detection time of 0.06 s. Chu [107] suppressed non-apple features by adding a suppression branch to Mask R-CNN, which only operates when there are significant color differences between fruits and leaves. The accuracy rate of this method is 90.5%, with a detection time of 0.25 s. Xiong [108] proposed a method for citrus detection in complex backgrounds based on YOLOv3, with an accuracy of 90.75%. Jia [109] and Xiong [108] combined ResNet and DenseNet as feature extraction networks to improve the accuracy of object detection in complex backgrounds. The literature [110] focuses on the rapid detection of citrus in complex backgrounds. Zhang [111] introduced a bottom-up feature network on top of the FCOS network to optimize the recognition and localization of green apples in complex backgrounds. The accuracy rate of this method is 85.6%. Ji [112] used an SVM classifier. The accuracy rate was 89%, and it took 0.352 s. Xiong [113] employed the Faster R-CNN model for green citrus identification but could not guarantee real-time detection in complex backgrounds. The accuracy rate of this method was 85.49%. Liu [114] utilized an improved Mask R-CNN to identify cucumber fruits in a similar background. The accuracy rate was 89.47%, and it took 0.346 s.
(2) Based on color features. Some scholars have utilized color features to address the challenges of recognition and localization in complex backgrounds. Liu [115] determined the target area using color features. The accuracy rate of this method is 43.9%, with a detection time of 0.017 s. Based on the Otsu threshold algorithm, references [116,117] employed color features for image segmentation, thereby enhancing the recognition and localization capabilities in complex backgrounds. Arefi [118] extracted ripe tomatoes by removing the background in the color space.
(3) Limitations of color features. For green fruits and vegetables in natural environments, color features have certain limitations and instability, which leads to difficulties in recognition due to background similarity [119,120]. Mature cucumbers, for example, have a color similar to that of tree branches and leaf backgrounds, posing challenges for harvesting robots to accurately detect cucumbers based on color features alone [115,121,122]. The accuracy rate of this method is 89.47%, with a detection time of 0.346 s. Considering the similarity and difficulty in distinguishing apples from the background based on color, Bargoti [123] extracted apple regions from the image and employed watershed segmentation and circular Hough transform to identify apple targets. The accuracy rate of this method is 86.1%.
(4) Based on spatial relationships. Some scholars have solved the problem of recognizing and locating in complex backgrounds by utilizing the positional relationships. Xiong [124] located the harvesting points based on the positional relationship between the lychee and the stem in complex backgrounds, combined with line detection. Zhuang [125] used the Harris corner detection method to identify the peduncle region in images of lychees and branches and then located the harvesting points based on the positional relationship between the corner points and the centroid of the fruit. Benavides [126] located the harvesting points based on the positional relationship between the tomato’s pose, centroid, and peduncle. The accuracy rate of this method is 80.8%, with a detection time of 0.03 s.
(5) Removing background interference. Removing the background can also solve the problem of recognizing and locating in complex backgrounds. Refs [118,127,128,129,130,131] employed traditional thresholding methods to remove the image background, but fixed thresholds cannot adapt to changes in illumination, making it ineffective for segmenting pixels with similar colors to the fruit and background. Xiong [132] used an improved fuzzy clustering algorithm to remove the background from lychee cluster images. The accuracy rate of this method is 93.75%, with a detection time of 0.516 s. Wang [133] employed a natural statistical visual attention model to remove the background. The accuracy rate of this method is 89.63%, with a detection time of 0.343 s. Fu [134] eliminated part of the background in the HSV space and combined support vector machines with local binary pattern features and a histogram of banana-oriented gradient features to locate the banana region. The accuracy rate of this method is 91.6%.

3.2.2. Identifying Stability under Different Lighting Conditions for the Same Crop

The same crop exhibits different apparent shapes and texture features under different lighting conditions, which affects its recognition and localization. To enable all-weather harvesting operations, researchers have studied the recognition and localization of harvesting robots in nighttime environments [140,178]. Compared with natural daylight, nighttime operation avoids variation in light angle and shadows [132]. Figure 8A shows the number of surveyed studies adopting each type of solution under different lighting conditions.
(1) Research in nighttime environments. Xiong [78] designed a system for locating mature lychee picking points in nighttime natural environments. Fu [138] studied kiwifruit recognition at night, achieving an accuracy of 88.3% by recognizing kiwifruit images in the R-G color space. Liang [49] used the YOLOv3 model for lychee fruit detection in nighttime natural environments, but with lower accuracy, and the U-Net network [139] has been applied to segment lychee main stems within ROI images. Xiong [132] removed the background of nighttime images with a fuzzy clustering algorithm and then segmented the fruits with the Otsu algorithm, reaching 93.75% accuracy with a detection time of 0.516 s. He [135] collected 2000 nighttime tomato images as training samples and used an improved YOLOv5 model for rapid tomato recognition, enabling normal operation of greenhouse harvesting robots at night with an accuracy of 96.2%. References [136,137] combined color and thermal images to identify green apples at night, with an accuracy of 74%; however, thermal imaging has comparatively low nighttime accuracy and cannot meet the all-weather operating requirements of fruit and vegetable harvesting robots.
(2) Adding light sources. Some scholars have addressed nighttime recognition and localization by adding artificial light sources to create stable, controlled illumination. Zhao [140] used two incandescent lamps to minimize interference from different light angles and applied a secondary segmentation step to correct highly reflective areas on the apple surface; however, uneven lighting and other factors mean that complete and accurate identification of apples cannot be guaranteed. Liu [141] used a vision system with incandescent lamps to acquire apple images at night and applied two back-propagation neural networks for apple recognition, effectively improving nighttime detection accuracy. As shown in Figure 2Q,R, Liang [49] employed a three-lamp lighting layout to capture nighttime plant images in natural environments and used a pulse-coupled neural network with an information-entropy gradient for segmentation, achieving an accuracy of 67.79%, although the method struggles to adapt to lighting changes and lacks robustness. Kitamura and Oka [142] exploited differences in light reflection between fruit and leaf surfaces to identify bell peppers in a greenhouse equipped with LED lights, using intensity, saturation, and chromaticity thresholds; the accuracy is 80.8%, but applicability is limited, mainly to weak lighting conditions. Liu [141] also designed a segmentation experiment for nighttime apple images, although significant errors can arise when many shadows are present.
(3) Removing shadows. Some scholars have studied shadow removal, which involves two basic stages: (a) foreground object recognition and shadow area detection; and (b) removal of the shadow from the image [143]. Manual labeling of shadow areas [144,145] is not practical for detecting shadows on fruits and vegetables in the natural working environment of harvesting robots. Xiong [143] used superpixel segmentation to decide whether regions of the orchard image lie in shadow and removed the shadow with Finlayson's two-dimensional integration algorithm, achieving an accuracy of 83.16%. Liu [146] took pixels satisfying a threshold condition as seed points separating shadow from non-shadow areas and grew complete shadow regions from them. Qu [148] proposed a color intrinsic-image decomposition method based on orthogonal projection, which effectively handles local shadows. Other shadow-removal methods can be found in references [147,149,150,151,152,153,154,155].
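For orientation only, a deliberately simple shadow-handling sketch follows: dark, low-value pixels are flagged as shadow candidates and brightened. This is a crude illustration of stages (a) and (b); the superpixel and intrinsic-image methods cited above are far more sophisticated, and the threshold and gain below are assumptions.

```python
# Deliberately simple shadow detection and brightening sketch in HSV space.
import cv2
import numpy as np

bgr = cv2.imread("orchard.jpg")                      # hypothetical image path
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
h, s, v = cv2.split(hsv)

shadow = (v < 0.5 * v.mean()).astype(np.uint8)       # crude shadow candidate mask
v_corrected = np.where(shadow == 1, np.clip(v * 1.8, 0, 255), v)

hsv_corrected = cv2.merge([h, s, v_corrected]).astype(np.uint8)
result = cv2.cvtColor(hsv_corrected, cv2.COLOR_HSV2BGR)
```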
(4) Research under natural lighting conditions. Many scholars have also worked directly under natural lighting. Bac [63] designed a sweet pepper recognition method that uses a hyperspectral camera to capture plant features; however, because of the variability of natural light and differences in lighting angle between plants, the accuracy is only 59%. Guo [156] proposed a recognition method for lychee fruits and their mother branches in natural lighting environments, with an accuracy of 91.67%. Xiong [157] used Retinex image enhancement and H-component rotation to segment lychees under natural lighting. Peng [158] presented a lychee recognition method based on a bivariate Otsu algorithm, with an accuracy of 94.75% and a detection time of 0.2 s. Zhao [96] used an improved YOLOv3 algorithm for detection under different natural lighting directions, with an accuracy of 87.71% and a detection time of 0.105 s. Ding [159] detected green oranges under natural lighting using hyperspectral images, but the computational cost is too high to meet the real-time requirements of harvesting robots. Liu et al. [11] proposed an improved DenseNet network that reached 91.26% accuracy in detecting ripe tomatoes under natural lighting interference.
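The Retinex-style enhancement mentioned above can be sketched as a single-scale Retinex: the illumination is estimated with a large Gaussian blur and subtracted in the log domain. The blur scale and normalization are illustrative assumptions and do not reproduce the enhancement used in [157].

```python
# Single-scale Retinex sketch for illumination-robust enhancement.
import cv2
import numpy as np

bgr = cv2.imread("litchi.jpg").astype(np.float32) + 1.0  # +1 avoids log(0); hypothetical path
illumination = cv2.GaussianBlur(bgr, (0, 0), sigmaX=30)  # smooth estimate of the lighting
retinex = np.log(bgr) - np.log(illumination)             # reflectance-like component

# Stretch back to a displayable 8-bit range.
retinex = cv2.normalize(retinex, None, 0, 255, cv2.NORM_MINMAX)
enhanced = retinex.astype(np.uint8)
```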
(5) Handling uneven lighting. Uneven lighting in natural environments degrades the visual positioning accuracy of harvesting robots [49]. To improve the robot's perception in three-dimensional space, Zhuang [160] used a block-based local homomorphic filtering algorithm to remove the influence of unevenly distributed illumination, with an accuracy of 86%.
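A whole-image homomorphic filtering sketch is shown below as a simplified counterpart of the block-based local filter in [160]: low spatial frequencies, which carry the slowly varying illumination, are attenuated in the log domain. The filter gains and cutoff constant are arbitrary illustrative values.

```python
# Whole-image homomorphic filtering sketch for uneven illumination.
import cv2
import numpy as np

gray = cv2.cvtColor(cv2.imread("orchard.jpg"), cv2.COLOR_BGR2GRAY)  # hypothetical path
log_img = np.log1p(gray.astype(np.float32))

rows, cols = gray.shape
u = np.fft.fftfreq(rows)[:, None]
v = np.fft.fftfreq(cols)[None, :]
d2 = u ** 2 + v ** 2                                 # squared spatial frequency
gamma_low, gamma_high, c = 0.5, 1.5, 50.0            # illustrative gains and cutoff
H = (gamma_high - gamma_low) * (1 - np.exp(-c * d2)) + gamma_low

filtered = np.real(np.fft.ifft2(np.fft.fft2(log_img) * H))
out = np.expm1(filtered)
out = cv2.normalize(out, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```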

3.2.3. The Dependence of Recognition and Localization on Prior Information under Overlapping Fruits and Occlusion by Leaves and Branches

Regarding the positional relationship between fruits and obstacles, four situations can be defined for fruit harvesting robots: fruit occluded by branches, fruit occluded by leaves, overlapping fruits, and unoccluded fruits. Under occlusion, the detected fruit shape is often incomplete, so detection relies more heavily on prior information. There are currently four main approaches to detecting occluded fruits: first, directly detecting images of occluded and overlapping fruits; second, classifying and recognizing obstacles and unoccluded fruits separately [12]; third, restoring the image by repairing the occluded fruit before recognition, in which the occluded regions are selected and filled in from the edge information of the objects, and the detected positions are those of the restored fruits in the image; and lastly, detecting occluded and overlapping fruits through geometric computation and multiple sensors. Figure 8B shows the number of surveyed studies adopting each type of solution under fruit occlusion conditions.
(1) Directly detecting occluded and overlapping fruit images. Saedi [161] used an improved convolutional neural network to detect occluded fruits and applied it in fruit-picking robots, with an accuracy of 99.8% and a detection time of 0.008 s; however, the method cannot reliably recognize fruits with large occluded areas, for which recognition rates drop. Sa [1] proposed a multimodal Faster R-CNN model for recognizing occluded fruits and achieved higher F1 scores, but the network structure is overly complex, with a large memory footprint and long runtime. Yan [91] identified overlapping strawberry regions by transferring the Mask R-CNN model, with a segmentation accuracy of 89.5%. Xu [162] proposed a method based on a histogram of oriented gradients (HOG) descriptor and a support vector machine (SVM) classifier for strawberries, achieving an accuracy of 87%, although the classifier can only correctly detect slightly overlapping strawberries.
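The HOG plus SVM pipeline can be sketched as follows with scikit-image and scikit-learn; the training patches here are random placeholders standing in for labelled strawberry and background windows, so the sketch only illustrates the structure of such a classifier, not the results of [162].

```python
# Sketch of a HOG + SVM patch classifier; training patches are placeholders.
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_feature(patch_bgr):
    """HOG descriptor of a patch resized to a fixed 64x64 window."""
    gray = cv2.cvtColor(cv2.resize(patch_bgr, (64, 64)), cv2.COLOR_BGR2GRAY)
    return hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Placeholder data (replace with labelled strawberry / background windows).
rng = np.random.default_rng(0)
train_patches = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(40)]
train_labels = rng.integers(0, 2, 40)                # 1 = strawberry, 0 = background

X = np.array([hog_feature(p) for p in train_patches])
clf = SVC(kernel="linear").fit(X, train_labels)

# Classify one candidate window from a sliding-window or region-proposal stage.
candidate = train_patches[0]
is_strawberry = clf.predict([hog_feature(candidate)])[0] == 1
```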
(2) Classifying and recognizing obstacles and unoccluded fruit. Yan [91] employed an improved YOLOv5 model that distinguishes apples covered by leaves from apples that cannot be harvested because they are occluded by branches or other fruits, with an accuracy of 91.48% and a detection time of 0.015 s. Yang [43] used the YOLOv3 model to classify obstacles and harvestable fruits and, combined with a Kinect V2 structured-light camera, obtained depth information of citrus fruits and obstacles, enabling accurate positioning of both. Lin [163] extracted honey pomelo regions with principal component analysis, allowing overlapping pomelos to be separated precisely and their stems to be located, with an accuracy of 94.02%. Lv [160] proposed a dynamic threshold segmentation method for identifying occluded apples in natural environments.
(3) Image restoration. A convolutional autoencoder-based image restoration approach has been proposed [179]: the obstacles are first encoded, the general shape of the occluded fruit is estimated, the estimated shape is compared with the encoded part to identify the region to be repaired, and the pixels in that region are then filled in to restore the image and recognize the occluded fruit. The accuracy of this method is 95.96%. Abdulla [164] presented an exemplar-based image restoration algorithm with a structural similarity of 99.3%; however, the method requires a large library of source images, which limits its applicability. Hedjazi [165] introduced an adversarial-network-based restoration method that optimizes the generator and discriminator end to end to make the generated textures more realistic, but edge restoration of occluded areas remains unsatisfactory. Arun [166] proposed a non-linear Gaussian bilateral filtering restoration method based on Bayesian conditional probability; compared with recent techniques, detection accuracy improved by 20% and computation time fell by 32%, but the method lacks robustness and handles edge information in occluded regions poorly. Image-restoration-based fruit detection still faces problems such as unclear edges in repaired areas and low restoration quality; moreover, if images in the dataset are highly similar, information loss in the compression stage of the network can degrade the output.
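To illustrate the restore-then-detect idea in its simplest form, the sketch below fills an occlusion mask with classical OpenCV inpainting before any detector is applied; this is a stand-in for the learned autoencoder and GAN restorers discussed above, and the file names are assumptions.

```python
# Restore-then-detect sketch using classical inpainting as a simple stand-in
# for the learned (autoencoder / GAN) restoration methods.
import cv2

bgr = cv2.imread("occluded_fruit.jpg")                                   # hypothetical path
occlusion_mask = cv2.imread("occluder_mask.png", cv2.IMREAD_GRAYSCALE)   # 255 = branch/leaf

# Fill the occluded pixels from surrounding context, then run any detector
# on the restored image instead of the original.
restored = cv2.inpaint(bgr, occlusion_mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
cv2.imwrite("restored.jpg", restored)
```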
(4) Computation and multi-sensor detection. Lv [167] computed distances within each connected fruit region and extracted valid peaks from a smoothed curve, using the number of peaks to determine the shape of overlapping apples. Tian [168] used depth images together with the corresponding RGB information to locate the center and radius of apples and fit the target regions; this addresses overlapping fruits, with an accuracy of 96.61%, but its robustness in complex environments is poor. Liang [169] computed the stem position from the relationship between tomato clusters and stems, taking the corner between the first fruit on the cluster and the main stem as the picking point, with an accuracy of 90%. Yan [180] recognized occluded small tomatoes using hue and curvature: red areas are detected in the image, tomato contours are extracted, and curvature is computed for target recognition; however, the method is strongly affected by lighting and reaches a recognition rate of only 78.8%. Lin [170] proposed capturing images with multiple cameras to resolve occlusion, successfully imaging objects in occluded areas.
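A simplified sketch of depth-assisted separation of overlapping fruit follows: pixels within a narrow depth band around the nearest fruit surface are kept and a circle is fitted to the resulting blob. The 3 cm band, file names, and the use of a precomputed fruit mask are assumptions, not the exact procedure of [168].

```python
# Sketch: combine a depth map with an RGB-derived fruit mask to isolate the
# frontmost fruit of an overlapping pair, then fit a circle to it.
import cv2
import numpy as np

depth = np.load("depth.npy")                                      # depth in metres, hypothetical file
fruit_mask = cv2.imread("fruit_mask.png", cv2.IMREAD_GRAYSCALE)   # from a color/DL detector

valid = (fruit_mask > 0) & (depth > 0)
nearest = depth[valid].min()
front_fruit = (valid & (depth < nearest + 0.03)).astype(np.uint8) * 255  # ~3 cm depth band

contours, _ = cv2.findContours(front_fruit, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    biggest = max(contours, key=cv2.contourArea)
    (cx, cy), radius = cv2.minEnclosingCircle(biggest)
    print(f"front fruit centre ({cx:.0f}, {cy:.0f}), radius {radius:.0f}px")
```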

3.2.4. Uncertainty in Fruit Picking Due to Complex Work Environments

Fruit and vegetable picking robots encounter many situations during operation in natural environments, such as repeated picking attempts on hard-to-pick fruits and redundant picking caused by misidentification. Picking robot systems can only compensate for the systematic errors of the visual hardware; unknown random errors are difficult to compensate for, so the fault tolerance of the vision system is low. Figure 8D shows the number of surveyed studies adopting each type of solution under harvest uncertainty.
(1) Reducing overall vibrations. Some picking robots vibrate as a whole during operation, which causes identification and positioning failures [171,172]; lens distortion also degrades image formation. Shiigi [173] developed a three-armed picking robot with a suction gripper as the end effector: after the fruit is held by suction, the stalk is grasped with two fingers and cut on both sides according to its tilt angle. However, the grasping process introduces uncertain mechanical vibrations, causing the vision system to misestimate the tilt angle of the fruit stem; the success rate of this method is 38%. Han [50] designed an end effector for cutting strawberry stems; as shown in Figure 2S, a monocular camera, cutting device, and fixture are mounted at the end to compensate for position changes caused by bending of the end effector. Liu [174] designed a novel end effector for strawberry picking whose internal sensors sense and compensate for positioning errors of the machine vision system, giving it robustness to errors introduced by the vision module.
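As a purely illustrative complement to these mechanical measures, the sketch below smooths successive, vibration-jittered fruit position measurements with an exponential moving average before the target is handed to the manipulator; it does not correspond to any cited system, and the smoothing factor is an assumption.

```python
# Illustrative smoothing of vibration-jittered 3D fruit positions.
import numpy as np

class PositionSmoother:
    def __init__(self, alpha=0.3):
        self.alpha = alpha        # lower alpha = stronger smoothing
        self.state = None

    def update(self, xyz):
        xyz = np.asarray(xyz, dtype=float)
        self.state = xyz if self.state is None else \
            self.alpha * xyz + (1 - self.alpha) * self.state
        return self.state

smoother = PositionSmoother()
for measurement in [(0.52, 0.11, 0.80), (0.55, 0.10, 0.82), (0.50, 0.12, 0.79)]:
    target = smoother.update(measurement)   # jittery detections -> steadier target
print(target)
```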
(2) Sensor interference. Hayashi [175] developed a strawberry picking robot that uses photoelectric sensors to detect the suction device securing the fruit and the picked fruits. However, the photoelectric sensors can interfere with the visual system, and strawberries with ripeness above 80% and their stems cannot be identified accurately. The picking success rate of this method is only 38.1%, with a single picking time of 11.5 s.
(3) Establishing a fault-tolerant mathematical model. Zou [176] addressed identification and positioning errors by establishing a fault-tolerant mathematical model together with a binocular vision mechanism, increasing the tolerance to such errors; the picking success rate for lychee and citrus with this method is 78%.

4. Conclusions

(1) Different visual sensors have distinct advantages and disadvantages in different environments. Monocular cameras obtain depth information only indirectly, with high algorithmic complexity, long computation time, and poor stability in complex environments; they are sensitive to lighting and cannot be used in dark or low-light conditions. Where monocular information is insufficient for motion planning, stereo vision can provide higher detection accuracy, but it requires accurate sensor calibration, has long computation times, and is easily affected by lighting, which reduces picking efficiency. Structured-light cameras offer good positioning accuracy, yet in outdoor environments sunlight overwhelms much of the infrared signal, making high precision hard to achieve and raising costs. Multispectral cameras are less susceptible to environmental interference but require heavy image-processing computation, making them unsuitable for real-time picking. The cost of 3D perception systems remains a major limiting factor for precise positioning.
(2) Each class of machine vision algorithm has its limitations. Fruit and vegetable picking robots face many uncertainties during picking, including complex backgrounds, varying lighting, and complex working conditions. Traditional image segmentation algorithms are sensitive to lighting changes in natural environments, whereas learning-based segmentation algorithms learn rules automatically from large amounts of data and achieve better results, improving detection accuracy by nearly 5 percentage points over traditional segmentation. Improved object detection models achieve high recognition accuracy, with the highest average detection precision and fastest detection speed; methods that add residual (ResNet) modules have performed well and been widely applied, with detection times of 0.017 to 0.093 s and detection accuracies of 94.44% to 97.07%. However, these models are affected by environmental changes and by the growth conditions and characteristics of the fruits and vegetables, and the more factors an object detection algorithm considers, the more complex its structure becomes. Three-dimensional reconstruction algorithms require extensive computation and longer detection times, and accuracy differs substantially among reconstruction methods, from 74.1% at the lowest to 97.5% at the highest.
(3) Developing advanced image processing algorithms that accurately identify and locate fruits in complex environments remains a challenge. In recent years, various visual sensors have drawn increasing attention, and the core issue is how to use the data they collect. There is still a lack of datasets covering different fruit growth stages and individual differences between fruits. From the standpoint of raw image data, the collected fruit and vegetable image datasets can be expanded, or distinctive fruit features can be extracted through information processing to determine target regions; decision fusion can then handle complex background interference, different lighting conditions, and occlusion by overlapping fruits, branches, and trunks, reducing both the data requirements and the number of machine vision algorithms needed. Among the surveyed literature, studies on complex background interference account for 42.06% of the research, and existing studies show good detection performance under both complex background interference and target occlusion. Recognition and positioning of fruits and vegetables are most affected by different lighting conditions, with a minimum detection accuracy of 59.2%.
(4) The accuracy of target positioning and the capability of the robotic arm's end effector and control system strongly affect the harvesting rate of fruit and vegetable picking robots. Among the surveyed literature, picking success rates range from 38% to 78%. Existing systems can only compensate for the systematic errors of the visual hardware, such as inaccurate image formation caused by lens distortion; unknown random errors are difficult to model and compensate for. When target fruits grow in clusters or are obstructed by branches and other interfering objects, large random errors arise in the recognition and positioning system, leading to unstable recognition and positioning, grasping failures by the end effector, and potentially fruit damage or damage to critical components of the robot. The cooperation between machine vision and mechanical fault tolerance has therefore recently become a hot research topic, since traditional vision methods do not consider mechanical or vision-related random errors. Future research on fruit and vegetable picking robots should explore fault tolerance theory to achieve precise positioning of target fruits.

5. Future Prospects

Target recognition and 3D positioning for picking robots have long been hot topics in smart agriculture research. Compared with traditional recognition and positioning methods, deep-learning-based neural network algorithms deserve greater attention. Based on the current state of research on recognition and positioning for machine-vision-based picking robots, the following directions merit attention:
(1) Use multi-sensor fusion to obtain richer information and develop more general algorithms. Existing machine vision algorithms are trained by feeding images of specific picking targets into a network model, but the targets exist in diverse environments, with unexpected objects entering the field of view, and the fault tolerance and stability of current algorithms are limited. Multi-sensor fusion for fruit and vegetable recognition and localization can let different sensors compensate for each other's limitations. Future algorithms should also be more versatile, able to infer relevant environmental features from the objects being picked.
(2) Enhance the feature extraction capability of the model by collecting sample images from different growth stages. Existing machine vision algorithms require large training sets, yet in nature the target fruits exhibit different features at different growth stages. Sample images of fruits and vegetables can be collected at various stages, and newly acquired test images can be folded back into the training set to continually refine the network weights. More representative features can then be extracted through attention mechanism modules, which requires a network structure with strong feature extraction capability.
(3) Reduce the complexity and cost of recognition, localization, and picking from the perspective of horticultural practice and fruit and vegetable growth patterns. The main causes of large deviations between the actual and estimated positions of target fruits are identification errors, mechanical vibrations, and inaccurate distance measurements from the visual sensors. Future efforts can focus on horticultural operations such as tree pruning, pollination, and thinning; exploiting the regularity of fruit and vegetable growth may become a breakthrough for improving localization accuracy with machine vision. Improving the visibility of target fruits and reducing occlusion and clustering presents the targets more clearly to the sensors, removes a significant amount of noise, and makes precise three-dimensional positioning easier to obtain. In addition, when the targets are clearly visible, the end effector can tolerate larger positional deviations, fundamentally reducing the complexity and cost of the three-dimensional positioning system.
(4) The rise of large models opens new possibilities for fruit and vegetable picking robots. Large-model inference requires high-performance hardware, and it is not practical to mount such processors directly on picking robots. In the future, cloud-based large models could be accessed through API interfaces so that their intelligence supports continuous picking and positioning decisions. Large-model algorithms could also be optimized jointly with the visual sensors to continuously locate picking points and plan picking paths across different stages and distances, enabling picking operations in a wide range of scenarios.
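A purely illustrative sketch of this cloud-access idea is given below; the endpoint URL, request fields, and response schema are all hypothetical and would depend on the actual service.

```python
# Illustrative only: upload a frame to a hypothetical cloud inference endpoint
# and read back picking-point suggestions. URL and schema are assumptions.
import requests

ENDPOINT = "https://example.com/v1/picking-advisor"   # hypothetical service

with open("frame.jpg", "rb") as f:
    response = requests.post(ENDPOINT, files={"image": f}, timeout=10)
response.raise_for_status()

for point in response.json().get("picking_points", []):   # assumed response schema
    print(point)
```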

Author Contributions

Conceptualization, R.N. and M.J.; methodology, G.H.; investigation, H.C.; resources, G.H. and M.J.; data curation, M.J.; writing—original draft preparation, G.H.; writing—review and editing, G.H.; visualization, G.H. and R.N.; project administration, G.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the following funds: the Strategic Priority Research Program of the Chinese Academy of Sciences (grant no.: XDA28120000 and XDA28040000); the Subproject of the National Key R&D Program (grant no.: 2022YFD2001404-01); the Natural Science Foundation of Shandong Province (grant no.: ZR2021MF094); the Key R&D Plan of Shandong Province (grant no.: 2020CXGC010804); the Central Leading Local Science and Technology Development Special Fund Project (grant no.: YDZX2021122); and the Science & Technology Specific Projects in Agricultural High-tech Industrial Demonstration Area of the Yellow River Delta (grant no.: 2022SZX11).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sa, I.; Ge, Z.; Dayoub, F.; Upcroft, B.; Perez, T.; McCool, C. DeepFruits: A Fruit Detection System Using Deep Neural Networks. Sensors 2016, 16, 1222. [Google Scholar] [CrossRef]
  2. Zheng, T.; Jiang, M.; Feng, M. Research overview of visual-based target recognition and localization methods for harvesting robots. J. Instrum. Instrum. 2021, 42, 28–51. [Google Scholar]
  3. Ruan, S.J.; Chen, J.H. Title of presentation. In Proceedings of the 2022 IEEE 4th Global Conference on Life Sciences and Technologies, Osaka, Japan, 7–9 March 2022; pp. 431–432. [Google Scholar]
  4. Luo, G. Depth Perception and 3D Reconstruction Based on Binocular Stereo Vision. Ph.D. Thesis, Central South University, Changsha, China, 2012; pp. 6–15. [Google Scholar]
  5. Anwar, I.; Lee, S. High performance stand-alone structured light 3D camera for smart manipulators. In Proceedings of the 2017 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), Jeju, Republic of Korea, 28 June–1 July 2017; pp. 192–195. [Google Scholar]
  6. Zhang, B.H.; Huang, W.Q.; Li, J.B. Principles, developments and applications of computer vision for external quality inspection of fruits and vegetables: A review. Food Res. Int. 2014, 62, 326–343. [Google Scholar] [CrossRef]
  7. Xu, P.; Fang, N.; Liu, N. Visual recognition of cherry tomatoes in plant factory based on improved deep instance segmentation. Comput. Electron. Agric. 2022, 197, 106991. [Google Scholar] [CrossRef]
  8. Xiao, X.; Huang, J.; Li, M. Fast recognition method for citrus under complex environments based on improved YOLOv3. J. Eng. 2022, 2022, 148–159. [Google Scholar] [CrossRef]
  9. Ge, Y.; Xiong, Y.; Tenorio, G.L. Fruit localization and environment perception for strawberry harvesting robots. IEEE Access 2019, 7, 147642–147652. [Google Scholar] [CrossRef]
  10. Liu, J. Research progress analysis of robotic harvesting technologies in greenhouse. Trans. Chin. Soc. Agric. Mach. 2017, 48, 1–18. [Google Scholar]
  11. Liu, J.; Pi, J.; Xia, L. A novel and high precision tomato maturity recognition algorithm based on multi-level deep residual network. Multimed. Tools Appl. 2019, 79, 9403–9417. [Google Scholar] [CrossRef]
  12. Chen, J.; Zhang, H.; Wang, Z. An image restoration and detection method for picking robot based on convolutional auto-encoder. Comput. Electron. Agric. 2022, 196, 106896. [Google Scholar] [CrossRef]
  13. Mehta, S.S.; Ton, C.; Asundi, S.; Burks, T.F. Multiple Camera Fruit Localization Using a Particle Filter. Comput. Electron. Agric. 2017, 142, 139–154. [Google Scholar] [CrossRef]
  14. Hua, X.; Li, H.; Zeng, J.; Han, C.; Chen, T.; Tang, L.; Luo, Y. A review of target recognition technology for fruit picking robots: From digital image processing to deep learning. Appl. Sci. 2023, 13, 4160. [Google Scholar] [CrossRef]
  15. Xiao, F.; Wang, H.; Xu, Y.; Zhang, R. Fruit detection and recognition based on deep learning for automatic harvesting: An overview and review. Agronomy 2023, 13, 1625. [Google Scholar] [CrossRef]
  16. Li, Y.; Feng, Q.; Li, T.; Xie, F.; Liu, C.; Xiong, Z. Advance of target visual information acquisition technology for fresh fruit robotic harvesting: A review. Agronomy 2022, 12, 1336. [Google Scholar] [CrossRef]
  17. Wang, Z.; Xun, Y.; Wang, Y.; Yang, Q. Review of smart robots for fruit and vegetable picking in agriculture. Int. J. Agric. Biol. Eng. 2022, 15, 33–54. [Google Scholar]
  18. Tang, Y.; Chen, M.; Wang, C. Recognition and localization methods for vision-based fruit picking robots: A review. Front. Plant Sci. 2020, 11, 510. [Google Scholar] [CrossRef] [PubMed]
  19. Jiang, P.; Luo, L.; Zhang, B. Research on Target Localization and Recognition Based on Binocular Vision and Deep Learning with FPGA. J. Phys. Conf. Ser. 2022, 2284, 12009. [Google Scholar] [CrossRef]
  20. Feng, Q.C.; Zou, W.; Fan, P.F.; Zhang, C.F.; Wang, X. Design and Test of Robotic Harvesting System for Cherry Tomato. Int. J. Agric. Biol. Eng. 2018, 11, 96–100. [Google Scholar] [CrossRef]
  21. Baeten, J.; Kevin, D.; Sven, B.; Wim, B.; Eric, C. Autonomous Fruit Picking Machine: A Robotic Apple Harvester. Field Serv. Robot. 2008, 42, 531–539. [Google Scholar]
  22. Bulanon, D.M.; Kataoka, T.; Ota, Y.A. Machine Vision System for the Apple Harvesting Robot. Agric. Eng. Int. Cigr. Ejournal 2001, 3, 1–11. [Google Scholar]
  23. Zhao, J.; Tow, J.; Katupitiya, J. On-tree Fruit Recognition Using Texture Properties and Color Data. In Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Edmonton, AB, Canada, 2–6 August 2005; pp. 263–268. [Google Scholar]
  24. Mehta, S.S.; Burks, T.F. Vision-based Control of Robotic Manipulator for Citrus Harvesting. Comput. Electron. Agric. 2014, 102, 146–158. [Google Scholar] [CrossRef]
  25. Meng, L.; Yuan, L.; Qing, L.W. A Calibration Method for Mobile Omnidirectional Vision Based on Structured Light. IEEE Sens. J. 2021, 21, 11451–11460. [Google Scholar] [CrossRef]
  26. Cao, K.; Liu, R.; Wang, Z.; Peng, K.; Zhang, J.; Zheng, J.; Teng, Z.; Yang, K.; Stiefelhagen, R. Tightly-coupled liDAR-visual SLAM based on geometric features for mobile agents. arXiv 2023, arXiv:2307.07763. [Google Scholar]
  27. Shu, C.F.; Luo, Y.T. Multi-modal feature constraint based tightly coupled monocular Visual-liDAR odometry and mapping. IEEE Trans. Intell. Veh. 2023, 8, 3384–3393. [Google Scholar] [CrossRef]
  28. Zhang, L.; Yu, X.; Adu-Gyamfi, Y.; Sun, C. Spatio-temporal fusion of LiDAR and camera data for omnidirectional depth perception. Transp. Res. Rec. 2023, 1. [Google Scholar] [CrossRef]
  29. Cheng, X.; Qiu, S.; Zou, Z.; Pu, J.; Xue, X. Understanding depth map progressively: Adaptive distance interval separation for monocular 3D object detection. arXiv 2023, arXiv:2306.10921. [Google Scholar]
  30. Ma, R.; Yin, Y.; Chen, J.; Chang, R. Multi-modal information fusion for liDAR-based 3D object detection framework. Multimed. Tools Appl. 2023, 13, 1731. [Google Scholar] [CrossRef]
  31. Guo, S.; Guo, J.; Bai, C. Semi-Direct Visual Odometry Based on Monocular Depth Estimation. In Proceedings of the 2019 IEEE International Conference on Unmanned Systems (ICUS), Beijing, China, 17–19 October 2019; pp. 720–724. [Google Scholar]
  32. Edan, Y.; Rogozin, D.; Flash, T. Robotic melon harvesting. IEEE Trans. Robot. Autom. 2000, 16, 831–834. [Google Scholar] [CrossRef]
  33. Xiong, J.; He, Z.; Lin, R.; Liu, Z.; Bu, R.; Yang, Z.; Peng, H.; Zou, X. Visual Positioning Technology of Picking Robots for Dynamic Litchi Clusters with Disturbance. Comput. Electron. Agric. 2018, 151, 226–237. [Google Scholar] [CrossRef]
  34. Wang, H.; Mao, W.; Liu, G.; Hu, X.; Li, S. Recognition and positioning of apple harvesting robot based on visual fusion. J. Agric. Mach. 2012, 43, 165–170. [Google Scholar]
  35. Mrovlje, J.; Vrancic, D. Distance measuring based on stereoscopic pictures. In Proceedings of the 9th International PhD Workshop on Systems and Control, Izola, Slovenia, 1–3 October 2008. [Google Scholar]
  36. Pal, B.; Khaiyum, S.; Kumaraswamy, Y.S. 3D point cloud generation from 2D depth camera images using successive triangulation. In Proceedings of the IEEE International Conference on Innovative Mechanisms for Industry Applications, Bangalore, India, 19–20 May 2017; pp. 129–133. [Google Scholar]
  37. Ji, W.; Meng, X.; Qian, Z.; Xu, B.; Zhao, D. Branch Localization Method Based on the Skeleton Feature Extraction and Stereo Matching for Apple Harvesting Robot. Int. J. Adv. Robot. Syst. 2017, 14, 172988141770527. [Google Scholar] [CrossRef]
  38. Guo, A.; Xiong, J.; Xiao, D.; Zou, X. Calculation and stereo matching of picking points for litchi using fused Harris and SIFT algorithm. J. Agric. Mach. 2015, 46, 11–17. [Google Scholar]
  39. Jiang, H.; Peng, Y.; Ying, Y. Measurement of 3-D locations of ripe tomato by binocular stereo vision for tomato harvesting. In Proceedings of the 2008 ASABE International Meeting, Providence, RI, USA, 29 June–2 July 2008; p. 084880. [Google Scholar]
  40. Van Henten, E.J.; Van Tuijl, B.A.J.; Hoogakker, G.-J.; Van Der Weerd, M.J.; Emming, J.; Kornet, J.G.; Bontsema, J. An Autonomous Robot for De-leafing Cucumber Plants Grown in a High-wire Cultivation System. Biosyst. Eng. 2006, 94, 317–323. [Google Scholar] [CrossRef]
  41. Van Henten, E.J.; Van Tuijl, B.A.J.; Hemming, J.; Kornet, J.G.; Bontsema, J.; Van Os, E.A. Field Test of an Autonomous Cucumber Picking Robot. Biosyst. Eng. 2003, 86, 305–313. [Google Scholar] [CrossRef]
  42. Yoshida, T.; Kawahara, T.; Fukao, T. Fruit Recognition Method for a Harvesting Robot with RGB-D Cameras. Robomech. J. 2022, 9, 1–10. [Google Scholar] [CrossRef]
  43. Yang, C.; Liu, Y.; Wang, Y.; Xiong, L.; Xu, H.; Zhao, W. Research on Recognition and Positioning System for Citrus Harvesting Robot in Natural Environment. Trans. Chin. Soc. Agric. Mach. 2019, 50, 14–22+72. [Google Scholar]
  44. Safren, Q.; Alchanatis, V.; Ostrovsky, V.; Levi, O. Detection of green apples in hyperspectral images of apple-tree foliage using machine vision. Trans. ASABE 2008, 50, 2303–2313. [Google Scholar] [CrossRef]
  45. Okamoto, H.; Lee, W.S. Green citrus detection using hyperspectral imaging. Comput. Electron. Agric. 2010, 66, 201–208. [Google Scholar] [CrossRef]
  46. Zhang, Q.; Chen, J.; Li, B.; Xu, C. Method for identifying and locating the picking points of tomato clusters based on RGB-D information fusion and object detection. Trans. Chin. Soc. Agric. Mach. 2021, 37, 143–152. [Google Scholar]
  47. Mccool, C.; Sa, I.; Dayoub, F. Visual detection of occluded crop: For automated harvesting. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–19 May 2016. [Google Scholar]
  48. Yan, B.; Fan, P.; Lei, X. A real-time apple targets detection method for picking robot based on improved YOLOv5. Remote Sens. 2021, 13, 1619. [Google Scholar] [CrossRef]
  49. Liang, C.; Xiong, J.; Zheng, Z. A visual detection method for nighttime litchi fruits and fruiting stems. Comput. Electron. Agric. 2020, 169, 105192. [Google Scholar] [CrossRef]
  50. Han, K.S.; Kim, S.-C.; Lee, Y.-B.; Kim, S.C.; Im, D.-H.; Choi, H.-K. Strawberry harvesting robot for bench-type cultivation. Biosyst. Eng. 2012, 37, 65–74. [Google Scholar] [CrossRef]
  51. Atif, M.; Lee, S. Adaptive Pattern Resolution for Structured Light 3D Camera System. In Proceedings of the 2018 IEEE SENSORS, New Delhi, India, 28–31 October 2018; pp. 1–4. [Google Scholar]
  52. Weinmann, M.; Schwartz, C.; Ruiters, R.; Klein, R. A Multicamera, Multi-projector Super-Resolution Framework for Structured Light. In Proceedings of the 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, Hangzhou, China, 16–19 May 2011; pp. 397–404. [Google Scholar]
  53. Lee, S.; Atif, M.; Han, K. Stand-Alone Hand-Eye 3D Camera for Smart Modular Manipulator. In Proceedings of the IEEE/RSJ IROS Workshop on Robot Modularity, Daejeon, Republic of Korea, 9–14 October 2016. [Google Scholar]
  54. Hyun, J.S.; Chiu, G.T.-C.; Zhang, S. High-speed and high-accuracy 3D surface measurement using a mechanical projector. Opt. Express 2018, 26, 1474–1487. [Google Scholar] [CrossRef] [PubMed]
  55. Nevatia, R. Depth measurement by motion stereo. Comput. Graph. Image Process. 1976, 5, 203–214. [Google Scholar] [CrossRef]
  56. Subrata, I.D.M.; Fujiura, T.; Nakao, S. 3-D Vision Sensor for Cherry Tomato Harvesting Robot. Jpn. Agric. Res. Q. 1997, 31, 257–264. [Google Scholar]
  57. Wang, Z.; Walsh, K.B.; Verma, B. On-tree mango fruit size estimation using RGB-D images. Sensors 2017, 17, 2738. [Google Scholar] [CrossRef]
  58. Rong, J.; Dai, G.; Wang, P. A peduncle detection method of tomato for autonomous harvesting. Complex Intell. Syst. 2021, 8, 2955–2969. [Google Scholar] [CrossRef]
  59. Zheng, B.; Sun, G.; Meng, Z.; Nan, R. Vegetable Size Measurement Based on Stereo Camera and Keypoints Detection. Sensors 2022, 22, 1617. [Google Scholar] [CrossRef]
  60. Jin, X.; Tang, L.; Li, R.; Zhao, B.; Ji, J.; Ma, Y. Edge Recognition and Reduced Transplantation Loss of Leafy Vegetable Seedlings with Intel RealsSense D415 Depth Camera. Comput. Electron. Agric. 2022, 198, 107030. [Google Scholar] [CrossRef]
  61. Tran, T.M. A Study on Determination of Simple Objects Volume Using ZED Stereo Camera Based on 3D-Points and Segmentation Images. Int. J. Emerg. Trends Eng. Res. 2020, 8, 1990–1995. [Google Scholar] [CrossRef]
  62. Pan, S.; Ahamed, T. Pear Recognition in an Orchard from 3D Stereo Camera Datasets to Develop a Fruit Picking Mechanism Using Mask R-CNN. Sensors 2022, 22, 4187. [Google Scholar] [CrossRef]
  63. Bac, C.W.; Hemming, J.; Van Henten, E.J. Robust pixel-based classification of obstacles for robotic harvesting of sweet-pepper. Comput. Electron. Agric. 2013, 96, 148–162. [Google Scholar] [CrossRef]
  64. Yuan, T.; Li, W.; Feng, Q. Spectral imaging for greenhouse cucumber fruit detection based on binocular stereovision. In Proceedings of the 2010 ASABE International Meeting, Pittsburgh, PA, USA, 20–23 June 2010. [Google Scholar]
  65. Ji, C.; Feng, Q.; Yuan, T. Development and performance analysis of greenhouse cucumber harvesting robot system. Robot 2011, 6, 726–730. [Google Scholar]
  66. Bao, G.; Cai, S.; Qi, L. Multi-template matching algorithm for cucumber recognition in natural environment. Comput. Electron. Agric. 2016, 127, 754–762. [Google Scholar] [CrossRef]
  67. Zheng, T. Research on tomato detection in natural environment based on RC-YOLOv4. Comput. Electron. Agric. 2022, 198, 107029. [Google Scholar] [CrossRef]
  68. Xu, W.; Zhao, L.; Li, J. Detection and classification of tea buds based on deep learning. Comput. Electron. Agric. 2022, 192, 106547. [Google Scholar] [CrossRef]
  69. Zhong, Z.; Xiong, J.; Zheng, Z. A method for litchi picking points calculation in natural environment based on main fruit bearing branch detection. Comput. Electron. Agric. 2021, 189, 106398. [Google Scholar] [CrossRef]
  70. Zhang, B.; Huang, W.; Wang, C. Computer vision recognition of stem and calyx in apples using near-infrared linear-array structured light and 3D reconstruction. Biosyst. Eng. 2015, 139, 25–34. [Google Scholar] [CrossRef]
  71. Feng, Q.; Zhao, C.; Wang, X. Measurement method for targeted measurement of cherry tomato fruit clusters based on visual servoing. Trans. Chin. Soc. Agric. Mach. 2015, 31, 206–212. [Google Scholar]
  72. Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. Robust tomato recognition for robotic harvesting using feature images fusion. Sensors 2016, 16, 173. [Google Scholar] [CrossRef]
  73. Li, Y. Research on Target Recognition and Positioning Technology of Citrus Harvesting Robot Based on Binocular Vision. Master’s Thesis, Chongqing University of Technology, Chongqing, China, 2017. [Google Scholar]
  74. Yan, J.; Wang, P.; Wang, T. Identification and Localization of Optimal Picking Point for Truss Tomato Based on Mask R-CNN and Depth Threshold Segmentation. In Proceedings of the 2021 IEEE 11th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Jiaxing, China, 27–31 May 2021; pp. 899–903. [Google Scholar]
  75. Yang, Q.; Chen, C.; Dai, J. Tracking and recognition algorithm for a robot harvesting oscillating apples. Int. J. Agric. Biol. Eng. 2020, 13, 163–170. [Google Scholar] [CrossRef]
  76. Zhang, J. Target extraction of fruit picking robot vision system. J. Phys. Conf. Ser. 2019, 1423, 012061. [Google Scholar] [CrossRef]
  77. Xiong, J.; Zou, X.; Peng, H. Real-time recognition and picking point determination technology for perturbed citrus harvesting. Trans. Chin. Soc. Agric. Mach. 2014, 45, 38–43. [Google Scholar]
  78. Xiong, J.; Lin, R.; Liu, Z. Recognition technology of litchi picking robot in natural environment at night. Trans. Chin. Soc. Agric. Mach. 2017, 48, 28–34. [Google Scholar]
  79. Zhu, Y.; Zhang, T.; Liu, L. Fast Location of Table Grapes Picking Point Based on Infrared Tube. Inventions 2022, 7, 27. [Google Scholar] [CrossRef]
  80. Wu, F.; Duan, J.; Ai, P. Rachis detection and three-dimensional localization of cut-off point for vision-based banana robot. Comput. Electron. Agric. 2022, 198, 107079. [Google Scholar] [CrossRef]
  81. Silwal, A.; Karkee, M.; Zhang, Q. A hierarchical approach to apple identification for robotic harvesting. Trans. ASABE 2016, 59, 1079–1086. [Google Scholar]
  82. Qi, X.; Dong, J.; Lan, Y. Method for Identifying Litchi Picking Position Based on YOLOv5 and PSPNet. Remote Sens. 2022, 14, 2004. [Google Scholar] [CrossRef]
  83. Feng, Q.; Cheng, W.; Li, Y.; Wang, B.; Chen, L. Localization method of tomato plant pruning points based on Mask R-CNN. Trans. Chin. Soc. Agric. Eng. 2022, 38, 128–135. [Google Scholar]
  84. Feng, Q.; Cheng, W.; Zhang, W. Visual Tracking Method of Tomato Plant Main-Stems for Robotic Harvesting. In Proceedings of the 2021 IEEE 11th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Jiaxing, China, 27–31 July 2021; pp. 886–890. [Google Scholar]
  85. Tafuro, A.; Adewumi, A.; Parsa, S. Strawberry picking point localization, ripeness, and weight estimation. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2295–2302. [Google Scholar]
  86. Yu, Y.; Zhang, K.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask R-CNN. Comput. Electron. Agric. 2019, 163, 104846. [Google Scholar] [CrossRef]
  87. Zhang, X.; Fu, L.; Karkee, M.; Whiting, M.D.; Zhang, Q. Canopy segmentation using ResNet for mechanical harvesting of apples. IFAC-PapersOnLine 2019, 52, 300–305. [Google Scholar] [CrossRef]
  88. Zhang, P.; Liu, X.; Yuan, J. YOLO5-spear: A robust and real-time spear tips locator by improving image augmentation and lightweight network for selective harvesting robot of white asparagus. Biosyst. Eng. 2022, 218, 43–61. [Google Scholar] [CrossRef]
  89. Gai, R.; Chen, N.; Yuan, H. A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput. Appl. 2023, 35, 13895–13906. [Google Scholar] [CrossRef]
  90. Cui, Z.; Sun, H.M.; Yu, J.T. Fast detection method of green peach for application of picking robot. Appl. Intell. 2022, 52, 1718–1739. [Google Scholar] [CrossRef]
  91. Peng, H.X.; Huang, B.; Shao, Y.Y. Generalized improved SSD model for multi-class fruit picking target recognition in natural environment. Trans. Agric. Eng. 2018, 34, 155–162. [Google Scholar]
  92. Su, F.; Zhao, Y.; Wang, G. Tomato Maturity Classification Based on SE-YOLOv3-MobileNetV1 Network under Nature Greenhouse Environment. Agronomy 2022, 12, 1638. [Google Scholar] [CrossRef]
  93. Wu, J.; Zhang, S.; Zou, T. A Dense Litchi Target Recognition Algorithm for Large Scenes. Math. Prob. Eng. 2022, 2022, 4648105. [Google Scholar] [CrossRef]
  94. Chen, J.; Wang, Z.; Wu, J. An improved Yolov3 based on dual path network for cherry tomatoes detection. J. Food Process Eng. 2021, 44, 13803. [Google Scholar] [CrossRef]
  95. Ji, W.; Pan, Y.; Xu, B. A real-time Apple targets detection method for picking robot based on ShufflenetV2-YOLOX. Agriculture 2022, 12, 856. [Google Scholar] [CrossRef]
  96. Zhao, D.A.; Wu, R.D.; Liu, X.Y.; Zhao, Y.Y. Localization of Apple Picking Under Complex Background Based on YOLO Deep Convolutional Neural Network. Trans. Chin. Soc. Agric. Eng. 2019, 35, 164–173. [Google Scholar]
  97. Zhang, Q.; Chen, J.M.; Li, B.; Xu, C. Tomato cluster picking point identification based on RGB-D fusion and object detection. Trans. Chin. Soc. Agric. Eng. 2021, 37, 143–152. [Google Scholar]
  98. Peng, H.; Xue, C.; Shao, Y. Litchi detection in the field using an improved YOLOv3 model. Int. J. Agric. Biol. Eng. 2022, 15, 211–220. [Google Scholar] [CrossRef]
  99. Sun, G.; Wang, X. Three-dimensional point cloud reconstruction and morphology measurement method for greenhouse plants based on the Kinect sensor self-calibration. Agronomy 2019, 9, 596. [Google Scholar] [CrossRef]
  100. Isachsen, U.J.; Theoharis, T.; Misimi, E. Fast and accurate GPU-accelerated, high-resolution 3D registration for the robotic 3D reconstruction of compliant food objects. Comput. Electron. Agric. 2021, 180, 105929. [Google Scholar] [CrossRef]
  101. Xu, Z.F.; Jia, R.S.; Liu, Y.B. Fast method of detecting tomatoes in a complex scene for picking robots. IEEE Access 2020, 8, 55289–55299. [Google Scholar] [CrossRef]
  102. Rong, D.; Wang, H.; Ying, Y.; Zhang, Z.; Zhang, Y. Peach variety detection using VIS-NIR spectroscopy and deep learning. Comput. Electron. Agric. 2020, 175, 105553. [Google Scholar] [CrossRef]
  103. Wang, C.; Wan, Y.; Wang, G. Development of control system for a picking robot used in plate flame cutting. Res. Explor. Lab. 2017, 36, 41–44. [Google Scholar]
  104. Tian, Y.N.; Yang, G.D.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 2019, 157, 417–426. [Google Scholar] [CrossRef]
  105. Parvathi, S.; Selvi, S.T. Detection of maturity stages of coconuts in complex background using Faster R-CNN model. Biosyst. Eng. 2021, 202, 119–132. [Google Scholar] [CrossRef]
  106. Changhong, W.; Qiang, L.; Xin, C. Citrus recognition based on YOLOv4 neural network. J. Physics Conf. Ser. 2021, 1820, 012163. [Google Scholar]
  107. Chu, P.Y.; Li, Z.J.; Lammers, K.; Lu, R.F.; Liu, X.M. Deep learning-based apple detection using a suppression mask R-CNN. Pattern Recognit. Lett. 2021, 147, 206–211. [Google Scholar] [CrossRef]
  108. Xiong, J.; Zheng, Z.; Liang, J. Orange recognition method in night environment based on improved YOLO V3 network. J. Agric. Mach. 2020, 51, 199–206. [Google Scholar]
  109. Jia, W.; Tian, Y.; Luo, R. Detection and Segmentation of Overlapped Fruits Based on Optimized Mask R-CNN Application in Apple Harvesting Robot. Comput. Electron. Agric. 2020, 172, 105380. [Google Scholar] [CrossRef]
  110. Bi, S.; Gao, F.; Chen, J. Citrus target recognition method based on deep convolutional neural network. J. Agric. Mach. 2019, 50, 181–186. [Google Scholar]
  111. Zhang, Z.; Jia, W.; Shao, W. Optimization of FCOS network for green apple detection in complex orchard environments. Spectrosc. Spectr. Anal. 2022, 42, 647–653. [Google Scholar]
  112. Ji, W.; Zhao, D.; Cheng, F. Automatic recognition vision system guided for apple harvesting robot. Comput. Electr. Eng. 2012, 38, 1186–1195. [Google Scholar] [CrossRef]
  113. Xiong, J.; Liu, Z.; Tang, L.; Lin, R.; Bu, R.; Peng, H. Research on visual detection technology for green citrus in natural environment. Trans. Chin. Soc. Agric. Mach. 2018, 49, 45–52. [Google Scholar]
  114. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  115. Liu, X.; Zhao, D.; Jia, W. Cucumber fruits detection in greenhouses based on instance segmentation. IEEE Access 2019, 7, 139635–139642. [Google Scholar] [CrossRef]
  116. Lv, J.; Zhao, D. An algorithm for rapid tracking and recognition of target fruit for apple picking robot. Trans. Chin. Soc. Agric. Mach. 2014, 45, 65–72. [Google Scholar]
  117. Wei, X.Q.; Ji, K.; Lan, J.H. Automatic method of fruit object extraction under complex agricultural background for vision system of fruit picking robot. Optik 2014, 125, 5684–5689. [Google Scholar] [CrossRef]
  118. Arefi, A.; Motlagh, A.M.; Mollazade, K.; Teimourlou, R.F.J. Recognition and localization of ripen tomato based on machine vision. Aust. J. Crop Sci. 2011, 5, 1144–1149. [Google Scholar]
  119. He, Z.-L.; Xiong, J.-T.; Lin, R.; Zou, X.; Tang, L.Y.; Yang, Z.G.; Liu, Z.; Song, G. A method of green litchi recognition in natural environment based on improved LDA classifier. Comput. Electron. Agric. 2017, 140, 159–167. [Google Scholar] [CrossRef]
  120. Sun, S.; Song, H.; He, D. An adaptive segmentation method combining MSRCR and mean shift algorithm with K-means correction of green apples in natural environment. Inf. Process. Agric. 2019, 6, 200–215. [Google Scholar] [CrossRef]
  121. Singh, N.; Tewari, V.K.; Biswas, P.K. Image processing algorithms for in-field cotton boll detection in natural lighting conditions. Artif. Intell. Agric. 2021, 5, 142–156. [Google Scholar] [CrossRef]
  122. Chen, Y.; Xu, Z.; Tang, W. Identification of various food residuals on denim based on hyperspectral imaging system and combination optimal strategy. Artif. Intell. Agric. 2021, 5, 125–132. [Google Scholar] [CrossRef]
  123. Bargoti, S.; Underwood, J.P. Image segmentation for fruit detection and yield estimation in apple orchards. J. Field Robot. 2017, 34, 1039. [Google Scholar] [CrossRef]
  124. Xiong, J.; Zou, X.; Chen, L. Visual positioning of a picking manipulator for perturbed litchi. Trans. Chin. Soc. Agric. Eng. 2012, 28, 36–41. [Google Scholar]
  125. Zhuang, J.; Hou, C.; Tang, Y. Computer vision-based localisation of picking points for automatic litchi harvesting applications towards natural scenarios. Biosyst. Eng. 2019, 187, 1–20. [Google Scholar] [CrossRef]
  126. Benavides, M.; Cantón-Garbín, M.; Sánchez-Molina, J.A. Automatic tomato and peduncle location system based on computer vision for use in robotized harvesting. Appl. Sci. 2020, 10, 5887. [Google Scholar] [CrossRef]
  127. Lü, Q.; Cai, J.; Zhao, J.; Wang, F.; Tang, M. Real-time Recognition of Citrus on Trees in Natural Scene. Trans. Chin. Soc. Agric. Mach. 2010, 41, 170–188. [Google Scholar]
  128. Bulanon, D.M.; Kataoka, T.; Ota, Y.; Hiroma, T. AE-Automation and emerging technologies: A segmentation algorithm for the automatic recognition of fuji apples at harvest. Biosyst. Eng. 2002, 83, 405–412. [Google Scholar] [CrossRef]
  129. Humburg, D.S.; Reid, J.F. Field performance of machine vision for the selective harvest of asparagus. SAE Trans. 1991, 100, 81–92. [Google Scholar]
  130. Liu, X.; Dai, B.; He, H. Real-time object segmentation for visual object detection in dynamic scenes. In Proceedings of the 2011 International Conference of Soft Computing and Pattern Recognition (SoCPaR), Dalian, China, 14–16 October 2011; pp. 423–428. [Google Scholar]
  131. Khoshroo, A.; Arefi, A.; Khodaei, J. Detection of red tomato on plants using image processing techniques. Agric. Commun. 2014, 2, 9–15. [Google Scholar]
  132. Xiong, J.; Lin, R.; Liu, Z. The recognition of litchi clusters and the calculation of picking point in a nocturnal natural environment. Biosyst. Eng. 2018, 166, 44–57. [Google Scholar] [CrossRef]
  133. Fu, J.; Duan, X.; Zou, X. Banana detection based on color and texture features in the natural environment. Comput. Electron. Agric. 2019, 167, 105057. [Google Scholar] [CrossRef]
  134. Wang, D.; He, D.; Song, H. Combining SUN-based visual attention model and saliency contour detection algorithm for apple image segmentation. Multimed. Tools Appl. 2019, 78, 17391–17411. [Google Scholar] [CrossRef]
  135. He, B.; Zhang, Y.; Gong, J.; Fu, G.; Zhao, Y.; Wu, R. Rapid Identification of Tomato Fruits in Nighttime Greenhouses Based on Improved YOLO v5. Trans. Chin. Soc. Agric. Mach. 2022, 53, 201–208. [Google Scholar]
  136. Wachs, J.P.; Stern, H.I.; Burks, T. Low and high-level visual feature-based apple detection from multi-modal images. Precis. Agric. 2010, 11, 717–735. [Google Scholar] [CrossRef]
  137. Gen’e-Mola, J.; Vilaplana, V.; Rosell-Polo, J.R.; Morros, J.-R.; Ruiz-Hidalgo, J.; Gregorio, E. Multi-modal deep learning for Fuji apple detection using RGB-D cameras and their radiometric capabilities. Comput. Electron. Agric. 2019, 162, 689–698. [Google Scholar] [CrossRef]
  138. Fu, L.; Wang, B.; Cui, Y. Kiwifruit recognition at nighttime using artificial lighting based on machine vision. Int. J. Agric. Biol. Eng. 2015, 8, 52–59. [Google Scholar]
  139. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015. [Google Scholar]
  140. Zhao, D.; Liu, X.; Chen, Y. Nighttime recognition method for apple harvesting robots. J. Agric. Mach. 2015, 46, 15–22. [Google Scholar]
  141. Liu, X.; Zhao, D.; Jia, W.; Ruan, C.; Tang, S.; Shen, T. A method of segmenting apples at night based on color and position information. Comput. Electron. Agric. 2016, 122, 118–123. [Google Scholar] [CrossRef]
  142. Kitamura, S.; Oka, K.; Ikutomo, K.; Kimura, Y.; Taniguchi, Y. A Distinction Method for Fruit of Sweet Pepper Using Reflection of LED Light. In Proceedings of the Annual Conference of the SICE, Chofu, Japan, 20–22 August 2008. [Google Scholar]
  143. Xiong, J.; Bu, R.; Guo, W.; Chen, S.; Yang, Z. Surface Shadow Removal Method for Fruit Recognition of Harvesting Robots Under Natural Lighting Conditions. Trans. Chin. Soc. Agric. Eng. 2018, 34, 147–154. [Google Scholar]
  144. Wu, T.P.; Tang, C.K. A bayesian approach for shadow extraction from a single image. In Proceedings of the Tenth IEEE International Conference on Computer Vision, Beijing, China, 20–26 June 2005; pp. 480–487. [Google Scholar]
  145. Han, G.; Cosker, D. User-assisted image shadow removal. Image Vis. Comput. 2017, 62, 19–27. [Google Scholar]
  146. Liu, Y.; Shi, J.; Zhang, Y. Shadow Removal Algorithm for Single Outdoor Image. J. Softw. 2012, 23, 168–175. [Google Scholar]
  147. Levine, M.D.; Bhattacharyya, J. Removing shadows. Pattern Recognit. Lett. 2005, 26, 251–265. [Google Scholar] [CrossRef]
  148. Qu, L.; Tian, J.; Han, Z. Pixel-wise orthogonal decomposition for color illumination invariant and shadow-free image. Opt. Express 2015, 23, 2220–2239. [Google Scholar] [CrossRef]
  149. Shen, L.; Tan, P.; Lin, S. Intrinsic image decomposition with non-local texture cues. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 24–26 July 2008; pp. 1–7. [Google Scholar]
  150. Shen, L.; Yeo, C. Intrinsic images decomposition using a local and global sparse representation of reflectance. In Proceedings of the Computer Vision and Pattern Recognition 2011, Colorado Springs, CO, USA, 21–23 July 2011; pp. 697–704. [Google Scholar]
  151. Laffont, P.Y.; Bousseau, A.; Paris, S. Coherent intrinsic images from photo collections. ACM Trans. Graph. 2012, 31, 1–11. [Google Scholar] [CrossRef]
  152. Figov, Z.; Koppel, M. Detecting and removing shadows. In Proceedings of the International Conference on Computer Graphics and Imaging, Las Vegas, NV, USA, 30 June–3 July 2004. [Google Scholar]
  153. Baba, M.; Mukunoki, M.; Asada, N. Shadow removal from a real image based on shadow density. In Proceedings of the ACM SIGGRAPH, Los Angeles, CA, USA, 8–12 August 2004. [Google Scholar]
  154. Weiss, Y. Deriving intrinsic images from image sequences. In Proceedings of the Eighth IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2001; pp. 68–75. [Google Scholar]
  155. Matsushita, Y.; Lin, S.; Kang, S.B. Estimating intrinsic images from image sequences with biased illumination. In Proceedings of the European Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004; pp. 274–286. [Google Scholar]
  156. Guo, A.X.; Zou, X.J.; Zhu, M.S.; Chen, Y.; Xiong, J.; Chen, L. Analysis and recognition of color characteristics of litchi fruit and fruit clusters based on exploratory analysis. Trans. Chin. Soc. Agric. Eng. 2013, 29, 191–198. [Google Scholar]
  157. Xiong, J.T.; Zou, X.J.; Wang, H.J. Mature litchi identification under different lighting conditions based on Retinex image enhancement. Trans. Chin. Soc. Agric. Eng. 2013, 29, 170–178. [Google Scholar]
  158. Peng, H.X.; Zou, X.J.; Chen, L.J. Fast identification of multi-color targets of litchi in the field based on dual-threshold Otsu algorithm. Trans. Chin. Soc. Agric. Mach. 2014, 45, 61–68. [Google Scholar]
  159. Ding, Y.; Lee, W.S.; Li, M. Feature extraction of hyperspectral images for detecting immature green citrus fruit. Front. Agric. Sci. Eng. 2018, 5, 475–484. [Google Scholar] [CrossRef]
  160. Zhuang, J.J.; Luo, S.M.; Hou, C.J.; Tang, Y.; He, Y.; Xue, X.Y. Detection of orchard citrus fruits using a monocular machine vision-based method for automatic fruit picking applications. Comput. Electron. Agric. 2018, 152, 64–73. [Google Scholar] [CrossRef]
  161. Saedi, S.I.; Khosravi, H. A deep neural network approach towards real-time on-branch fruit recognition for precision horticulture. Expert Syst. Appl. 2020, 159, 113594. [Google Scholar] [CrossRef]
  162. Xu, Y. Two-stage approach for detecting slightly overlapping strawberries using HOG descriptor. Biosyst. Eng. 2013, 115, 144–153. [Google Scholar] [CrossRef]
  163. Lin, Y.; Lv, Z.; Yang, C.; Lin, P.; Chen, F.; Hong, J. Recognition and Experiment of Overlapping Honey Pomelo in Natural Scene Images. Trans. Chin. Soc. Agric. Eng. 2021, 37, 158–167. [Google Scholar]
  164. Abdulla, A.A.; Ahmed, M.W. An improved image quality algorithm for exemplar-based image inpainting. Multimed. Tools Appl. 2021, 80, 13143–13156. [Google Scholar] [CrossRef]
  165. Hedjazi, M.A.; Genc, Y. Efficient texture-aware multi-GAN for image inpainting. Knowl.-Based Syst. 2021, 217, 106789. [Google Scholar] [CrossRef]
  166. Arun, P.L.; Kumar, R.M.S. Non-linear sorenson-dice exemplar image inpainting based bayes probability for occlusion removal in remote traffic control. Multimed. Tools Appl. 2021, 80, 11523–11538. [Google Scholar] [CrossRef]
  167. Lv, J.D.; Wang, F.; Xu, L.M.; Ma, Z.H.; Yang, B. A segmentation method of bagged green apple image. Sci. Hortic. 2019, 246, 411–417. [Google Scholar] [CrossRef]
  168. Tian, Y.; Duan, H.; Luo, R.; Zhang, Y.; Jia, W.; Lian, J.; Zheng, Y.; Ruan, C.; Li, C. Fast recognition and location of target fruit based on depth information. IEEE Access 2019, 7, 170553–170563. [Google Scholar] [CrossRef]
  169. Liang, X.; Jin, C.; Ni, M. Acquisition and Experiment of the Position Information of Tomato Fruit String Picking Points. Trans. Chin. Soc. Agric. Eng. 2018, 34, 163–169. [Google Scholar]
  170. Lin, S.; Wang, N. Cloud robotic grasping of Gaussian mixture model based on point cloud projection under occlusion. Assem. Autom. 2021, 41, 312–323. [Google Scholar] [CrossRef]
  171. Chen, L. Research on the Strawberry Harvest Robot Picking System. Master’s Thesis, China Agricultural University, Beijing, China, 2005. [Google Scholar]
  172. Li, C.-Y.; Fang, A.Q.; Tan, H. Elevated Strawberry Picking Robot System Research. Mach. Des. Manuf. 2017, 6, 245–247. [Google Scholar]
  173. Shiigi, T.; Kurita, M.; Kondo, N.; Ninomiya, K.; Rajendra, P.; Kamata, J. Strawberry harvesting robot for fruits grown on tabletop culture. In Proceedings of the American Society of Agricultural and Biological Engineers, Providence, RI, USA, 29 June–2 July 2008; p. 084046. [Google Scholar]
  174. Liu, Z.; Liu, G.; Qiao, J. Three-Dimensional Visual Sensor Design of Apple Harvesting Robot. Trans. Chin. Soc. Agric. Mach. 2010, 41, 171–175. [Google Scholar]
  175. Hayashi, S.; Shigematsu, K.; Yamamoto, S.; Kobayashi, K.; Kohno, Y.; Kamata, J. Evaluation of a strawberry-harvesting robot in a field test. Biosyst. Eng. 2010, 105, 160–171. [Google Scholar] [CrossRef]
  176. Zou, X.J.; Ye, M.; Luo, C.H. Fault-tolerant design of a limited universal fruit-picking end-effector based on vision-positioning error. Appl. Eng. Agric. 2016, 32, 5–18. [Google Scholar]
  177. Africa, A.D.M.; Tabalan, A.R.V.; Tan, M.A.A. Ripe fruit detection and classification using machine learning. Int. J. 2020, 8, 60852020. [Google Scholar] [CrossRef]
  178. Xiang, R.; Duan, P.F. Design and experiment of night lighting system for tomato harvesting robots. J. Agric. Mach. 2016, 47, 8–14. [Google Scholar]
  179. Lv, J.; Zhao, D.A.; Wei, J.; Ding, S.H. Recognition of Overlapping and Occluded Fruits in Natural Environment. Optik 2016, 127, 1354–1362. [Google Scholar] [CrossRef]
  180. Wasaki, F.; Imamura, H. A robust recognition method for occlusion of mini tomatoes based on hue information and the curvature. Int. J. Image Graph. 2015, 15, 1540004. [Google Scholar] [CrossRef]
Figure 3. Purposes of different image detection algorithms: (a) branch and leaf segmentation [66]; (b) fruit and vegetable image segmentation [7]; (c) fruit and vegetable image detection [67]; (d) branch and leaf detection [68]; (e) fruit and vegetable detection [9]; (f) flowchart for obtaining the MFBB mask [69]. (g–j) Results of stem and calyx recognition: (g) results of stem and calyx recognition at gray level; (h) 3D surface reconstruction image with a standard spherical model image; (i) ratio image of the 3D surface reconstruction image and the standard spherical model image; (j) results of stem and calyx recognition [70].
Figure 4. The proportion of image segmentation algorithm applications in harvest recognition and localization.
Figure 5. YOLO Model.
Figure 6. The proportion of applied YOLO model optimization methods in harvest recognition and localization.
Figure 7. Complex agricultural environments.
Figure 8. The research proportion of the main challenges in fruit and vegetable picking robot recognition and localization: (A) different lighting environments; (B) overlap and occlusion; (C) complex background; (D) uncertainty in harvesting; (E) the research proportion of each challenge.
Table 2. Image segmentation algorithms for harvesting recognition and localization.

| Algorithms | Image Segmentation Algorithms | Module | Cited References | Object | Detection Time | Detection Accuracy |
|---|---|---|---|---|---|---|
| Traditional segmentation | Depth thresholding segmentation | HSV thresholding | [71,72,73,74,75] | Tomato, orange | 2.34 s | 83.5–93% |
| Traditional segmentation | Similarity measure segmentation | NCC, K-means | [46,66,76,33,77,78] | Tomato, orange, lychee, cucumber | 0.054–7.58 s | 85–98% |
| Traditional segmentation | Image binarization segmentation | Otsu | [79] | Grape | 0.61 s | 90% |
| Traditional segmentation | Shape segmentation algorithm | Hough circle transform | [75,80] | Banana, apple | 0.009–0.13 s | 93.2% |
| Machine learning | Semantic segmentation algorithms | PSP-Net, U-Net | [81,82] | Lychee, cucumber | – | 92.5–98% |
| Machine learning | Instance segmentation algorithms | Mask R-CNN, YOLACT | [7,83,84,69,85,86] | Tomato, strawberry, lychee | 0.04–0.154 s | 89.7–95.78% |
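To make the traditional route in Table 2 concrete, the sketch below combines HSV color thresholding with Otsu binarization and contour extraction in Python/OpenCV. It is a minimal illustration of this class of methods, not the pipeline of any cited study; the hue/saturation ranges, the minimum region area, and the file name are illustrative assumptions.

```python
import cv2
import numpy as np

# Minimal sketch of the traditional route in Table 2: HSV thresholding,
# Otsu binarization, and contour extraction. All thresholds are assumptions.
image = cv2.imread("tomato.jpg")  # hypothetical input frame from the robot camera
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

# Red hues wrap around 0 on OpenCV's 0-179 hue scale, so two ranges are combined.
mask_low = cv2.inRange(hsv, (0, 80, 60), (10, 255, 255))
mask_high = cv2.inRange(hsv, (170, 80, 60), (179, 255, 255))
color_mask = cv2.bitwise_or(mask_low, mask_high)

# Otsu's method refines the mask on the saturation channel inside the color mask.
saturation = cv2.bitwise_and(hsv[:, :, 1], hsv[:, :, 1], mask=color_mask)
_, otsu_mask = cv2.threshold(saturation, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Morphological opening removes speckles; contours give candidate fruit regions.
clean = cv2.morphologyEx(otsu_mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
contours, _ = cv2.findContours(clean, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    if cv2.contourArea(contour) > 500:  # drop tiny regions (assumed area threshold)
        x, y, w, h = cv2.boundingRect(contour)
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```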
Table 3. YOLO model optimization algorithms.

| Specific Methods | Cited References | Object | Detection Time | Detection Accuracy |
|---|---|---|---|---|
| Introducing residual modules (ResNet) | [67,87] | Tomato, lychee | 0.017–0.093 s | 94.44–97.07% |
| Modifying or replacing the backbone feature extraction network | [8,68,88,89,90,91] | Citrus, tea bud, cherry, apple, green peach | 0.01–0.063 s | 86.57–97.8% |
| Applying the K-means clustering algorithm to combine predicted candidate boxes | [43,92,93,94] | Tomato, citrus, lychee, cherry tomato | 0.058 s | 79–94.29% |
| Incorporating attention mechanism modules | [91,92,95] | Apple, tomato | 0.015–0.227 s | 86.57–97.5% |
| Enhancing the activation function | [89,91,96,97] | Apple, tomato, lychee, navel orange, Emperor orange | 0.467 s | 94.7% |
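One of the optimization routes listed in Table 3 applies K-means clustering to the candidate (anchor) boxes. A common way to realize this idea is to cluster the width–height pairs of the labelled boxes with 1 − IoU as the distance and use the cluster centres as YOLO anchors. The sketch below, with made-up box sizes, illustrates that interpretation; it is not the exact procedure of any cited work.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors compared by width/height only (centres aligned)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=3, iters=100, seed=0):
    """Cluster (width, height) pairs with 1 - IoU as the distance metric."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)   # nearest anchor = highest IoU
        new = np.array([np.median(boxes[assign == j], axis=0) if np.any(assign == j)
                        else anchors[j] for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

# Hypothetical (width, height) pairs of labelled fruit boxes, in pixels.
boxes = np.array([[32, 30], [35, 33], [60, 58], [64, 61], [95, 90], [100, 96]], float)
print(kmeans_anchors(boxes, k=3))
```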
Table 4. Object three-dimensional reconstruction algorithms in crop harvesting recognition and localization.

| Specific Methods | Cited References | Object | Reconstruction Accuracy |
|---|---|---|---|
| Density-based point clustering and localization approximation method | [9] | Strawberry | 74.1% |
| Nearest-point iteration algorithm | [99] | Apple | 85.49% |
| Delaunay triangulation method | [70] | Apple | 97.5% |
| Three-dimensional reconstruction based on the iterative closest point (ICP) algorithm | [100] | Apples, bananas, cabbage, pears | – |
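The nearest-point iteration and ICP entries in Table 4 rest on the same principle: repeatedly match each measured point to its closest model point and re-estimate a rigid transform. The NumPy/SciPy sketch below shows a basic point-to-point ICP loop on synthetic data; it is a simplified illustration of that principle, not the implementation used in [99] or [100].

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(source, target, iters=50, tol=1e-6):
    """Point-to-point ICP: rigidly align `source` (N, 3) onto `target` (M, 3)."""
    tree = cKDTree(target)                        # nearest-neighbour structure, built once
    R, t = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iters):
        moved = source @ R.T + t
        dist, idx = tree.query(moved)             # 1. correspondences
        matched = target[idx]
        mu_s, mu_t = moved.mean(axis=0), matched.mean(axis=0)
        H = (moved - mu_s).T @ (matched - mu_t)   # 2. best rigid fit (Kabsch / SVD)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R_step = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t_step = mu_t - R_step @ mu_s
        R, t = R_step @ R, R_step @ t + t_step    # 3. accumulate the transform
        err = dist.mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R, t

# Synthetic check: a rotated and shifted copy of random points should align back.
rng = np.random.default_rng(1)
target = rng.random((200, 3))
angle = np.deg2rad(10.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
source = (target - 0.1) @ R_true.T                # each point: R_true @ (p - 0.1)
R_est, t_est = icp(source, target)
print(np.abs(source @ R_est.T + t_est - target).mean())   # residual should be small
```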
Table 5. Challenges and solutions in the recognition and positioning of robots for fruit and vegetable harvesting.

| Challenges in Recognition and Positioning | Solutions | Cited References | Object | Time | Accuracy Rate |
|---|---|---|---|---|---|
| Complex background | Deep learning technology | [104,105,106,107,108,109,110,111,112,113,114] | Orange, apple, green apple, lime, cucumber | 0.06–0.352 s | 85.49–90.75% |
| Complex background | Based on color features | [115,116,117,118] | Apple, tomato | 0.017 s | 43.9% |
| Complex background | Addressing the limitations of color features | [115,119,120,121,68,122,123] | Cucumber | 0.346 s | 89.47% |
| Complex background | Based on spatial relationships | [124,125,126] | Lychee, tomato | 0.03 s | 80.8% |
| Complex background | Removing background interference | [118,127,128,129,130,131,132,133,134] | Lychee, banana | 0.343–0.516 s | 89.63–93.75% |
| Different lighting environments | Research in nighttime environments | [132,135,136,137,49,78,138,139] | Kiwi, lychee, tomato, green apple | 0.516 s | 74–96.2% |
| Different lighting environments | Adding light sources | [140,141,142] | Apple, tomato, green pepper | – | 67.79–80.8% |
| Different lighting environments | Removing shadows | [143,144,145,146,147,148,149,150,151,152,153,154,155] | – | – | 83.16% |
| Different lighting environments | Research under natural lighting conditions | [63,156,157,11,96,158,159] | Green pepper, lychee, green orange, tomato | 0.105–0.2 s | 59.2–94.75% |
| Different lighting environments | Handling uneven lighting | [49,160] | – | – | 86% |
| Overlap and occlusion | Directly detecting occluded and overlapping fruit images | [1,86,161,162] | Strawberry | 0.008 s | 87–99.8% |
| Overlap and occlusion | Classifying and recognizing occluded and unoccluded fruit | [91,160,163] | Apple, citrus, pomelo | 0.015 s | 91.48–94.02% |
| Overlap and occlusion | Image restoration (inpainting) | [160,164,165,166] | – | – | 95.96–99.3% |
| Overlap and occlusion | Computation and multi-sensor detection | [167,168,169,170] | Apple, tomato, cherry tomato | – | 78.8–96.61% |
| Uncertainty in harvesting | Reducing overall vibrations | [50,171,172,173,174] | Strawberry | – | 38% |
| Uncertainty in harvesting | Sensor interference | [175] | Strawberry | 11.5 s | 38.1% |
| Uncertainty in harvesting | Establishing a fault-tolerant mathematical model | [176] | Lychee, citrus | – | 78% |
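Several of the complex-background entries in Table 5 rely on an RGB-D camera to suppress distant background pixels before any color-based recognition step. The NumPy sketch below applies such a depth cutoff; the 1.2 m range, the array names, and the random test frame are illustrative assumptions, not values from the cited studies.

```python
import numpy as np

def remove_far_background(rgb, depth_m, max_range_m=1.2):
    """Keep only near-field pixels of an aligned RGB-D frame.

    rgb     : (H, W, 3) uint8 color image
    depth_m : (H, W) float32 depth in metres (0 where the sensor returned no data)
    """
    near = (depth_m > 0) & (depth_m <= max_range_m)   # valid, close-range pixels
    masked = rgb.copy()
    masked[~near] = 0                                  # black out distant background
    return masked, near

# Hypothetical aligned RGB-D frame; in practice this would come from one of the
# stereo or structured-light cameras discussed earlier in the review.
rgb = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
depth = np.random.uniform(0.3, 3.0, (480, 640)).astype(np.float32)
foreground, mask = remove_far_background(rgb, depth)
```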
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
