Introduction
Autonomous combine harvesters, which interact richly with their physical environment, require accurate three-dimensional (3D) layout knowledge of crop fields to facilitate navigation and harvesting tasks. For instance, auto-steering control relies on precise localization of crop rows or cut edges while travel speed adjustment relies on crop height, biomass, and terrain profile. Dense 3D mapping techniques involve measuring the geometric structure of an environment and have been extensively studied in both academia and industry. Mainstream methods for dense 3D mapping fall into two primary categories: LiDAR-based methods and camera-based methods. LiDAR sensors directly provide accurate depth information, enabling easy construction of 3D maps in the form of point clouds. However, their high cost and power consumption limit their widespread usage in agricultural vehicles with restricted computational resources. In contrast to LiDAR sensors, vision sensors are lightweight, cost-effective, passive, and provide rich information about the appearance, color and texture of the environment, making them easy to deploy on low-cost and power-efficient agricultural vehicles.
Despite the tremendous progress in vision-based 3D sensing, applying existing algorithms to autonomous combine harvesters makes it challenging to achieve satisfactory accuracy and efficiency. As shown in Figure 1(b), unstructured, natural crop fields pose specific challenges: on one hand, highly similar textures combined with uneven illumination conditions make visual measurement an intractable task; on the other hand, real-time dense mapping of large-scale scenes imposes a heavy computational and memory burden on a resource-constrained mobile platform. To tackle these challenges, carefully designed and specifically tuned visual-perception algorithms are needed to perform well in agricultural scenarios.

A dense 3D mapping example of a real crop field. (a) Experimental platform; (b) a real crop field; (c) depth map.
In this work, we present a feature-based, two-stage approach that enables fast and dense mapping of crop fields observed by a vehicle-mounted stereo camera. Our proposed approach models the 3D geometric scenes by combining Bayesian inference and distinctive point cues on images. Figure 1 exhibits the test vehicle, test environment (crop fields), and the output of our method. The primary advantage of the proposed method is that it establishes reliable sparse stereo matching with informative and discriminative key points. This, combined with rigorous validation to remove outliers, effectively handles the matching ambiguity problem. Extensive evaluation was performed on real crop field images, and the results demonstrate that our method achieves both accuracy and speed and is well-suited for 3D dense perception for combine harvesters.
Traditional feature-based Visual Odometry (VO) or Simultaneous Localization and Mapping (SLAM) systems track a sparse set of 3D points for state estimation. Although these sparse 3D points are insufficient for tasks involving interaction with the environment, they roughly represent the geometric structure of the 3D scene. Our method aims to densify the sparse 3D points, building upon our previous work, a feature-based VO system. 1 This approach is motivated by the fact that crop fields have rich textures, allowing a finite number of 3D points to be extracted reliably and rapidly. In this work, we construct sparse stereo correspondences using a combination of the FAST detector and the FREAK descriptor, both of which are computed from local intensity comparisons (order statistics) rather than raw intensity values. This makes them insensitive to illumination variations and noise. Additionally, descriptors based on intensity order are easy to implement, fast to match, and memory efficient. A notable property of the FREAK descriptor is its discriminative capacity, which, combined with an outlier rejection scheme, provides tolerance to the similar textures found in crop fields. 2
The primary contributions of our work can be summarized as follows:
We have developed a specialized 3D mapping algorithm for combine harvesters that effectively balances accuracy and speed, even under the computational constraints and complex conditions typical of agricultural settings. Our approach uniquely combines discriminative feature points with Bayesian modeling to achieve accurate dense disparity calculations in real time. Through extensive real-world testing in crop fields, we have rigorously validated our algorithm's performance, confirming its accuracy, speed, and robustness.
The remainder of this article is organized as follows: Section 2 presents related work on the vision-based environment perception for agricultural applications. Section 3 gives a detailed description of our approach, including a sparse 3D map construction procedure and the mathematical derivation of the Bayesian inference for dense disparity computation. Section 4 presents extensive experiments and gives ideas for future work. Finally, we conclude this work in Section 5.
Related work
The problem of mapping agricultural scenes involves obtaining 3D geometry information of the environment from 2D images, including terrain, plant traits, crop rows or cut edges, etc.3,4 Rovira-Más et al. 5 described the benefits of mapping a vehicle's surroundings for precision agriculture. They presented a perception system that generates 3D terrain maps of agricultural scenes using stereo cameras to create 3D point clouds and transform them into geodetic coordinates, which are then assembled into a global field map. Their results were validated in multiple field tests. Ro and Nn 6 introduced a visual perception system capable of mapping crop rows with a Fujifilm camera, suitable for spraying applications. Gai et al. 7 developed a vision-based under-canopy navigation and mapping system using a front-facing Time-of-Flight (ToF) camera. Their proposed system provided interrow vehicle localization data and generated crop field maps as occupancy grids. Test results demonstrated the feasibility of robotic navigation between crop rows using a ToF camera under crop canopies. Considering that ToF or structured-light cameras are susceptible to sunlight and monocular cameras are prone to scale drift, stereo cameras are promising sensors for large-scale agricultural scene sensing due to their ability to obtain depth information inexpensively.
Stereo vision-based 3D reconstruction is a popular topic in the computer vision and robotics communities. According to the taxonomy proposed by Scharstein, 8 most stereo vision algorithms can be classified into global and local methods, typically performing four steps: matching cost calculation, cost aggregation, disparity optimization, and disparity refinement. Global methods minimize a global energy function based on explicit smoothness assumptions; they achieve high accuracy but are usually impractical for real-time applications on resource-limited platforms. Local methods are generally fast but less accurate, and it is difficult to determine an adequate size for their local correlation windows. Hirschmüller 9 proposed an efficient algorithm, semiglobal stereo matching (SGM), which uses a hierarchical mutual information (HMI)-based matching cost and optimizes for consistency along image rows, columns, and diagonals. SGM achieves competitive performance in both accuracy and speed, and we compare it to our method.
In between these two general categories are feature-based algorithms that infer dense maps from sparse disparity maps by propagating reliable disparity values to neighboring pixels. Geiger et al. 10 proposed a binocular stereo approach, called efficient large-scale stereo matching (ELAS), which first computes a set of robustly matched support points using a 3 × 3 Sobel filter in the local area. These support points provide valuable priors for the disparity calculation of the remaining pixels, greatly reducing the search range and producing an accurate, dense 3D map. Jellal et al. 11 presented a modified ELAS, called LS-ELAS, which adds line segments when determining the support points, increasing performance and robustness on images with line/edge features.
Recent advances in deep learning have revolutionized computer-vision tasks, and learning-based dense 3D mapping algorithms have achieved impressive performance on the most popular datasets.12–14 However, this performance comes at the cost of high computational and memory requirements, rendering these approaches unsuitable for agricultural devices with limited processing power, especially for large-scale, high-resolution dense 3D mapping tasks.
Materials and methods
Overview
This section details the process of constructing a dense depth map of a crop field. The method is based on the idea that a finite number of distinctive 3D points can be obtained easily by reliable stereo matching, providing an accurate prior representation of the crop field in the form of a triangular mesh. The disparity of each pixel is then modeled via Bayesian inference, allowing efficient and robust generation of dense 3D maps. Figure 2 illustrates the flowchart and primary components of our method, which consists of two stages: sparse 3D point generation and dense disparity computation.

A flowchart of the proposed method.
Sparse 3D point generation
Our method begins with image undistortion and stereo rectification using known parameters, ensuring that the epipolar lines are aligned horizontally. Next, the input stereo image is converted to 8 bpp greyscale since the keypoint detection and description are based on pixel intensity. Lastly, we apply a Gaussian filter to normalize image brightness and reduce noise.
Keypoint extraction and description
We extract FAST keypoints from the preprocessed stereo images. We observe that when the FAST detector is applied directly to richly textured agricultural images, it generates excessive, clustered keypoints. To achieve a homogeneous keypoint distribution, we divide the stereo image into a two-dimensional grid with H rows and W columns (H and W are proportional to the image resolution). Keypoint detection and description are performed independently for each grid cell, which enables real-time performance through a parallel implementation on a multicore CPU. We use the Harris corner response function 15 to rank the features in descending order and retain the top 10 keypoints per cell. A small FAST threshold is employed to detect a sufficient number of keypoints in each cell, and an adaptive threshold on the Harris corner responses is used to preserve the desired number of keypoints. For each retained FAST keypoint, we compute its FREAK descriptor, which encodes the local image structure into a binary string in a discriminative manner.
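The grid-based selection described above can be sketched as follows (a minimal pure-Python illustration; the `Keypoint` container, the helper name, and the grid parameters are hypothetical, while the detector itself and the Harris response values are assumed to come from an existing library such as OpenCV):

```python
from dataclasses import dataclass

@dataclass
class Keypoint:
    x: float         # pixel column
    y: float         # pixel row
    response: float  # Harris corner response

def select_per_cell(keypoints, img_w, img_h, grid_w, grid_h, top_k=10):
    """Bucket keypoints into a grid_h x grid_w grid and keep the
    top_k strongest (by Harris response) in each cell."""
    cell_w = img_w / grid_w
    cell_h = img_h / grid_h
    cells = {}
    for kp in keypoints:
        key = (min(int(kp.y / cell_h), grid_h - 1),
               min(int(kp.x / cell_w), grid_w - 1))
        cells.setdefault(key, []).append(kp)
    selected = []
    for kps in cells.values():
        kps.sort(key=lambda k: k.response, reverse=True)
        selected.extend(kps[:top_k])  # cap per cell for homogeneity
    return selected
```

Because each cell is processed independently, the per-cell loop parallelizes naturally across CPU cores, which is what keeps this stage real time in the C++ implementation.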
Sparse stereo matching
After keypoint detection and description, the next step is to perform stereo matching, as shown in Figure 3. For each FAST keypoint in the left image, we search for its correspondence in the right image along the same row, as the input stereo image is rectified. We use the Hamming distance to compute the dissimilarity of two descriptors (i.e., the number of bits that differ), which can be computed very quickly with an XOR and a bit-count operation on modern CPUs supporting the Streaming SIMD Extensions instruction set. To further increase correspondence robustness, the distance ratio test 16 is used to reject ambiguous correspondences for which the best matching distance is not sufficiently smaller than the second-best.
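The matching step can be illustrated with a minimal sketch (hypothetical helper names; descriptors are shown as small Python integers for brevity, whereas a real FREAK descriptor is a 512-bit string, and the 0.8 ratio threshold is an assumed value, not the paper's tuned parameter):

```python
def hamming(d1: int, d2: int) -> int:
    """Hamming distance between two binary descriptors stored as ints:
    XOR the bit strings, then count the set bits."""
    return bin(d1 ^ d2).count("1")

def match_ratio_test(desc_left, descs_right, ratio=0.8):
    """Return the index of the best right-image candidate, or None when
    the best distance is not clearly smaller than the second best."""
    dists = sorted((hamming(desc_left, d), i)
                   for i, d in enumerate(descs_right))
    if not dists:
        return None
    if len(dists) == 1:
        return dists[0][1]
    best, second = dists[0], dists[1]
    if best[0] < ratio * second[0]:
        return best[1]
    return None  # ambiguous correspondence rejected
```

In the C++ implementation the XOR/bit-count pair maps to a single SIMD population-count, which is why Hamming matching of binary descriptors is so much faster than L2 matching of float descriptors.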

FAST keypoints are detected and matched in stereo frame.

Illustration of the stereo triangulation.
Dense disparity computation
Triangle meshes are constructed by applying Delaunay triangulation 17 to the 3D points obtained in the previous steps, as shown in Figure 5. The vertices of the triangles form a set of support points
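As a sketch of this step (using `scipy.spatial.Delaunay` in place of the C++ Triangle library that the implementation uses; the function name and the (u, v, disparity) row layout are assumptions for illustration):

```python
import numpy as np
from scipy.spatial import Delaunay

def build_support_mesh(support_points):
    """Triangulate matched support points in the left image plane.
    support_points: (N, 3) array-like of (u, v, disparity) rows; the
    mesh is built on the (u, v) pixel coordinates only."""
    pts = np.asarray(support_points, dtype=float)
    tri = Delaunay(pts[:, :2])   # 2D Delaunay on image coordinates
    return tri.simplices         # (M, 3) vertex indices per triangle
```

Each returned triangle carries the disparities of its three vertices, which is exactly the piecewise-planar prior used in the next stage.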

Probabilistic modeling illustration of disparity calculation for pixel
We followed Geiger's strategy 10 to model stereo matching as a probabilistic generative model, as shown in Figure 5. Assuming that the visual measurements
We calculate the per-pixel disparity
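The full per-pixel inference follows the probabilistic model above, whose equations are given in the text; as an illustration of one ingredient, the sketch below shows how a disparity prior at a pixel can be linearly interpolated from the three support points of its enclosing triangle (hypothetical function name; standard barycentric weights):

```python
def plane_disparity_prior(tri, d, u, v):
    """Linearly interpolate a disparity prior at pixel (u, v) from a
    triangle's three support points.
    tri: three (u, v) vertices; d: their three disparities."""
    (u0, v0), (u1, v1), (u2, v2) = tri
    det = (v1 - v2) * (u0 - u2) + (u2 - u1) * (v0 - v2)
    w0 = ((v1 - v2) * (u - u2) + (u2 - u1) * (v - v2)) / det
    w1 = ((v2 - v0) * (u - u2) + (u0 - u2) * (v - v2)) / det
    w2 = 1.0 - w0 - w1
    return w0 * d[0] + w1 * d[1] + w2 * d[2]  # barycentric blend
```

This prior is what restricts the per-pixel search to a narrow band of candidate disparities, which is the main source of the method's speed.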
We perform disparity refinement to correct invalid points. The left-right consistency test is employed to detect invalid pixels. For a pixel
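A minimal sketch of the left-right consistency test (hypothetical function name; the one-pixel tolerance is an assumed default, and plain nested lists stand in for image buffers):

```python
def lr_consistency(disp_left, disp_right, max_diff=1):
    """Invalidate pixels whose left and right disparities disagree.
    disp_left[v][u] should agree with disp_right[v][u - disp_left[v][u]];
    pixels failing the test are set to -1 (invalid)."""
    h, w = len(disp_left), len(disp_left[0])
    out = [row[:] for row in disp_left]
    for v in range(h):
        for u in range(w):
            dl = disp_left[v][u]
            ur = u - dl  # corresponding column in the right image
            if ur < 0 or ur >= w or abs(dl - disp_right[v][ur]) > max_diff:
                out[v][u] = -1  # failed the left-right test
    return out
```

Invalidated pixels are then filled during refinement, typically from valid neighbors along the scanline.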
Results
This section presents the evaluation of the proposed method in terms of accuracy and real-time performance on real crop field data recorded from a vehicle-mounted stereo camera. A wheeled combine harvester (Model: 4LZ-8F, World Co., Ltd, Danyang, China) was employed as the mobile platform. A stereo camera (S1010-IR-120/Mono, MYNT EYE, Beijing, China) was mounted on top of the cabin, in the middle of the combine harvester, in a forward-looking setup at a height of 3.06 m, as shown in Figure 6. Stereo images were acquired at 15 Hz with 1.5-megapixel resolution and cropped to different resolutions for evaluation. Table 1 shows the technical specifications of the MYNT EYE camera. Since the platform was not equipped with a laser scanner, ground-truth disparities were obtained with the MYNT EYE SDK (Software Development Kit), computed offline (without real-time constraints) on a desktop computer with the CUDA Toolkit.

MYNT EYE stereo camera installation diagram.
MYNT EYE technical specifications.
The proposed method was developed in C++ under Ubuntu 20.04, running on a standard laptop with an Intel Core i7-8550 (4 cores/8 threads @ 2.60 GHz) and 16 GB RAM. The MYNT EYE stereo camera was calibrated using the method proposed by Bradski 20 to obtain the intrinsic and distortion parameters. The OpenCV library is used for image I/O, contrast enhancement, and feature-related operations, and the Triangle library 17 was employed to construct the Delaunay triangle mesh. We compared our method with three typical algorithms: block matching (BM), SGM, 9 and ELAS, 11 which are freely available in the OpenCV library and were run with their default parameter settings.
Accuracy
The root-mean-squared error (RMSE) (equation (10)) and the percentage of bad matching pixels (BMP) (equation (11)) were used as metrics to quantitatively evaluate the accuracy of the proposed method. 21 RMSE computes the root mean squared error between the estimated disparity
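The two metrics can be sketched as follows (flat pixel lists for brevity; the exact forms are given by equations (10) and (11), and the 1-pixel BMP threshold used here is a common default, not necessarily the paper's setting):

```python
import math

def rmse(d_est, d_gt):
    """Root mean squared disparity error over all evaluated pixels."""
    n = len(d_est)
    return math.sqrt(sum((e - g) ** 2 for e, g in zip(d_est, d_gt)) / n)

def bad_matching_pixels(d_est, d_gt, tau=1.0):
    """Percentage of pixels whose absolute disparity error exceeds tau."""
    n = len(d_est)
    bad = sum(1 for e, g in zip(d_est, d_gt) if abs(e - g) > tau)
    return 100.0 * bad / n
```

RMSE penalizes large outliers quadratically, while BMP counts how many pixels are wrong at all, so the two metrics are complementary.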

Evolution of the average running time.
Average error for all methods on four stereo images (bold denotes the best results).
Abbreviations: BM: block matching; ELAS: efficient large-scale stereo matching; SGM: semiglobal stereo matching; RMSE: root mean square error.
Runtime
Dense 3D mapping must run in real time to be practical in the perception system of a combine harvester. To evaluate the real-time performance of the proposed method, four stereo image pairs with resolutions ranging from 0.5 MP to 1.5 MP were processed on a standard CPU. Additionally, a coarse estimation of the runtime of the other methods was performed under the same conditions. The average runtime over the four stereo images is reported for all methods in Figure 7. The results show that the proposed method can estimate a dense 3D map at 3 Hz on 0.5 MP images, which is sufficient for online perception of combine harvesters travelling at relatively slow speeds (0.8 to 2.0 m/s).
A coarse estimation of the execution time (in milliseconds) of the principal components was conducted, as shown in Table 3. Dense disparity computation is the most expensive task in the 3D reconstruction process, and its cost depends on the image resolution and the size of the Census transform window. For our method there is always a trade-off between accuracy and speed: increasing the image resolution or the Census transform window size improves accuracy, but at the cost of higher computational requirements.
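For reference, the Census transform mentioned above can be sketched as follows (pure Python over interior pixels only; the per-pixel cost grows quadratically with the window size, which is exactly the accuracy/speed trade-off discussed above, and window size 3 is an illustrative default):

```python
def census_transform(img, win=3):
    """Census transform: each interior pixel becomes a bit string of
    comparisons between its neighbors and itself (1 if neighbor is
    darker than the center). img: 2D list of intensities; border
    pixels keep code 0."""
    r = win // 2
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for v in range(r, h - r):
        for u in range(r, w - r):
            code = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the center pixel
                    code = (code << 1) | (img[v + dy][u + dx] < img[v][u])
            out[v][u] = code
    return out
```

Because the codes depend only on the local intensity ordering, the resulting matching cost (Hamming distance between codes) is robust to the illumination changes common in outdoor fields.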
Time spent in individual parts.
Qualitative results
Figure 8 illustrates the disparity maps generated by several 3D mapping methods in four different scenarios. While all four methods can reconstruct typical agricultural scene elements, including cut/uncut crop edges, lodging areas, and obstacles, our method shows a noticeable improvement and is well suited for navigation and harvesting tasks. This highlights the effectiveness of our method in agricultural scenarios, where accurate and efficient 3D perception is critical for the successful operation of combine harvesters and other agricultural machinery.

Qualitative comparison of different 3D mapping methods: the left images of four stereo image pairs (first column), BM (second column), ELAS (third column), SGM (fourth column), and our method (fifth column).
Discussion
The design of a vision-based 3D perception system is deeply task-dependent and requires full consideration of vehicle motion types, environmental characteristics, accuracy requirements, and computational resources. In this work, we aim to reconstruct the 3D geometry of a crop field from a vehicle-mounted stereo camera. We investigate how to extract informative point cues from images of highly textured crop fields and fuse these cues into a Bayesian inference to calculate the disparity value of each pixel. Experimental results demonstrate that our method achieves a good trade-off between accuracy and efficiency.
A shortcoming of our method is the sensing range, which is limited by the fixed baseline length of the stereo camera. One effective solution is to leverage variable baselines in multiview stereo exploiting the vehicle motion, which requires the integration of our algorithm into a VO or SLAM system. Technically, the disparity map can be seen as a 2.5D representation of the crop field. To achieve complete 3D reconstruction and globally consistent representation, we need to compute the extrinsic parameters of the camera and transform all 3D points from camera coordinates to world coordinates. As a feature-based 3D mapping method, the core idea of our method is to leverage a sparse point cloud to infer a dense disparity map, which makes our method easy to integrate into feature-based VO or SLAM systems.
In future research, we intend to optimize frame rates by transitioning from an x86 to an ARM platform with a GPU, enhance algorithmic robustness through diversified data collection and testing under varying conditions, and advance from basic sensing to a more cognitive understanding of the environment by integrating semantic information.
Conclusion
In this article, we propose an accurate and efficient dense 3D mapping method for combine harvesters using stereo cameras, which is essential for autonomous operation. The proposed approach first constructs a sparse 3D map as an a priori representation of the environment and then infers a dense disparity map based on a probabilistic inference model in the second stage. We have compared our method with three other state-of-the-art approaches, and experimental results demonstrate the effectiveness and efficiency of our method in challenging crop field environments. Our method relies solely on a stereo camera operating on a standard PC (without FPGA or GPU acceleration), offering significant advantages in cost, hardware setup, and power consumption. This makes it well-suited for agricultural mobile platforms with limited computational resources. We believe that vision-based perception systems hold great potential for autonomous agricultural vehicles, and we expect our method to be valuable for such mobile platforms.
