Introduction
Autonomous combine harvesters, which interact richly with their physical environment, require accurate three-dimensional (3D) layout knowledge of crop fields to facilitate navigation and harvesting tasks. For instance, auto-steering control relies on precise localization of crop rows or cut edges while travel speed adjustment relies on crop height, biomass, and terrain profile. Dense 3D mapping techniques involve measuring the geometric structure of an environment and have been extensively studied in both academia and industry. Mainstream methods for dense 3D mapping fall into two primary categories: LiDAR-based methods and camera-based methods. LiDAR sensors directly provide accurate depth information, enabling easy construction of 3D maps in the form of point clouds. However, their high cost and power consumption limit their widespread usage in agricultural vehicles with restricted computational resources. In contrast to LiDAR sensors, vision sensors are lightweight, cost-effective, passive, and provide rich information about the appearance, color and texture of the environment, making them easy to deploy on low-cost and power-efficient agricultural vehicles.
Despite the tremendous progress in vision-based 3D sensing, applying existing algorithms to autonomous combine harvesters makes it challenging to achieve satisfactory accuracy and efficiency. As shown in Figure 1(b), unstructured, natural crop fields pose specific challenges: on one hand, highly similar textures combined with uneven illumination conditions make visual measurement an intractable task; on the other hand, real-time dense mapping of large-scale scenes imposes a heavy computational and memory burden on a resource-constrained mobile platform. To tackle these challenges, carefully designed and specifically tuned visual-perception algorithms are needed to perform well in agricultural scenarios.

A dense 3D mapping example of a real crop field. (a) Experimental platform; (b) a real crop field; (c) depth map.
In this work, we present a feature-based, two-stage approach that enables fast and dense mapping of crop fields observed by a vehicle-mounted stereo camera. Our proposed approach models the 3D geometric scenes by combining Bayesian inference and distinctive point cues on images. Figure 1 exhibits the test vehicle, test environment (crop fields), and the output of our method. The primary advantage of the proposed method is that it establishes reliable sparse stereo matching with informative and discriminative key points. This, combined with rigorous validation to remove outliers, effectively handles the matching ambiguity problem. Extensive evaluation was performed on real crop field images, and the results demonstrate that our method achieves both accuracy and speed and is well-suited for 3D dense perception for combine harvesters.
Traditional feature-based Visual Odometry (VO) or Simultaneous Localization and Mapping (SLAM) systems track a sparse set of 3D points for state estimation. Although these sparse 3D points are insufficient for tasks involving interaction with the environment, they roughly represent the geometric structure of the 3D scene. Our method aims to densify the sparse 3D points, building upon our previous work, a feature-based VO system. 1 This approach is motivated by the fact that crop fields have rich textures, allowing a finite number of 3D points to be extracted reliably and rapidly. In this work, we construct sparse stereo correspondences using a combination of the FAST detector and the FREAK descriptor, both of which are computed from local intensity comparisons (order statistics) rather than raw intensity values. This makes them insensitive to illumination variations and noise. Additionally, descriptors based on intensity order are easy to implement, fast to match, and memory efficient. A notable property of the FREAK descriptor is its discriminative capacity, which, combined with an outlier rejection scheme, provides tolerance to the similar textures found in crop fields. 2
The primary contributions of our work can be summarized as follows:
We have developed a specialized 3D mapping algorithm for combine harvesters that effectively balances accuracy and speed, even under the computational constraints and complex conditions typical of agricultural settings. Our approach uniquely combines discriminative feature points with Bayesian modeling to achieve accurate dense disparity calculations in real time. Through extensive real-world testing in crop fields, we have rigorously validated our algorithm's performance, confirming its accuracy, speed, and robustness.
The remainder of this article is organized as follows: Section 2 presents related work on the vision-based environment perception for agricultural applications. Section 3 gives a detailed description of our approach, including a sparse 3D map construction procedure and the mathematical derivation of the Bayesian inference for dense disparity computation. Section 4 presents extensive experiments and gives ideas for future work. Finally, we conclude this work in Section 5.
Related work
The problem of mapping agricultural scenes involves obtaining 3D geometry information of the environment from 2D images, including terrain, plant traits, crop rows or cut edges, etc.3,4 Rovira-Más et al. 5 described the benefits of mapping a vehicle's surroundings for precision agriculture. They presented a perception system that generates 3D terrain maps of agricultural scenes using stereo cameras to create 3D point clouds and transform them into geodetic coordinates, which are then assembled into a global field map. Their results were validated in multiple field tests. Ro and Nn 6 introduced a visual perception system capable of mapping crop rows with a Fujifilm camera, suitable for spraying applications. Gai et al. 7 developed a vision-based under-canopy navigation and mapping system using a front-facing Time-of-Flight (ToF) camera. Their proposed system provided interrow vehicle localization data and generated crop field maps as occupancy grids. Test results demonstrated the feasibility of robotic navigation between crop rows using a ToF camera under crop canopies. Considering that ToF or structured-light cameras are susceptible to sunlight and monocular cameras are prone to scale drift, stereo cameras are promising sensors for large-scale agricultural scene sensing due to their ability to obtain depth information inexpensively.
Stereo vision-based 3D reconstruction is a popular topic in the computer vision and robotics communities. According to the taxonomy proposed by Scharstein, 8 most stereo vision algorithms can be classified into global and local methods, typically performing four steps: matching cost calculation, cost aggregation, disparity optimization, and disparity refinement. Global methods minimize a global energy function based on explicit smoothness assumptions; they achieve high accuracy but are usually impractical for real-time applications on resource-limited platforms. Local methods are generally fast but less accurate, and it is difficult to determine an adequate size for their local correlation windows. Hirschmüller 9 proposed an efficient algorithm, semiglobal stereo matching (SGM), which uses a hierarchical mutual information (HMI)-based matching cost and optimizes for consistency along image rows, columns, and diagonals. SGM achieves competitive performance in both accuracy and speed, and we compare it to our method.
In between these two general categories are feature-based algorithms that infer dense maps from sparse disparity maps by propagating reliable disparity values to neighboring pixels. Geiger et al. 10 proposed a binocular stereo approach, called efficient large-scale stereo matching (ELAS), which first computes a set of robustly matched support points using a 3 × 3 Sobel filter in the local area. These support points provide valuable priors for the disparity calculation of the remaining pixels, greatly reducing the search range and producing an accurate, dense 3D map. Jellal et al. 11 presented a modified ELAS, called LS-ELAS, which adds line segments when determining the support points, increasing performance and robustness on images with line/edge features.
Recent advances in deep learning have revolutionized computer-vision tasks, and learning-based dense 3D mapping algorithms have achieved impressive performance on the most popular datasets.12–14 However, this performance comes at the cost of high computational and memory requirements, rendering these approaches unsuitable for agricultural devices with limited processing power, especially for large-scale, high-resolution dense 3D mapping tasks.
Materials and methods
Overview
This section details the process of constructing a dense depth map of a crop field. The method is based on the idea that a finite number of distinctive 3D points can be obtained easily by reliable stereo matching, providing an accurate prior representation of the crop field in the form of a triangular mesh. The disparity of each pixel is then modeled via Bayesian inference, allowing efficient and robust generation of dense 3D maps. Figure 2 illustrates the flowchart and primary components of our method, which consists of two stages: sparse 3D point generation and dense disparity computation.

A flowchart of the proposed method.
Sparse 3D point generation
Our method begins with image undistortion and stereo rectification using known parameters, ensuring that the epipolar lines are aligned horizontally. Next, the input stereo image is converted to 8 bpp greyscale since the keypoint detection and description are based on pixel intensity. Lastly, we apply a Gaussian filter to normalize image brightness and reduce noise.
Keypoint extraction and description
We extract FAST keypoints from the preprocessed stereo images. We observe that when the FAST detector is applied directly to richly textured agricultural images, it generates excessive, clustered keypoints. To achieve a homogeneous keypoint distribution, we divide the stereo image into a two-dimensional grid with H rows and W columns (H and W are proportional to the image resolution). Keypoint detection and description are performed independently for each grid cell, which enables real-time performance through a parallel implementation on a multicore CPU. We use the Harris corner response function 15 to rank the features in descending order and retain the top 10 keypoints per cell. A small FAST threshold is employed to detect a sufficient number of keypoints in each cell, and an adaptive threshold on the Harris corner responses is used to preserve the desired number of keypoints. For each retained FAST keypoint, we compute its FREAK descriptor, which encodes the local image structure into a binary string in a discriminative manner.
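The grid-based selection described above can be sketched as follows (a minimal pure-Python illustration; the `Keypoint` container, the helper name, and the grid parameters are hypothetical, while the detector itself and the Harris response values are assumed to come from an existing library such as OpenCV):

```python
from dataclasses import dataclass

@dataclass
class Keypoint:
    x: float         # pixel column
    y: float         # pixel row
    response: float  # Harris corner response

def select_per_cell(keypoints, img_w, img_h, grid_w, grid_h, top_k=10):
    """Bucket keypoints into a grid_h x grid_w grid and keep the
    top_k strongest (by Harris response) in each cell."""
    cell_w = img_w / grid_w
    cell_h = img_h / grid_h
    cells = {}
    for kp in keypoints:
        key = (min(int(kp.y / cell_h), grid_h - 1),
               min(int(kp.x / cell_w), grid_w - 1))
        cells.setdefault(key, []).append(kp)
    selected = []
    for kps in cells.values():
        kps.sort(key=lambda k: k.response, reverse=True)
        selected.extend(kps[:top_k])  # cap per cell for homogeneity
    return selected
```

Because each cell is processed independently, the per-cell loop parallelizes naturally across CPU cores, which is what keeps this stage real time in the C++ implementation.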
Sparse stereo matching
After keypoint detection and description, the next step is to perform stereo matching, as shown in Figure 3. For each FAST keypoint in the left image, we search for its correspondence in the right image along the same row, as the input stereo image is rectified. We use the Hamming distance to compute the dissimilarity of two descriptors (i.e., the number of bits that differ), which can be computed very quickly with an XOR and a bit-count operation on modern CPUs supporting the Streaming SIMD Extensions instruction set. To further increase correspondence robustness, the distance ratio test 16 is used to reject ambiguous correspondences for which the best matching distance is not sufficiently smaller than the second-best.
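The matching step can be illustrated with a minimal sketch (hypothetical helper names; descriptors are shown as small Python integers for brevity, whereas a real FREAK descriptor is a 512-bit string, and the 0.8 ratio threshold is an assumed value, not the paper's tuned parameter):

```python
def hamming(d1: int, d2: int) -> int:
    """Hamming distance between two binary descriptors stored as ints:
    XOR the bit strings, then count the set bits."""
    return bin(d1 ^ d2).count("1")

def match_ratio_test(desc_left, descs_right, ratio=0.8):
    """Return the index of the best right-image candidate, or None when
    the best distance is not clearly smaller than the second best."""
    dists = sorted((hamming(desc_left, d), i)
                   for i, d in enumerate(descs_right))
    if not dists:
        return None
    if len(dists) == 1:
        return dists[0][1]
    best, second = dists[0], dists[1]
    if best[0] < ratio * second[0]:
        return best[1]
    return None  # ambiguous correspondence rejected
```

In the C++ implementation the XOR/bit-count pair maps to a single SIMD population-count, which is why Hamming matching of binary descriptors is so much faster than L2 matching of float descriptors.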

FAST keypoints are detected and matched in stereo frame.

Illustration of the stereo triangulation.
Dense disparity computation
Triangle meshes are constructed by applying Delaunay triangulation 17 to the 3D points obtained in the previous steps, as shown in Figure 5. The vertices of the triangles form a set of support points
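As a sketch of this step (using `scipy.spatial.Delaunay` in place of the C++ Triangle library that the implementation uses; the function name and the (u, v, disparity) row layout are assumptions for illustration):

```python
import numpy as np
from scipy.spatial import Delaunay

def build_support_mesh(support_points):
    """Triangulate matched support points in the left image plane.
    support_points: (N, 3) array-like of (u, v, disparity) rows; the
    mesh is built on the (u, v) pixel coordinates only."""
    pts = np.asarray(support_points, dtype=float)
    tri = Delaunay(pts[:, :2])   # 2D Delaunay on image coordinates
    return tri.simplices         # (M, 3) vertex indices per triangle
```

Each returned triangle carries the disparities of its three vertices, which is exactly the piecewise-planar prior used in the next stage.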

Probabilistic modeling illustration of disparity calculation for pixel
We followed Geiger's strategy 10 to model stereo matching as a probabilistic generative model, as shown in Figure 5. Assuming that the visual measurements
We calculate the per-pixel disparity
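The full per-pixel inference follows the probabilistic model above, whose equations are given in the text; as an illustration of one ingredient, the sketch below shows how a disparity prior at a pixel can be linearly interpolated from the three support points of its enclosing triangle (hypothetical function name; standard barycentric weights):

```python
def plane_disparity_prior(tri, d, u, v):
    """Linearly interpolate a disparity prior at pixel (u, v) from a
    triangle's three support points.
    tri: three (u, v) vertices; d: their three disparities."""
    (u0, v0), (u1, v1), (u2, v2) = tri
    det = (v1 - v2) * (u0 - u2) + (u2 - u1) * (v0 - v2)
    w0 = ((v1 - v2) * (u - u2) + (u2 - u1) * (v - v2)) / det
    w1 = ((v2 - v0) * (u - u2) + (u0 - u2) * (v - v2)) / det
    w2 = 1.0 - w0 - w1
    return w0 * d[0] + w1 * d[1] + w2 * d[2]  # barycentric blend
```

This prior is what restricts the per-pixel search to a narrow band of candidate disparities, which is the main source of the method's speed.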
We perform disparity refinement to correct invalid points. The left-right consistency test is employed to detect invalid pixels. For a pixel
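A minimal sketch of the left-right consistency test (hypothetical function name; the one-pixel tolerance is an assumed default, and plain nested lists stand in for image buffers):

```python
def lr_consistency(disp_left, disp_right, max_diff=1):
    """Invalidate pixels whose left and right disparities disagree.
    disp_left[v][u] should agree with disp_right[v][u - disp_left[v][u]];
    pixels failing the test are set to -1 (invalid)."""
    h, w = len(disp_left), len(disp_left[0])
    out = [row[:] for row in disp_left]
    for v in range(h):
        for u in range(w):
            dl = disp_left[v][u]
            ur = u - dl  # corresponding column in the right image
            if ur < 0 or ur >= w or abs(dl - disp_right[v][ur]) > max_diff:
                out[v][u] = -1  # failed the left-right test
    return out
```

Invalidated pixels are then filled during refinement, typically from valid neighbors along the scanline.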
Results
This section presents the evaluation of the proposed method in terms of accuracy and real-time performance on real crop field data recorded from a vehicle-mounted stereo camera. A wheeled combine harvester (Model: 4LZ-8F, World Co., Ltd, Danyang, China) was employed as the mobile platform. A stereo camera (S1010-IR-120/Mono, MYNT EYE, Beijing, China) was mounted on top of the cabin, in the middle of the combine harvester, in a forward-looking setup at a height of 3.06 m, as shown in Figure 6. Stereo images were acquired at 15 Hz with 1.5-megapixel resolution and cropped to different resolutions for evaluation. Table 1 shows the technical specifications of the MYNT EYE camera. Since the platform was not equipped with a laser scanner, ground-truth disparities were obtained with the MYNT EYE SDK (Software Development Kit), computed offline (without real-time constraints) on a desktop computer with the CUDA Toolkit.

MYNT EYE stereo camera installation diagram.
MYNT EYE technical specifications.
The proposed method was developed in C++ under Ubuntu 20.04, running on a standard laptop with an Intel Core i7-8550 (4 cores/8 threads @ 2.60 GHz) and 16 GB RAM. The MYNT EYE stereo camera was calibrated using the method proposed by Bradski 20 to obtain the intrinsic and distortion parameters. The OpenCV library is used for image I/O, contrast enhancement, and feature-related operations, and the Triangle library 17 was employed to construct the Delaunay triangle mesh. We compared our method with three typical algorithms: block matching (BM), SGM, 9 and ELAS, 11 which are freely available in the OpenCV library and were run with their default parameter settings.
Accuracy
The root-mean-squared error (RMSE) (equation (10)) and the percentage of bad matching pixels (BMP) (equation (11)) were used as metrics to quantitatively evaluate the accuracy of the proposed method. 21 RMSE computes the root mean squared error between the estimated disparity
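The two metrics can be sketched as follows (flat pixel lists for brevity; the exact forms are given by equations (10) and (11), and the 1-pixel BMP threshold used here is a common default, not necessarily the paper's setting):

```python
import math

def rmse(d_est, d_gt):
    """Root mean squared disparity error over all evaluated pixels."""
    n = len(d_est)
    return math.sqrt(sum((e - g) ** 2 for e, g in zip(d_est, d_gt)) / n)

def bad_matching_pixels(d_est, d_gt, tau=1.0):
    """Percentage of pixels whose absolute disparity error exceeds tau."""
    n = len(d_est)
    bad = sum(1 for e, g in zip(d_est, d_gt) if abs(e - g) > tau)
    return 100.0 * bad / n
```

RMSE penalizes large outliers quadratically, while BMP counts how many pixels are wrong at all, so the two metrics are complementary.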

Evolution of the average running time.
Average error for all methods on four stereo images (bold denotes the best results).
Abbreviations: BM: block matching; ELAS: efficient large-scale stereo matching; SGM: semiglobal stereo matching; RMSE: root mean square error.
Runtime
Dense 3D mapping must run in real time to be practical in the perception system of a combine harvester. To evaluate the real-time performance of the proposed method, four stereo image pairs with resolutions ranging from 0.5 MP to 1.5 MP were processed on a standard CPU. Additionally, a coarse estimation of the runtime of the other methods was performed under the same conditions. The average runtime over the four stereo images is reported for all methods in Figure 7. The results show that the proposed method can estimate a dense 3D map at 3 Hz on 0.5 MP images, which is sufficient for online perception of combine harvesters travelling at relatively slow speeds (0.8 to 2.0 m/s).
A coarse estimation of the execution time (in milliseconds) of the principal components was conducted, as shown in Table 3. Dense disparity computation is the most expensive task in the 3D reconstruction process, and its cost depends on the image resolution and the size of the Census transform window. For our method there is always a trade-off between accuracy and speed: increasing the image resolution or the Census transform window size improves accuracy, but at the cost of higher computational requirements.
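For reference, the Census transform mentioned above can be sketched as follows (pure Python over interior pixels only; the per-pixel cost grows quadratically with the window size, which is exactly the accuracy/speed trade-off discussed above, and window size 3 is an illustrative default):

```python
def census_transform(img, win=3):
    """Census transform: each interior pixel becomes a bit string of
    comparisons between its neighbors and itself (1 if neighbor is
    darker than the center). img: 2D list of intensities; border
    pixels keep code 0."""
    r = win // 2
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for v in range(r, h - r):
        for u in range(r, w - r):
            code = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the center pixel
                    code = (code << 1) | (img[v + dy][u + dx] < img[v][u])
            out[v][u] = code
    return out
```

Because the codes depend only on the local intensity ordering, the resulting matching cost (Hamming distance between codes) is robust to the illumination changes common in outdoor fields.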
Time spent in individual parts.
Qualitative results
Figure 8 illustrates the disparity maps generated by several 3D mapping methods in four different scenarios. While all four methods can reconstruct typical agricultural scene elements, including cut/uncut crop edges, lodging areas, and obstacles, our method shows a noticeable improvement and is well suited for navigation and harvesting tasks. This highlights the effectiveness of our method in agricultural scenarios, where accurate and efficient 3D perception is critical for the successful operation of combine harvesters and other agricultural machinery.

Qualitative comparison of different 3D mapping methods: the left images of four stereo image pairs (first column), BM (second column), ELAS (third column), SGM (fourth column), and our method (fifth column).
Discussion
The design of a vision-based 3D perception system is deeply task-dependent and requires full consideration of vehicle motion types, environmental characteristics, accuracy requirements, and computational resources. In this work, we aim to reconstruct the 3D geometry of a crop field from a vehicle-mounted stereo camera. We investigate how to extract informative point cues from images of highly textured crop fields and fuse these cues into a Bayesian inference to calculate the disparity value of each pixel. Experimental results demonstrate that our method achieves a good trade-off between accuracy and efficiency.
A shortcoming of our method is the sensing range, which is limited by the fixed baseline length of the stereo camera. One effective solution is to leverage variable baselines in multiview stereo exploiting the vehicle motion, which requires the integration of our algorithm into a VO or SLAM system. Technically, the disparity map can be seen as a 2.5D representation of the crop field. To achieve complete 3D reconstruction and globally consistent representation, we need to compute the extrinsic parameters of the camera and transform all 3D points from camera coordinates to world coordinates. As a feature-based 3D mapping method, the core idea of our method is to leverage a sparse point cloud to infer a dense disparity map, which makes our method easy to integrate into feature-based VO or SLAM systems.
In future research, we intend to optimize frame rates by transitioning from an x86 to an ARM platform with a GPU, enhance algorithmic robustness through diversified data collection and testing under varying conditions, and advance from basic sensing to a more cognitive understanding of the environment by integrating semantic information.
Conclusion
In this article, we propose an accurate and efficient dense 3D mapping method for combine harvesters using stereo cameras, which is essential for autonomous operation. The proposed approach first constructs a sparse 3D map as an a priori representation of the environment and then infers a dense disparity map based on a probabilistic inference model in the second stage. We have compared our method with three other state-of-the-art approaches, and experimental results demonstrate the effectiveness and efficiency of our method in challenging crop field environments. Our method relies solely on a stereo camera operating on a standard PC (without FPGA or GPU acceleration), offering significant advantages in cost, hardware setup, and power consumption. This makes it well-suited for agricultural mobile platforms with limited computational resources. We believe that vision-based perception systems hold great potential for autonomous agricultural vehicles, and we expect our method to be valuable for such mobile platforms.
