Introduction
Autonomous driving technology enhances the safety and reliability of engineering vehicles and has garnered significant attention in the domains of resource extraction and transportation. Environmental sensing is crucial for the autonomous operation of engineering vehicles, necessitating access to extensive, accurate environmental data to facilitate obstacle avoidance and local path planning. Currently, widely used environmental sensors include monocular vision, binocular vision, depth cameras, radar, and LiDAR. LiDAR's most significant advantage lies in its capability to rapidly acquire highly accurate point clouds over extensive areas, independent of lighting conditions. However, classifying and identifying objects based on features like non-uniform density and unstructured distribution remains a challenge in using LiDAR point clouds for environmental perception. The storage and real-time computation of large amounts of point cloud data are additional challenges. Three characteristics of unstructured environments make these tasks particularly difficult:
(a) High dispersion of data: Point clouds in unstructured environments often exhibit high levels of dispersion due to the variability and unpredictability of terrain features, complicating traditional filtering and segmentation tasks.
(b) Noise and clutter: Unstructured settings are prone to noise and clutter from natural elements such as leaves, uneven ground, or non-static objects, which can mislead detection and classification algorithms.
(c) Lack of consistency: Unlike structured environments, objects in unstructured environments lack consistency in shape, size, and distribution, making the direct application of traditional methods challenging.
Ding et al. 1 developed a dual-frequency continuous-wave radar method for target detection that leverages differential directional radar to integrate target information from various detection angles. This approach enhances the richness of the target data, facilitates the accurate identification of potential moving targets, and effectively suppresses mid-frequency interference, which is crucial for broad-spectrum detection applications. Yuan et al. 2 introduced a novel obstacle detection and path tracking strategy that combines monocular camera data with ranging radar input. By fusing laser and vision data, this method enables mobile robots to track targets effectively while concurrently addressing the challenges of robot localization, target tracking, and map construction. Yoon and Park 3 proposed an ultrasonic localization technique using a genetic algorithm in structured environments, designed to prevent ultrasonic signal collision and thus maximize localization accuracy. Wenzl et al. 4 investigated a LiDAR sensor network tailored for the autonomous tracking of pedestrians within a surveillance area, introducing a sophisticated decentralized track fusion architecture for enhanced multi-target detection and tracking. Vakulya and Simon 5 developed a neural network-based model for acoustic sensor-target detection, designed to compensate for co-measurement and systematic errors, thereby ensuring reliable results even when traditional consistency function-based algorithms fail. Despite these advancements, these target detection and localization techniques have yet to be validated in unstructured environments.
In traditional point cloud analysis within unstructured environments, the disorganized characteristics of LIDAR point clouds pose significant challenges to both the accuracy and efficiency of data processing. Moreover, managing the storage and real-time operation of substantial point cloud volumes continues to be a formidable challenge.
Due to significant vibrations generated when engineering vehicles traverse unstructured surfaces (as evidenced by RTK data showing the amplitude of vibrations on such surfaces), the collected point cloud data are highly dispersed. Simple coordinate threshold methods and traditional filtering techniques struggle to filter these data reliably. In response to the challenge of rapidly removing complex noise in point clouds under unstructured conditions, this paper introduces an improved statistical filtering algorithm for point clouds. Within this algorithm, parameters control the neighborhood size, the standard-deviation multiplier, and the stringency of the selection criteria. These settings greatly influence the dispersion and spatial distribution of the processed point cloud, as demonstrated in the experimental section.
Machine learning is widely applied in sensor-based object classification and recognition to enhance accuracy.6–9 Accordingly, this study presents a method that utilizes a convolutional neural network (CNN) to segment original LiDAR point clouds into individual targets, followed by feature extraction and classification of these targets. In most environments, objects can be assumed to be perpendicular to the ground, allowing for the separation of ground and non-ground points in the LiDAR data. The remaining non-ground points are then projected onto a plane and clustered into distinct targets; subsequently, this plane is rasterized into neatly arranged cells containing the corresponding scatter points. Connected components are then consolidated, and an inverse mapping from the plane segments unorganized three-dimensional points into distinct objects with unique labels. This method eliminates redundant iterations in the target segmentation process, thereby improving computational efficiency. This study examines three types of objects: trees, pedestrians, and material piles. The proposed feature extraction and classification method is applicable in most unstructured environments and supports decision-making for autonomous vehicles, thus enabling autonomous driving in unknown environments.
The contributions of this paper are as follows:
(1) In response to the challenge of rapidly removing complex noise in point clouds under unstructured conditions, this study introduces an improved statistical filtering algorithm, enabling rapid filtering of point clouds in unstructured environments;
(2) To address the difficulty of rapidly and accurately recognizing three-dimensional point clouds in unstructured environments using traditional methods, this paper proposes a method of projecting three-dimensional point cloud data onto a two-dimensional plane for target cloud detection. Furthermore, a convolutional neural network model specifically designed for target cloud detection is introduced, which demonstrates superior performance compared to other network models;
(3) The proposed methods have been systematically validated and tested through experimentation.
The rest of this paper is organized as follows: Section II presents the related existing research. Section III comprehensively describes the data filtering method for LiDAR point cloud and the construction of a neural network model. Section IV carries out the experimental design and data collection. Section V verifies the effectiveness of the filtering algorithm proposed in this paper by comparing it with other filtering methods and discusses the pros and cons of the proposed neural network-based classification method. Section VI concludes the paper by summarizing the advantages and disadvantages of the proposed method.
Related works
Point cloud filtering
Research has demonstrated that morphological filters are capable of eliminating target points, and the application of morphological operations with small windows effectively removes minor ground objects, such as individual trees, thereby producing a surface that more closely approximates the ground level. Huang 10 introduced a filtering algorithm that adapts to the inherent properties of point clouds gathered by airborne LiDAR. This method adjusts the filtering window based on point cloud density and gradient differences above and below the surface. However, its reliance on the specific characteristics of 3D point clouds limits its general applicability. Sithole and Vosselman 11 developed a filtering algorithm based on altitude difference segmentation that performs well in structured environments. Yet, in unstructured settings, the significant variance in altitude differences within the 3D point clouds leads to suboptimal outcomes. Furthermore, a progressive morphology filter was introduced in 12 for isolating non-terrestrial LIDAR signals by incrementally increasing the filter window size. This technique, which applies a threshold of altitude difference to exclude vehicles, vegetation, and buildings while preserving the ground, was evaluated using datasets from both mountainous and urban environments. Despite its advancements, this method still suffers from certain inaccuracies, including false positives and omissions.
Point cloud segmentation
Chen et al. 13 divided a complete LiDAR point cloud scene into uniformly distributed 3D voxels and applied feature coding to elucidate the characteristics of each voxel. Yang et al. 14 encoded each voxel as a placeholder and projected oriented 2D bounding boxes in aerial views of LiDAR data. Hao and Wang 15 developed an object classification algorithm tailored for complex environmental scenes, employing Gaussian mapping to segment the point cloud and reconstruct scene topology. However, this method was limited to objects composed solely of planes and overlooked the occlusion challenges inherent in 3D objects. Yang et al. 16 introduced a semantic feature point alignment method, utilizing intersections of feature lines with the ground as semantic points and combining geometric constraints with semantic information for feature point matching to achieve alignment. Despite its sophistication, this approach lacks a consistent method for global alignment due to the complexity of semantic feature point extraction. Hamraz et al. 17 proposed a tree segmentation strategy using digital surface models to differentiate the point cloud into upper and multiple understory layers by analyzing the vertical distribution of overlapping LiDAR points. Broggi 18 estimated the absolute velocity of targets through visual range estimation, constructed a voxel-based comprehensive 3D map from sampled point clouds, segmented it into clusters using voxel filling, and labeled these clusters as stationary or moving targets. However, vision-based 3D engine implementations are hampered by visibility constraints, resulting in sparse parallax maps. Zhao et al. 19 implemented a geometric segmentation algorithm to distinguish between target and ground areas in LiDAR data, then deeply classified corresponding images captured by the camera using a fuzzy logic inference framework to integrate LiDAR data with imagery for frame-by-frame analysis. Zeng et al. 20 identified candidate keypoints with high local heteroskedasticity values by computing shape indices and dual Gaussian weighted metrics for each 3D point, facilitating 3D model identification and alignment. This approach was effectively employed by numerous teams during the 2007 DARPA Urban Challenge to segment point clouds and detect vehicles on the track.21–23 Himmelsbach et al. 24 devised a rapid segmentation method for extensive long-range 3D point clouds, enabling local ground plane estimation and swift 2D connected component labeling by splitting the problem into two simpler subproblems and projecting 3D points onto a 2.5D mesh anchored to the ground.
CNN
In recent years, the landscape of feature extraction in machine vision has been significantly enriched by the advent of deep learning technologies, including autoencoders, convolutional neural networks (CNNs), restricted Boltzmann machines, and deep networks.25–27 These methods have become pivotal in enhancing the accuracy of sensor-based object classification and identification within the field of object recognition.
Zeng et al. 28 demonstrated the effectiveness of a CNN-based multi-feature fusion learning method specifically tailored for the retrieval of nonrigid 3D models. Wang et al. 29 developed a novel graphical convolutional kernel that selectively concentrates on the most pertinent segments of a point cloud, capturing essential structural features for semantic segmentation. Pang and Ulr 30 advanced 2D classification techniques for point clouds using CNNs, standardizing the size of training samples to ensure uniform processing across populated boundaries, thereby enabling the classifier to scan for all object classes within a consistently sized window.
Further, Song et al. 31 transformed target point clouds into a Hough space using a Hough transform algorithm, subsequently rasterizing them into a series of uniform meshes. They then quantified the accumulator in each mesh, employing CNNs for the classification of 3D objects. Rangel et al. 32 leveraged spatial information within 3D data to segment targets in images and conducted semi-supervised learning on each targeted image. This approach utilized the robust classification capabilities of CNNs to effectively generalize categories characterized by high intra-class variation.
Additionally, a novel nonrigid CNN-based multi-feature fusion learning model was introduced, 33 highlighting a growing trend in 3D object classification that integrates various point cloud representations and CNN models to generate comprehensive identifying information about objects. 34
Methods
The operating environments of construction vehicles are typically complex and unstructured, where targets often lack regular shapes, colors, textures, and other distinct features, complicating the task of target recognition. Additionally, rigorous driving practices induce significant wide-band vibrations in construction vehicles, which introduce substantial noise into the perceptual data collected by sensors such as LiDAR, cameras, and range sensors. This noise substantially interferes with the accuracy of target classification.
To address these challenges, this study first implements a denoising process on the collected data to enhance data quality. Subsequently, the cleaned data is used to identify and classify targets. Specifically, this paper applies convolutional neural networks (CNN) to classify point cloud data representing trees, pedestrians, and piles, which are common elements within construction sites. The effectiveness of this approach is validated through experimental methods that are designed to test the robustness and accuracy of the CNN model under the challenging conditions typical of construction environments.
Proposed framework overview
In response to the challenges of target recognition in unstructured environments characterized by significant noise, this study introduces a hybrid model that integrates an enhanced filtering algorithm with convolutional neural networks (CNNs), as detailed in Figure 1. The model comprises four primary components:
1) Data acquisition. A 16-line LiDAR is used for point cloud data acquisition. Because this paper aims to classify point clouds in unstructured environments, an RTK system is also employed to record the variation of the vehicle's three-axis attitude angles, both to illustrate the complexity of the environment and to prepare for the next processing step.
2) Point cloud filtering. A reasonable filtering process is needed to prepare for the subsequent segmentation because of the high dispersion of the point cloud collected in unstructured environments. The traditional statistical filtering algorithm cannot achieve satisfactory results for discrete point clouds; therefore, this paper proposes an improved filtering method.
3) Point cloud segmentation. Firstly, the ground is separated from the filtered point cloud. Then, the point cloud is projected into the x–z plane to obtain target clusters and their 2D geometry features. Finally, the targets are labeled.
4) Classification. CNNs consist of multiple convolutional and pooling layers with a deep architecture that can automatically extract key features of the data and reveal the characteristics of the source data. In this paper, we will take advantage of this capability of CNNs to accomplish the classification requirements by training the neural network based on the processed point cloud.

Framework of the proposed method.
Each section is discussed in detail below.
Data acquisition
The data acquisition system employed in this study comprises a loader, a LiDAR, a Real-Time Kinematic (RTK) system, an acquisition card, and a computer. The LiDAR is directly interfaced with the computer, while the RTK system connects through the acquisition card. To assess the robustness and applicability of the proposed algorithm, experiments were designed under six distinct conditions in an unstructured environment, each varying in velocity and route; detailed descriptions of these conditions are provided in the Experimental Section.
LiDAR point clouds serve as the primary data source for training the neural network and validating the algorithm’s performance. Concurrently, vehicle condition data, including pitch, roll, and yaw angles, are utilized to delineate the contrasts between structured and unstructured environments where engineering vehicles operate. The high scanning frequency of the LiDAR yields tens of thousands of data points per second. In an effort to optimize computational resources and enhance processing efficiency, only the point cloud data within a 30-m radius centered on the LiDAR’s location is retained, effectively excluding points outside this defined area.
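The retention step above can be expressed as a simple mask. The sketch below is illustrative rather than the authors' implementation: it assumes each frame is an (n, 3) numpy array of x, y, z coordinates in the sensor frame and measures the 30-m radius in the horizontal plane.

```python
import numpy as np

def crop_to_radius(points, radius=30.0):
    """Keep only the points within `radius` meters of the LiDAR origin.

    points: (n, 3) array of x, y, z coordinates in the sensor frame.
    The radius is measured in the horizontal (x, y) plane here; whether
    the paper uses a planar or full 3D distance is not specified.
    """
    r = np.hypot(points[:, 0], points[:, 1])  # horizontal distance to sensor
    return points[r <= radius]
```

Applying such a crop before any further processing keeps the per-frame point count bounded regardless of how far the LiDAR can actually see.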
Point cloud filtering
A construction vehicle driving on an unstructured surface generates large vibrations (the vibration level of the surface can be inferred from the RTK data), leading to a large dispersion of the collected point cloud, which the coordinate threshold method and the traditional filtering method struggle to filter reliably. Thus, the point cloud is processed using the improved statistical filtering algorithm proposed in this paper. The specific improvements are as follows.
Assume that the point cloud set is $P = \{p_1, p_2, \ldots, p_n\}$. In the statistical filtering algorithm, the mean distance from each point $p_i$ to its $k$ nearest neighbors is computed as

$$\bar{d}_i = \frac{1}{k} \sum_{j=1}^{k} \lVert p_i - p_{i_j} \rVert,$$

and the mean $\mu$ and standard deviation $\sigma$ of these distances over the whole cloud are

$$\mu = \frac{1}{n} \sum_{i=1}^{n} \bar{d}_i, \qquad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\bar{d}_i - \mu\right)^2}.$$

Consequently, any points in a point cloud whose distances from their neighbors do not fall within the range defined by $(\mu - \alpha\sigma,\ \mu + \alpha\sigma)$, where $\alpha$ is the standard-deviation multiplier, are regarded as outliers and removed.

In response to the distinct noise characteristics typical in unstructured environments, this study advances the conventional single-pass statistical filtering approach to a multi-iteration filtering algorithm. This enhancement allows for varying levels of noise reduction by adjusting $k$ and $\alpha$, and the iterations are terminated once the standard deviation $\sigma$ of the Euclidean neighbor distances falls below a preset threshold.

The point cloud dataset is therefore filtered progressively, pass by pass, as summarized in the following algorithm.
Improved filtering algorithm.
Contrary to the conventional statistical filtering algorithms, the enhanced algorithm proposed in this study conducts multiple rounds of statistical filtering on highly dispersive noise within 3D point clouds. It regulates the iterations of filtering through a standard deviation threshold based on the Euclidean distance. This method demonstrates significant improvements over traditional statistical filters, yielding more precise outcomes. Furthermore, it establishes optimal conditions for subsequent point cloud segmentation and feature extraction processes in unstructured environments.
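The multi-pass scheme described above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter names (k, alpha, sigma_thresh, max_iter) are hypothetical, and the termination rule follows the description of iterating until the standard deviation of the Euclidean neighbor distances drops below a threshold.

```python
import numpy as np

def statistical_filter(points, k=20, alpha=1.0):
    """One pass of statistical outlier removal.

    points: (n, 3) array. Returns the inlier subset and the standard
    deviation of the mean k-nearest-neighbor distances.
    """
    # Brute-force pairwise Euclidean distances; a KD-tree would be used
    # for the tens of thousands of points in a real LiDAR frame.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Mean distance to the k nearest neighbors (column 0 is the point itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    d_mean = knn.mean(axis=1)
    mu, sigma = d_mean.mean(), d_mean.std()
    # Keep points whose mean neighbor distance lies within mu +/- alpha*sigma.
    keep = np.abs(d_mean - mu) <= alpha * sigma
    return points[keep], sigma

def iterative_filter(points, k=20, alpha=1.0, sigma_thresh=0.05, max_iter=10):
    """Repeat the statistical pass until the neighbor-distance spread
    falls below `sigma_thresh` or the cloud becomes too small."""
    for _ in range(max_iter):
        if len(points) <= k + 1:
            break
        points, sigma = statistical_filter(points, k, alpha)
        if sigma < sigma_thresh:
            break
    return points
```

Each pass tightens the cloud further, which is what lets the multi-iteration variant suppress the highly dispersive noise that a single pass leaves behind.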
Point cloud segmentation
In the context outlined, a pragmatic strategy involves approximating most physical entities as orthogonal to the terrestrial plane. Accordingly, the initial step entails segmenting the tridimensional point cloud by projecting it onto the x–z plane. This projection yields a contiguous plane of ground points, while other objects manifest as distinct components within the point cloud projection. Subsequently, employing a specified threshold, the ground points are sieved in the x–z plane, thereby filtering them effectively.
To isolate non-ground points into discrete, interconnected clusters, a methodical approach involves rasterizing the projection points into uniform square units. These units are subsequently amalgamated into autonomous objects, discerned based on the salient geometric attributes projected onto the x–z plane. Following this segmentation process, the identified clusters are then reconstituted into their native tridimensional configurations, with accompanying categorical labels.
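The rasterize, merge, and relabel steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes the non-ground points as an (n, 3) numpy array, treats columns 0 and 2 as the x–z projection plane, and uses 4-connected component labeling on the occupancy grid; the function names and cell size are hypothetical.

```python
import numpy as np

def label_grid_components(cells):
    """4-connected component labeling on a boolean occupancy grid (BFS)."""
    labels = np.zeros(cells.shape, dtype=int)
    current = 0
    for i in range(cells.shape[0]):
        for j in range(cells.shape[1]):
            if cells[i, j] and labels[i, j] == 0:
                current += 1
                labels[i, j] = current
                stack = [(i, j)]
                while stack:
                    r, c = stack.pop()
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < cells.shape[0] and 0 <= nc < cells.shape[1]
                                and cells[nr, nc] and labels[nr, nc] == 0):
                            labels[nr, nc] = current
                            stack.append((nr, nc))
    return labels, current

def segment_objects(points, cell=0.25):
    """Project non-ground points onto the x-z plane, rasterize into square
    cells, merge connected cells, and map the labels back to the 3D points."""
    xz = points[:, [0, 2]]                                   # planar projection
    idx = np.floor((xz - xz.min(axis=0)) / cell).astype(int)  # cell indices
    grid = np.zeros(idx.max(axis=0) + 1, dtype=bool)
    grid[idx[:, 0], idx[:, 1]] = True                         # occupancy grid
    labels, n = label_grid_components(grid)
    return labels[idx[:, 0], idx[:, 1]], n  # per-point object labels
```

Because the connected-component pass runs on the 2D grid rather than on the raw 3D points, no repeated nearest-neighbor iterations over the cloud are needed, which is the efficiency gain the segmentation step relies on.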
Target detection

Proposed network specific structure.
In addressing deep learning tasks involving image inputs, Convolutional Neural Networks (CNNs) exhibit exceptional capabilities. However, achieving high-precision outcomes typically necessitates extensive datasets. Given the constraints posed by the limited dataset described in this study, it is impractical to train a completely new, expansive CNN from scratch. Therefore, we have designed a tailored 25-layer CNN architecture, optimally suited to our dataset’s scale. The configuration of this network is detailed in Figure 2.
The network comprises 25 layers: 5 convolutional layers, 3 pooling layers, 7 activation layers, 3 fully connected layers, 1 softmax layer, 2 normalization layers, and 2 dropout layers (the dropout probability is set to 0.5). The sizes and numbers of the convolutional kernels can be read from Figure 2. Specifically, the first convolutional layer (Conv1) uses an 11 × 11 kernel with a stride of 4 and padding of [1, 2]; Maxpool1 uses a 3 × 3 kernel with a stride of 2 and no padding. The second convolutional layer (Conv2) uses a 5 × 5 kernel with a stride of 1 and padding of [2, 2]; Maxpool2 uses a 3 × 3 kernel with a stride of 2 and no padding. The third convolutional layer (Conv3) uses a 3 × 3 kernel with a stride of 1 and padding of [1, 1], and the fourth and fifth convolutional layers (Conv4 and Conv5) use the same settings; Maxpool3 uses a 3 × 3 kernel with a stride of 2 and no padding. Finally, three fully connected layers complete the network, as shown in Table 1.
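As a rough check on this layer chain, the spatial dimensions can be traced with a few lines of code. This sketch assumes a 224 × 224 input and symmetric padding in place of the asymmetric values quoted above; both are assumptions, since the paper's input resolution is not stated here.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, padding) for the convolution and pooling stages;
# symmetric padding is used here for simplicity.
layers = [
    ("conv1", 11, 4, 2), ("maxpool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("maxpool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("conv5", 3, 1, 1),
    ("maxpool3", 3, 2, 0),
]

size = 224  # assumed input resolution
trace = []
for name, k, s, p in layers:
    size = conv_out(size, k, s, p)
    trace.append((name, size))
# `size` is now the side length of the final feature map fed to the
# fully connected layers.
```

Under these assumptions the chain runs 224 → 55 → 27 → 27 → 13 → 13 → 13 → 13 → 6, an AlexNet-like progression consistent with the kernel and stride settings listed above.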
Specific parameters of the proposed network design.
The design rationale is as follows: (1) the addition of the ReLU activation function

$$f(x) = \max(0, x),$$

where $x$ denotes the pre-activation output of a layer; ReLU is computationally cheap and avoids vanishing gradients for positive inputs.

When the model is being trained, multicategory cross entropy is chosen as the loss function:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c},$$

where $N$ is the number of training samples, $C$ is the number of classes, $y_{i,c}$ is the one-hot ground-truth label, and $\hat{y}_{i,c}$ is the predicted probability that sample $i$ belongs to class $c$.
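For reference, the softmax and multicategory cross-entropy computation can be sketched in a few lines of numpy; this is a minimal illustration of the loss, not the training code used in the study.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean multiclass cross-entropy for integer class labels:
    L = -(1/N) * sum_i log p_i[y_i]."""
    p = softmax(logits)
    n = len(labels)
    return -np.log(p[np.arange(n), labels]).mean()
```

With a uniform prediction over the three classes used in this paper (trees, pedestrians, piles), the loss equals log 3 ≈ 1.10; confident correct predictions drive it toward zero.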
Experiment
Although the purpose of this paper is to improve the sensing capability of unmanned engineering vehicles, no fully functional unmanned engineering vehicle was available; thus, the experiments were conducted on a manned engineering vehicle. In addition, for safety reasons, the experiments were conducted on a closed construction site containing both paved and native (unpaved) surfaces, as shown in Figure 5.
Experiment setup
The experimental setup, as depicted in Figure 3, employed a ZL10 loader as the construction vehicle. The instrumentation included a LiDAR sensor for environmental detection and an RTK sensor for monitoring vehicle conditions, detailed in Tables 2 and 3, respectively. Both sensors were strategically mounted atop the loader to ensure optimal data acquisition, as illustrated in Figure 3(a). The LiDAR unit was interfaced directly with a computer, capturing high-resolution point clouds in the ROS environment. In parallel, the RTK system, connected through an acquisition card, recorded and processed vehicle dynamics data using LabVIEW. This setup facilitated the precise tracking of local coordinates and the loader’s orientation—roll, pitch, and yaw—as shown in Figure 3(b), thereby indirectly mapping the variability of the road surface conditions.

Experiment setup: (a) is the experimental vehicle, (b) is the local coordinate system and the vehicle-related state (roll, pitch, and yaw).
Parameters of LiDAR.
Parameters of RTK.
The experimental field is the typical unstructured scenario shown in Figure 4. In this study, to evaluate the effectiveness of the filtering algorithm and to collect as many samples as possible to train the neural network, the vehicle is driven on different paths at various speeds to obtain several vibration levels. The LiDAR data obtained under each operating condition are frame intercepted, filtered, point cloud segmented, and then used for neural network training.

Unstructured experimental field and its point cloud.
Considering the vehicle’s driving ability, scenario limitations and safety requirements, the selected experimental speeds range from 1.7 to 3.7 km/h for the six experimental conditions, respectively, and the driving routes are a combination of straight and curved paths. The different routes for each of the six conditions are shown in Figure 5, starting at the upper-left corner and ending at the lower-right corner of the experimental field. Although the routes of the first four conditions are similar, the stochastic character of the unstructured environment makes the paths between the conditions not identical. The six conditions have different travel routes and velocities, so the dispersion of the resulting point cloud is not the same.

Experiment data: (a) is the driving routes for the six experimental conditions, (b–d) are the variations of pitch, roll, and yaw with time respectively.
The routes of the last two conditions completely avoid the routes of the first four conditions. The data from the first four conditions are included as the training set for the neural network, and the data from the last two conditions are used as the validation set to avoid the possibility of having data in the validation set that are similar to the data in the training set. In the filtering process, data from all operating conditions are engaged in the analysis.
Results analysis
Comparison of filtering effects
The point clouds are processed by the traditional and improved statistical filtering algorithms.
The filtering results are shown in Figures 6 to 11; in each figure, (a) shows the original point clouds, which are considerably dispersed, especially in the regions framed by ellipses, rectangles, triangles, and circles, and we focus on the performance of the algorithms in these regions. (b) shows the results after processing by the traditional filtering algorithm, and (c) shows the results after processing by the algorithm in this paper. In these figures, circles mark regions where both algorithms obtain favorable results; ovals mark regions where only the improved algorithm yields satisfactory results; rectangles mark regions where neither algorithm gives promising results but the proposed algorithm outperforms the traditional one; and triangles mark regions where the proposed algorithm is worse. As can be seen in Figures 8 and 9, conditions III and IV show weak results, possibly because the high velocities of the vehicle and the large fluctuations of the road surface make the dispersion of the original point cloud too large, causing considerable difficulty in filtering.

Comparison of the filtering results under the condition I: (a) is raw data, (b) is the filter result of the original algorithm, and (c) is the filter result of the improved algorithm.

Comparison of the filtering results under the condition II: (a) is raw data, (b) is the filter result of the original algorithm, and (c) is the filter result of the improved algorithm.

Comparison of the filtering results under the condition III: (a) is raw data, (b) is the filter result of the original algorithm, and (c) is the filter result of the improved algorithm.

Comparison of the filtering results under the condition IV: (a) is raw data, (b) is the filter result of the original algorithm, and (c) is the filter result of the improved algorithm.

Comparison of the filtering results under the condition V: (a) is raw data, (b) is the filter result of the original algorithm, and (c) is the filter result of the improved algorithm.

Comparison of the filtering results under the condition VI: (a) is raw data, (b) is the filter result of the original algorithm, and (c) is the filter result of the improved algorithm.
Table 4 and Figure 12 show the number of raw point clouds and the number of point clouds obtained with traditional and improved statistical filtering for each operating condition.
Comparison of filtering results.

Comparison of the average data volume of each working condition.
The point cloud processed by the improved algorithm contains less than 85% of the data volume retained by the traditional algorithm, reaching a minimum of 68.05%, which suggests that the improved algorithm is more efficient and robust. The velocities of conditions III and IV are greater than those of the other four conditions, and the pitch, yaw, and roll values fluctuate widely; thus, the filtering is less effective under these conditions.
Segmentation
Figure 13 shows the results of 3D LiDAR point cloud segmentation in the experimental scenario with different types of non-terrestrial objects rendered in different colors. The unstructured environment in which most construction machinery operates typically includes objects such as trees, pedestrians and stockpiles; thus, the goal of this part is to identify these three types of objects. In accordance with the segmentation result, the target features are extracted from the point cloud projected onto the x-z plane by iterating over all target points, as shown in Figure 14, where (a) to (c), (d) to (f), and (g) to (i) are the examples of stockpiles, pedestrians and trees respectively.

Point cloud clustering in different scenarios.

Target point cloud segmentation.
Comparison of filtering algorithms
To further validate the superiority and adaptability of the method proposed in this paper, we compared it with the latest related methods focusing on the average error rate and execution time as shown in Figure 15. The average error rate was computed by averaging the results of 30 runs under each condition, as shown in the table. Although our model does not exhibit the best execution time—it ranks second—it meets the processing speed requirements. Importantly, our model achieves the best average error rate, demonstrating its superiority.

Dynamic error rate statistics of the model under different working conditions.
To further illustrate the error rates processed by different models, we conducted a dynamic error rate analysis under various conditions, as depicted in the figure. The results reveal that T. Yang’s model exhibited the highest dynamic error rate, exceeding 50%, followed by R. Heinzler’s model, which surpassed 45%. The model by Yan Zhi also reached 40%, whereas our model maintained a maximum error rate of only 15%, significantly lower than the other models. This further underscores the efficacy of our algorithm in handling point clouds in unstructured environments.
Model training
After the segmentation, the results are manually labeled to form a sample set. The training set is the point cloud for conditions I to IV, which contains 918 target objects consisting of 412 trees, 151 pedestrians, and 355 piles. The point cloud data from condition V and VI are used as the test set, which contains 640 objects consisting of 312 trees, 103 pedestrians, and 225 piles.
With the hyperparameters set to a learning rate of 10⁻⁴, a batch size of 40, and MaxEpochs of 20, the final prediction accuracy is 98.0% and the loss is 0.008 after 1200 iterations, as shown in Figure 16.

Training process of the proposed network.
In this study, the network demonstrated varied object recognition capabilities, achieving accuracies of 98.64%, 97.96%, and 97.47% for trees, piles, and pedestrians, respectively. The superior performance in identifying trees and piles can be attributed to their more distinct geometric features, coupled with a higher prevalence of these objects in the training dataset. Conversely, the recognition of pedestrians proved slightly less accurate, reflecting the inherent challenges posed by their variable appearances and poses. Overall, the model achieved an average recognition accuracy of 98.0% across these three categories.
To further verify the effect of the improved filtering algorithm on target point cloud detection, we trained the network on point cloud data processed by the traditional filter and compared the results with those obtained using the improved filter; the results are shown in Figure 17. Compared with the traditional filtering algorithm, the detection accuracy of the proposed algorithm for trees, material piles, and pedestrians improved by 6.33%, 8.236%, and 6.93%, respectively. A possible reason is that the features of material piles are less distinct and more easily affected by surrounding noise, which motivates the improved filtering algorithm proposed in this paper.
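For reference, a classical statistical outlier removal filter of the kind used as the "traditional" baseline can be sketched as follows (a minimal pure-Python illustration, not the paper's improved algorithm; the function name, neighbourhood size, and threshold ratio are our own choices):

```python
import math
import statistics

def statistical_outlier_removal(points, k=3, std_ratio=1.0):
    """Baseline statistical outlier removal: keep a point if its mean
    distance to its k nearest neighbours is within
    (global mean + std_ratio * global std) of those mean distances."""
    mean_knn = []
    for p in points:
        d = sorted(math.dist(p, q) for q in points if q is not p)
        mean_knn.append(sum(d[:k]) / k)
    mu = statistics.mean(mean_knn)
    sigma = statistics.pstdev(mean_knn)
    thresh = mu + std_ratio * sigma
    return [p for p, m in zip(points, mean_knn) if m <= thresh]

# A tight cluster plus one far-away noise point:
cloud = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (0.1, 0.1, 0), (5, 5, 5)]
print(len(statistical_outlier_removal(cloud)))  # 4: the noise point is dropped
```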

Comparison between the original filter and the filter in this paper.
Comparison with other models
To adapt each popular network to the present classification task, it is modified minimally: only the softmax and classification layers after the last fully connected layer are removed, and a three-output fully connected layer with the same parameters as the final fully connected layer is added, while the structure, parameters, and weights of all other layers are kept unchanged. The training process is shown in Figure 18, and a comparison of the important parameters is given in Table 5. From Figure 18, all six models reach high accuracy, above 97%, but differences remain in the details. The proposed model varies above 98%, while the other five models do not reach 98%, with Inception v2 the lowest at around 97.7%; the proposed model therefore outperforms the others in terms of accuracy. Regarding the loss, all six models have rather small values: ResNet56, VGG16, and AlexNet fluctuate above 0.02; GoogLeNet and Inception v2 vary between 0.02 and 0.01; and the proposed model ranges between 0.015 and 0.005. The proposed model therefore also outperforms the others in terms of loss. As shown in Table 6, our model is preferable to the other models in terms of accuracy, loss, and time.
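The head swap described above can be sketched schematically (a toy illustration of the procedure, not the actual network code; layers are represented here as hypothetical (name, output-size) pairs):

```python
# Schematic head swap for transfer learning: remove the softmax and
# classification layers after the last fully connected layer, then append
# a new three-output fully connected layer; all other layers keep their
# structure, parameters, and weights.
def swap_head(layers, n_classes=3):
    trimmed = [l for l in layers if l[0] not in ("softmax", "classification")]
    trimmed.append(("fc_new", n_classes))  # new three-output head
    return trimmed

# Hypothetical tail of a pretrained 1000-class classifier:
tail = [("fc7", 4096), ("fc8", 1000), ("softmax", 1000), ("classification", 1000)]
print(swap_head(tail))  # [('fc7', 4096), ('fc8', 1000), ('fc_new', 3)]
```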
Comparison of the performance of different models.

Training process of different models.
Comparison of different models.
We conducted performance evaluation experiments for each network model, examining the false positive rate (FP), recall (RC), and precision (PR). The results are shown in Table 7, in which the two best performers for each of the three metrics are bolded. The proposed model performs best, with a false positive rate of 9.67%, a precision of 94.25%, and a recall of 93.62%, which again demonstrates its superiority.
Performance statistics of each model.
Comparison with existing models
Table 8 summarizes the LiDAR point cloud classification models proposed in recent papers. These experiments were conducted under different conditions, with different samples and computer configurations, and each article has a different research interest and focus; thus, a direct comparison of accuracy is not meaningful and is not the real purpose of Table 8. From this comparison, we can nevertheless conclude that the proposed model is capable of classifying targets in unstructured environments.
Summary and comparison of current obstacle detection models.
Discussion and prospects
From the comparison of the results, we can conclude that the proposed classification model for unstructured environments, based on an improved filtering algorithm and a neural network, is valid. The model consists of four parts: data acquisition, point cloud filtering, point cloud segmentation, and classification. First, LiDAR and RTK are employed to collect unstructured environmental data. Second, the point cloud data are filtered to remove irrelevant noise, improve data availability, speed up network training, and improve classification accuracy. Next, the point cloud is projected onto the x–z plane to form a number of target clusters with their 2D geometric features, and these clusters are back-projected to 3D coordinates to find the corresponding 3D point cloud for each cluster, from which the volume, density, and other features of each target are derived. Finally, the resulting training and test sets are used for transfer learning on an improved CNN for target detection. To verify the performance of this classification model, experiments were designed and a large amount of data was recorded to train and test the model. The results show that the proposed algorithm has good classification accuracy, reliability, and efficiency.
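The projection and back-projection step can be sketched as follows (a simplified single-pass grouping for illustration only; the paper's actual segmentation is more elaborate, and the distance threshold here is our own assumption):

```python
import math

def project_and_cluster(points, eps=0.5):
    """Sketch of the pipeline step: project 3-D points onto the x-z plane,
    group the 2-D projections by distance, then back-project each group to
    recover its 3-D member points (indices keep the correspondence)."""
    proj = [(p[0], p[2]) for p in points]  # drop y: x-z plane projection
    labels = [-1] * len(points)
    next_label = 0
    for i in range(len(proj)):
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
        for j in range(i + 1, len(proj)):
            if math.dist(proj[i], proj[j]) <= eps:
                labels[j] = labels[i]  # simplified: greedy, not transitive
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(points[idx])  # back-projection
    return clusters

# Two well-separated groups in x-z, regardless of height y:
cloud = [(0, 1, 0), (0.2, 2, 0.1), (5, 1, 5), (5.1, 3, 5.2)]
print(len(project_and_cluster(cloud)))  # 2
```

Per-cluster features such as bounding-box volume and point density can then be computed from each recovered 3-D group.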
Nevertheless, the proposed model has some areas for improvement. First, its real-time performance has not been validated, because the data in this paper are processed offline. Second, although the algorithm is robust in most cases, its robustness under high-speed conditions still needs improvement, which is a task for future work.
The following is a vision for future work: (a) Development of advanced filtering algorithms: we plan to develop more efficient filtering algorithms capable of addressing the high variability and noise typical of unstructured environments, including adaptive filtering techniques that adjust dynamically to the characteristics of the data, potentially improving precision and computational efficiency. (b) Integration of machine learning and AI: our future research will delve deeper into the integration of cutting-edge machine learning and artificial intelligence strategies to enhance point cloud recognition accuracy in unstructured settings; specifically, we aim to leverage deep learning methodologies to augment feature extraction and object classification in complex scenarios. (c) Standardization of unstructured-environment data: we intend to establish comprehensive datasets specific to the unstructured environments of engineering vehicles, alongside standardized processing and evaluation protocols, facilitating consistent and comparable results across studies and applications and thereby enhancing reproducibility and collaborative research.
