Abstract
Keywords
Introduction
Localisation and 3D reconstruction are fundamental problems in both robotics and computer vision, with applications spanning autonomous driving, building inspection and augmented reality. There are methods that focus on localisation (e.g. visual or lidar odometry, Structure-from-Motion (SfM), place recognition and relocalisation) and others that focus on 3D reconstruction/mapping (e.g. Multi-view Stereo (MVS) and occupancy mapping). In mobile robotics, both problems can be solved concurrently by Simultaneous Localisation and Mapping (SLAM) methods, which are our primary focus in this work. For large-scale outdoor environments, cameras and LiDARs are the most commonly used exteroceptive sensor modalities for environment perception tasks. Additionally, the Inertial Measurement Unit (IMU) is a common interoceptive sensor. The two mentioned exteroceptive sensor technologies have complementary characteristics: LiDAR captures long-range depth measurements that are accurate but sparse, while camera images capture texture with higher resolution.
The evaluation of outdoor SLAM systems has primarily focused on localisation accuracy; conversely, quantitative evaluation of the 3D reconstruction quality is often lacking. An important reason for this is the limited availability of high-quality ground truth. For 3D reconstruction, ground truth is typically collected using survey-grade Terrestrial LiDAR Scanners (TLS), which are expensive (Zhang et al., 2022). Compared to indoor scenes, where MVS and RGB-D SLAM systems are often evaluated, the large scale of outdoor environments makes TLS data collection laborious. As a result, many outdoor SLAM datasets do not include precise ground truth reconstructions from TLS and instead rely on ground truth trajectories from other sensors such as GNSS-RTK (Geiger et al., 2013).
In addition to geometric reconstruction, colour reconstruction is becoming more important with the advances of radiance field methods including Neural Radiance Fields (NeRF) (Mildenhall et al., 2021) and 3D Gaussian Splatting (Kerbl et al., 2023). These methods take as input calibrated camera images and their precise 3D poses (typically estimated using SfM), and output a dense 3D field with volume density (similar to differential opacity) and view-dependent colour. The output radiance field can be used to synthesise photorealistic images using volume rendering techniques. Since radiance field methods are capable of representing complex geometry and appearance, some SLAM systems have adopted radiance fields as their underlying 3D map representation (Sucar et al., 2021; Zhu et al., 2022).
Despite the rapid development of radiance field methods, their use in outdoor mobile robot perception has been less well explored. Radiance field methods are often evaluated by the quality of rendered images on datasets in which the image set points at a single object observed under controlled lighting conditions, often indoors. For a mobile robot operating in an outdoor environment, the trajectory is usually not object-centric, and viewpoints are relatively sparse compared to the size of the scene. Inferring 3D structure from monocular images alone is more challenging when fewer viewpoint constraints are available, and this can lead to artefacts (e.g. the elongated Gaussians along the viewing direction mentioned in Matsuki et al. (2024a)) that are not noticeable when evaluated only from nearby poses. In addition, radiance field methods can generate ‘floater’ artefacts by overfitting to per-frame lighting conditions (as discussed in Tancik et al. (2023)) and to the texture of the sky; both are common challenges in outdoor environments. Such artefacts can lead to inferior 3D reconstruction and poorer photorealistic rendering from poses far from the training sequence. To develop radiance field methods that can be integrated with outdoor SLAM systems, it is crucial to have a dataset with colour images, LiDAR, and accurate ground truth trajectories and reconstructions.
In this work, we introduce the Oxford Spires Dataset, a large-scale dataset collected at six historical landmarks in Oxford, UK, covering on average more than 20 000 m2 per site. The total area recorded in this dataset is more than 125 000 m2, about the size of a small town. It provides high-resolution RGB image streams from three cameras, wide Field-of-View 3D LiDAR data, and inertial data from a mobile handheld device. It is accompanied by millimetre-accurate reference scans (Figure 1) which serve as the reference ground truth 3D model for reconstruction systems; we also use them to determine the ground truth trajectories of the handheld device. The setup of three colour cameras facing forward, left and right is a particularly novel characteristic: similar datasets contain TLS-based ground truth and LiDAR scans but only forward-facing colour camera(s) (Liu et al., 2024; Nguyen et al., 2024; Wei et al., 2024). The side-facing cameras increase the Field-of-View, which not only improves texture-mapping coverage but also provides the view constraints that are crucial for vision-only systems to infer 3D structure. The three-camera setup also makes textured mapping tractable from a simple linear pass through an environment, rather than requiring exhaustive scanning. This makes our dataset suitable for evaluating radiance field methods in outdoor mobile robotics and reality-capture contexts. Leveraging this rich combination of sensor data, we also introduce three benchmarks for localisation, 3D reconstruction, and novel-view synthesis, and use them to evaluate state-of-the-art SLAM systems, SfM-MVS systems and radiance field systems. In particular, the novel-view synthesis benchmark features test data sampled not only from a single reference trajectory, but also from other sequences in which the device travelled in the opposing direction or along a trajectory far from the reference.
The evaluation results of state-of-the-art radiance field methods highlight the problem of overfitting to the training data and an inability to generalise to distant viewpoints. Our dataset opens up new research avenues in this space. We release the raw sensor data as well as processed data, including example outputs from a LiDAR SLAM system (such as motion-undistorted point clouds) and an SfM system (which can be used for MVS, NeRF and 3D Gaussian Splatting), as well as ground truth trajectories and reconstructions. Software for parsing the data and evaluating the systems presented in the three benchmarks is also made available.
In summary, our main contributions are as follows:
• A large-scale outdoor dataset collected at six historical sites, covering an average area of about 20 000 m2 each. In total, 24 sequences were recorded, with the average distance travelled in each sequence exceeding 400 m.
• A sensor suite comprising three 1.6-megapixel global-shutter fisheye RGB cameras, a wide Field-of-View 64-beam 3D LiDAR, and an IMU, paired with millimetre-accurate reference 3D models captured using a TLS.
• Precise calibrations for the synchronised sensors – including the three fisheye cameras, the IMU and the 104° FoV LiDAR.
• Three benchmarks for localisation, reconstruction and novel-view synthesis, with ground truth generated using the 3D models from the TLS. In this paper, we evaluate state-of-the-art SLAM, SfM, MVS, NeRF and 3D Gaussian Splatting methods on each.
• Evaluation software released for using the dataset and benchmarking methods.
Related work
Summary of related datasets for testing localisation and reconstruction methods. Oxford Spires is a large-scale outdoor SLAM dataset with colour images from three cameras, LiDAR, and ground truth trajectories and 3D models. It can be used to evaluate tasks including localisation, reconstruction and novel-view synthesis. Some features of other related datasets – including indoor scenes, greyscale camera images, short-range (<10 m) depth sensing, and imprecise or missing ground truth 3D models – make them unsuitable for our target domain (outdoor SLAM with colour reconstruction).
Datasets for evaluating localisation
Localisation is a key task in robotics and computer vision, and is performed by methods including odometry, SLAM, SfM, place recognition, and relocalisation in a prior map. In indoor environments, cameras and RGB-D sensors are commonly used. TUM RGB-D (Sturm et al., 2012) was one of the first benchmarks to evaluate localisation performance using ground truth trajectories. For visual-inertial SLAM systems, EuROC (Burri et al., 2016) and TUM VI (Schubert et al., 2018) are popular datasets in the research community. For outdoor environments, LiDAR is a common sensor modality and has been used in robotics datasets such as New College (Smith et al., 2009) and NCLT (Carlevaris-Bianco et al., 2016). Other datasets focus on evaluating odometry and SLAM trajectories in the context of autonomous driving, including KITTI (Geiger et al., 2013), Complex Urban (Jeong et al., 2019), and WoodScape (Yogamani et al., 2019). Several datasets, including RELLIS-3D (Jiang et al., 2021), BotanicGarden (Liu et al., 2024), and WHU-Helmet (Li et al., 2023), focus on unstructured natural environments (forests and rural areas). Datasets have also been collected from robot platforms, including quadrupeds (Brizi et al., 2024; Wei et al., 2024) and aerial robots (Li et al., 2024).
To evaluate the accuracy of localisation systems, a precise ground truth trajectory is essential. For self-driving datasets, ground truth trajectories are often obtained by fusing GNSS data with inertial and LiDAR data (Geiger et al., 2013). One limitation of GNSS-based ground truth is that it is not reliable in areas such as urban canyons. Motion capture systems can also be used to obtain ground truth trajectories (Helmberger et al., 2022), although they are often limited to indoor environments. In outdoor environments, Newer College (Ramezani et al., 2020b) generates a centimetre-accurate ground truth by registering LiDAR scans against an accurate prior map obtained using TLS. The approach of registering mobile LiDAR scans to an accurate prior map was also adopted by the authors of BotanicGarden (Liu et al., 2024) and MCD (Nguyen et al., 2024). Hilti-Oxford (Zhang et al., 2022) is notable in achieving millimetre accuracy ground truth for a sample set of stationary poses by using reference targets. Our work follows the approach used in Newer College (Ramezani et al., 2020b) to generate dense ground truth trajectories.
Datasets for evaluating 3D reconstruction
SLAM systems estimate both a robot/sensor trajectory and a map of the environment; however, many SLAM datasets only provide ground truth trajectories. Few datasets evaluate the accuracy of the map reconstruction, because accurate ground truth reconstructions are costly and laborious to obtain. Because of this, some SLAM datasets such as ICL-NUIM (Handa et al., 2014) create ground truth 3D models using simulation, while other datasets, including Matterport3D (Chang et al., 2017) and ScanNet (Dai et al., 2017), use the output of an RGB-D SLAM system as ground truth. Replica (Straub et al., 2019) provides higher-quality 3D meshes than ScanNet and Matterport3D, and its rendered images are photo-realistic. The RGB-D SLAM ground truth approach cannot be adapted to outdoor scenes due to the short range of depth cameras. LaMAR (Sarlin et al., 2022) includes outdoor sequences and a ground truth 3D model built from a combination of VIO, SLAM and SfM. A ground truth 3D model obtained with SLAM is generally not as accurate as what a TLS can produce: survey-grade TLS achieves millimetre-level accuracy (a comparison can be found in ScanNet++ (Yeshwanth et al., 2023)).
Among the datasets that provide precise ground truth reconstruction (obtained from TLS), EuROC (Burri et al., 2016) and ScanNet++ (Yeshwanth et al., 2023) are captured from indoor environments, and hence LiDAR is not used. The only available outdoor SLAM datasets that include accurate ground truth 3D models (to the best of our knowledge) are Newer College (Ramezani et al., 2020b) and Hilti-Oxford-2022 (Zhang et al., 2022). Both datasets use relatively low-resolution greyscale cameras, and therefore are not suitable for colour 3D reconstruction. Compared to them, our dataset provides high-resolution colour images from three cameras, and is hence suitable for not only 3D reconstruction but also novel-view synthesis.
In the field of computer vision, datasets with accurate ground truth reconstruction exist for MVS research, but often they target small-scale indoor scenes. Middlebury (Seitz et al., 2006) was one of the early datasets with ground truth depth obtained using a structured light scanner. DTU (Aanæs et al., 2016) captured individual objects using a robotic arm in a controlled environment, with its ground truth also obtained using structured light scans. ETH3D (Schops et al., 2017) provides both high-resolution images (<80 per sequence) recorded by a DSLR camera, and low-resolution synchronised grey-scale images (∼1000 per sequence), with a ground truth 3D model obtained using a TLS. Tanks and Temples (Knapitsch et al., 2017) is another popular benchmark for 3D reconstruction with ground truth from TLS. It can be used to evaluate both SfM and MVS algorithms and uses a higher-quality camera for its video data.
Datasets for evaluating novel-view synthesis
Radiance fields have emerged as the most promising representation for novel-view synthesis. The input images for radiance field methods are often co-registered using SfM methods such as COLMAP (Schönberger and Frahm, 2016). The original NeRF paper (Mildenhall et al., 2021) used both synthetic datasets and real-world image sequences from LLFF (Mildenhall et al., 2019). Subsequently, Mip-NeRF 360 (Barron et al., 2022) included object-centric framed images taken in indoor and outdoor environments, and is popular in the radiance field community. Radiance field methods are typically evaluated using test images that are sampled from the input trajectory and excluded from training. ScanNet++ (Yeshwanth et al., 2023) used a more challenging evaluation approach: its authors captured test images independently of the training sequence, using a higher-quality DSLR camera, in indoor environments. In contrast, our dataset focuses on large-scale outdoor environments and provides test images captured from sequences with distant viewpoints. In this manner, we aim to advance the generalisation capability of existing radiance field methods.
Hardware
Handheld perception unit
Our perception unit, called Frontier, has three cameras, an IMU and a LiDAR; it is shown in Figures 2 and 3 (right). The three colour fisheye cameras of a customised Alphasense Core Development Kit from Sevensense Robotics AG face forward, left and right. Each camera has a Field-of-View of 126° × 92.4° with a resolution of 1440 × 1080 pixels. A cellphone-grade IMU inside the Alphasense Core Development Kit is hardware-synchronised with the three cameras using an FPGA from Sevensense. The cameras have around 36° of overlap, which enables the multi-camera calibration described in Section “Multi-Camera Intrinsic and Extrinsic Parameters Calibration”. The cameras operate at 20 Hz, and the IMU operates at 400 Hz. Auto-exposure is enabled for the cameras to capture indoor and outdoor scenes with different lighting conditions. A 64-channel Hesai QT64 LiDAR operating at 10 Hz is mounted on top of the cameras, with a Field-of-View of 104° and a maximum range of 60 m; its typical accuracy is ±3 cm.
Figure 2. An isometric view of the sensor setup highlighting the coordinate frames of the cameras, the IMU and the LiDAR.
Figure 3. Leica RTC360 TLS and the Frontier device in Blenheim Palace (left) and Christ Church College (right).

Multi-sensor synchronisation is achieved at both the hardware and software levels. For software synchronisation, we synchronise the device clocks of the Alphasense Core Development Kit (with its three cameras and IMU) and the Hesai LiDAR to the host computer on the Frontier using the Precision Time Protocol (PTP), which achieves sub-microsecond accuracy. For hardware synchronisation, the three cameras are synchronised with the IMU within the Alphasense Core Development Kit, and we synchronise each LiDAR point cloud with the cameras using motion undistortion. While the three cameras’ exposure times are not identical within one shutter cycle due to auto-exposure, the exposure intervals of the cameras are aligned at their midpoints. This means that the timestamps of the middle of the exposure interval of each camera are identical; this timestamp is used as the timestamp of the camera image. The Hesai QT64 LiDAR is a rolling shutter sensor: the point cloud data is not triggered at a specific timestamp but is obtained continuously. To obtain a synchronised LiDAR point cloud for a synchronised set of three camera images, we motion-correct each LiDAR point cloud with IMU preintegration using VILENS (Wisth et al., 2023). The point cloud is undistorted to the time of the next camera image frame. More information regarding the motion undistortion can be found in Wisth et al. (2021). In summary, each node in the SLAM pose graph has three camera images and an undistorted LiDAR point cloud with identical timestamps, which can be used for 3D reconstruction and novel-view synthesis.
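The intuition behind this de-skewing step can be sketched with a constant-velocity motion model: each point is moved to where it would have been measured at the reference timestamp. This is an illustrative simplification of the IMU-preintegration approach described above, not the VILENS implementation; `deskew_sweep`, the planar yaw-rate model and all parameter names are our own.

```python
import numpy as np

def deskew_sweep(points, times, t_ref, vel, yaw_rate):
    """Move each LiDAR point to where it would have been measured at t_ref,
    assuming constant linear velocity `vel` (m/s) and yaw rate `yaw_rate` (rad/s)."""
    out = np.empty_like(points)
    for i, (p, t) in enumerate(zip(points, times)):
        dt = t - t_ref                 # relative motion of the sensor since t_ref
        ang = yaw_rate * dt
        c, s = np.cos(ang), np.sin(ang)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        out[i] = R @ p + vel * dt      # express the point in the frame at t_ref
    return out
```

In the real pipeline, the per-point pose would come from IMU preintegration rather than a single constant twist, but the per-point re-expression of measurements in a common reference frame is the same idea.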
Millimetre-accurate TLS
To obtain an accurate 3D reference model to benchmark localisation and reconstruction, we used a Leica RTC360 TLS (Figure 3, left). It has a maximum range of 130 m and a Field-of-View of 360° × 300°. The final 3D point accuracy is 1.9 mm at 10 m and 5.3 mm at 40 m. The point clouds are coloured using 432-megapixel imagery captured by the scanner’s three cameras. Scans are registered in the field and re-optimised later using Leica’s Cyclone REGISTER 360 Plus software. The average cloud-to-cloud error across our sites ranges from 3 to 7 mm. After merging all the scans, the resultant point cloud is very large (10 GB), so for ease of use we downsampled the TLS scan to 1 cm resolution. Nonetheless, we provide the original raw TLS scans in our dataset.
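The resolution reduction mentioned above is a standard voxel-grid downsample: the cloud is quantised into 1 cm voxels and each occupied voxel is replaced by the centroid of its points. The sketch below is our own illustration, not the dataset's or Leica's tooling.

```python
import numpy as np

def voxel_downsample(points, voxel=0.01):
    """Keep one point (the centroid of its members) per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)   # integer voxel index per point
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.reshape(-1)
    counts = np.bincount(inv)
    out = np.zeros((inv.max() + 1, 3))
    for d in range(3):                                  # accumulate per-voxel centroids
        out[:, d] = np.bincount(inv, weights=points[:, d]) / counts
    return out
```

At 1 cm resolution this keeps geometric detail while reducing a multi-gigabyte TLS cloud to a tractable size.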
Calibration
Multi-camera intrinsic and extrinsic parameters calibration
Mean, median, and standard deviation of reprojection residuals for different calibration types.
IMU calibration
To facilitate visual and LiDAR odometry using the IMU, it is crucial to appropriately model the noise parameters of the IMU. For this, we measured the Allan variance parameters of the IMU accelerometer and gyroscope using an 8-hour data sequence.
Camera-IMU extrinsic calibration
With these accurate camera intrinsic parameters and IMU noise process parameters, we then perform camera-to-IMU extrinsic calibration individually for each of the cameras using Kalibr (Rehder et al., 2016). We cross-validated the consistency of the camera-IMU calibration by measuring the variation in the estimated coordinates of the IMU, using the individual camera-IMU extrinsic parameters and the camera-camera extrinsic parameters.
Camera-LiDAR extrinsic calibration
Camera-LiDAR extrinsic parameters are calibrated in a bundled fashion, with the inter-camera extrinsic parameters from Section “Multi-Camera Intrinsic and Extrinsic Parameters Calibration” held constant.
Figure 4. A single LiDAR point cloud overlaid on the camera images, demonstrating the quality of the camera intrinsics calibration and camera-LiDAR extrinsics calibration. In the left camera, the regions of the building without LiDAR points are due to the LiDAR’s limited sensing range. Note that the motion undistortion process produces the jagged discontinuity in the LiDAR beam pattern in the right camera.
Figure 5. LiDAR overlay on top of the right-facing camera (cam2), demonstrating the effect of accurate LiDAR ego-motion undistortion. The motion here is approximately 0.9 m/s and 30°/s, and the scene depth ranges between 1 m and 10 m. The right camera overlaps the start and end of the LiDAR sweep (the seam shown in green). The top figure shows the unprocessed raw point cloud overlay: the LiDAR points are slightly misaligned near the start of the sweep (b) and significantly misaligned at the end of the sweep (a). With IMU-only point cloud undistortion (bottom figure, (c) and (d)), the overlays are consistent, demonstrating the accuracy of the multi-modal spatiotemporal calibration.

Dataset
Data format
The Oxford Spires Dataset consists of data collected in six sites in Oxford, UK, with multiple sequences taken at each site (Section “Sequence Description”). The data was originally collected as rosbags. We also provide raw sensor data (as individual files) as well as processed data. The processed data includes outputs from an example LiDAR SLAM system and an SfM system. Finally, we also provide the ground truth trajectories and reconstruction.
The following sections describe the raw data formats and the folder structure which we provide for easy use of the data outside of ROS. Figure 6 shows the file structure of the Oxford Spires Dataset.
Raw – camera images
The 20 Hz raw colour fisheye image streams from the three cameras of the Frontier are debayered and stored as 8-bit JPEG images. The three cameras are hardware-synchronised with each other, and hence the image triplets have the same timestamps. The images are stored as
Raw – 3D LiDAR point clouds
3D point clouds were collected using a Hesai QT64 LiDAR at 10 Hz, and stored as
Raw – IMU measurements
The linear acceleration and angular velocity measurements from the IMU are stored in
Processed – VILENS-SLAM outputs
We provide the estimated trajectory and the motion undistorted point clouds output by LiDAR-inertial SLAM (VILENS-SLAM (Ramezani et al., 2020a; Wisth et al., 2023)). The trajectory is saved as
Processed – COLMAP outputs
A solution to SfM is required as input to both MVS methods and radiance field methods (NeRF and 3D Gaussian Splatting). To assist researchers, we ran the state-of-the-art SfM method COLMAP (Schönberger and Frahm, 2016) on each sequence and provide its outputs. Specifically, COLMAP provides camera information in
Running SfM on all images captured at 20 Hz would result in a large amount of output data and computation time. This is unnecessary because consecutive images are very close to each other and thus redundant. To keep the number of images manageable while providing enough viewpoints for visual reconstruction, we selected images that are synchronised with the SLAM pose-graph point clouds and spaced 1 m apart, and then ran COLMAP on this set of images. At walking speed, this corresponds to a frequency of about 1 Hz. For each sequence, the total number of images was less than 2000 for each of the three cameras. Images aligned with corresponding LiDAR depth are also useful for methods that fuse LiDAR and vision, for example colourising point clouds and depth-aided radiance fields such as Urban Radiance Fields (Rematas et al., 2022). These aligned depth images are provided in addition to the COLMAP outputs.
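The 1 m spacing rule can be sketched as a greedy distance filter over the SLAM pose positions. This is illustrative only; `select_keyframes` and its parameters are our own names, not the dataset tooling.

```python
import numpy as np

def select_keyframes(positions, min_dist=1.0):
    """Greedily keep the indices of poses that are at least min_dist (metres)
    from the previously kept pose along the trajectory."""
    kept = [0]
    for i in range(1, len(positions)):
        if np.linalg.norm(positions[i] - positions[kept[-1]]) >= min_dist:
            kept.append(i)
    return kept
```

At a typical walking speed of about 1 m/s, a 1 m threshold reduces a 20 Hz stream to roughly 1 Hz, consistent with the figure quoted above.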
For compatibility with Nerfstudio (Tancik et al., 2023) (a popular open-sourced code base for state-of-the-art radiance field methods), we also convert the outputs from COLMAP into a
Correcting the metric scale of vision-based 3D reconstructions produced by MVS and radiance field methods is necessary to enable comparison with the metric ground truth. To estimate the scale, we used Umeyama’s method to estimate a similarity transformation aligning the reconstruction with the metric ground truth.
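Umeyama's closed-form solution for the similarity transform (scale, rotation, translation) can be sketched in a few lines of NumPy. This is an illustrative implementation of the published method, not the code used for the dataset.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Closed-form similarity transform (s, R, t) minimising ||dst - (s*R@src + t)||,
    for Nx3 point sets src and dst (Umeyama, 1991)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)          # cross-covariance of the two point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                  # guard against reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)  # variance of the source points
    s = (D * np.diag(S)).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Applying the recovered scale `s` to the vision-based reconstruction makes its error metrics directly comparable with the metric TLS ground truth.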
Ground truth – reconstruction
We provide the registered individual TLS scans from Leica RTC360 for each site as the ground truth reconstruction. Each scan is saved as
Ground truth – localisation
The ground truth trajectory is computed by registering each undistorted LiDAR point cloud (as described in Section “Processed - VILENS-SLAM Outputs”) to the merged TLS map described in Section “Ground Truth - Reconstruction”. To obtain the ground truth pose for each LiDAR scan, we use an offline version of VILENS (Wisth et al., 2023) with Iterative Closest Point (ICP) (Besl and McKay, 1992) at its core. Running offline, we can allocate enough time to register the LiDAR scans to the colourised TLS map. The trajectory estimated through this process is synchronised with the cameras by the procedure described in Section “Handheld Perception Unit”. This is the same approach as used for Newer College (Ramezani et al., 2020b) and Hilti-Oxford (Zhang et al., 2022). The accuracy of the ground truth trajectory is approximately 1–2 cm. We validated the ground truth trajectory by projecting the individual LiDAR scans into a map and comparing them to the TLS map. The trajectory is provided as
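The core registration step can be illustrated with a minimal point-to-point ICP. This is a toy sketch with brute-force nearest-neighbour correspondences; the actual pipeline uses VILENS with a production ICP implementation against the TLS map, not this code.

```python
import numpy as np

def icp_point2point(src, tgt, iters=20):
    """Toy point-to-point ICP: returns the 4x4 transform registering src onto tgt."""
    T = np.eye(4)
    cur = src.copy()
    for _ in range(iters):
        # brute-force nearest-neighbour correspondences
        d = np.linalg.norm(cur[:, None, :] - tgt[None, :, :], axis=2)
        nn = tgt[d.argmin(axis=1)]
        # closed-form rigid alignment of the correspondences (Arun's method)
        mu_c, mu_n = cur.mean(axis=0), nn.mean(axis=0)
        H = (cur - mu_c).T @ (nn - mu_n)
        U, _, Vt = np.linalg.svd(H)
        S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ S @ U.T
        t = mu_n - R @ mu_c
        dT = np.eye(4)
        dT[:3, :3], dT[:3, 3] = R, t
        T = dT @ T
        cur = cur @ R.T + t
    return T
```

Real implementations replace the brute-force search with a k-d tree and typically use point-to-plane residuals, but the iterate-correspond-align loop is the same.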
Depth images from the TLS map
Additionally, we include ground truth depth images rendered from the TLS map using the ground truth sensor trajectories. In Figure 7, we show an example of an image rendered from the Keble College site with its corresponding depth image. These images could be used for evaluating monocular depth estimation or novel view rendering.
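Rendering such a depth image amounts to projecting the TLS points into the camera with a z-buffer. The pinhole sketch below ignores the fisheye distortion model of the actual cameras; `render_depth` is our own illustration, not the dataset tooling.

```python
import numpy as np

def render_depth(points_cam, K, h, w):
    """Z-buffer a point cloud (already expressed in the camera frame) into an
    h-by-w depth image using pinhole intrinsics K; empty pixels remain inf."""
    depth = np.full((h, w), np.inf)
    z = points_cam[:, 2]
    valid = z > 0                        # keep points in front of the camera
    uv = K @ points_cam[valid].T
    u = np.round(uv[0] / uv[2]).astype(int)
    v = np.round(uv[1] / uv[2]).astype(int)
    for ui, vi, zi in zip(u, v, z[valid]):
        if 0 <= ui < w and 0 <= vi < h:
            depth[vi, ui] = min(depth[vi, ui], zi)  # keep the closest surface
    return depth
```

For a dense TLS cloud this simple splatting is usually sufficient; mesh rasterisation or splat radii can be used to close pixel-level holes.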
Sequence description
The dataset was recorded in six historic sites in Oxford, UK:
• Bodleian Library (∼37 000 m2)
• Blenheim Palace (∼14 000 m2)
• Christ Church College (∼26 000 m2)
• Keble College (∼18 000 m2)
• Radcliffe Observatory Quarter (ROQ) (∼12 000 m2)
• New College (∼18 000 m2)
Each sequence was collected by walking with the Frontier payload device mounted in a backpack as shown in Figure 3.
Bodleian Library
This site consists of the area around the Bodleian Library, which includes Radcliffe Square, where the Radcliffe Camera and Oxford’s University Church are located. It also covers the outside of the Sheldonian Theatre and part of Broad Street. This part of the dataset contains the most iconic landmarks of the historic centre of Oxford (Figure 1).
For this site, we provide two outdoor trajectories of walking through streets and squares around the described area. The recordings contain many details of the predominant medieval and Gothic buildings.
Blenheim Palace
This is one of England’s largest houses and is notable as Sir Winston Churchill’s ancestral home. Five trajectories were captured in the palace’s main square, the principal hall, and rooms in the west wing, including the library. The trajectories have outdoor and indoor parts, including different-sized rooms and corridors. In Figure 8(b), we show an example of a sequence in the palace’s main square.
Figure 8. Examples of SLAM trajectories (in red) and LiDAR point cloud maps (in blue) for four sequences from the dataset.
Christ Church College
Founded in 1546, Christ Church is a constituent college of Oxford and one of the city’s best-recognised locations. It contains Tom Quad, the largest square in Oxford, the college dining hall as well as Christ Church Cathedral and its cloister.
The sequences recorded at this site include outdoor areas with pavements and lawns as well as indoor parts with varied lighting conditions and stairs accessing different levels, including the dining hall. One sequence, shown in Figure 8(a), includes a complete loop around the perimeter of Tom Quad, which is challenging due to the limited range of the LiDAR sensor and the repeating architecture.
Keble College
Keble is another constituent college of the University of Oxford. It comprises neo-Gothic-style buildings, including a hall and a church. Keble’s buildings are distinctive because they are constructed of alternating red and white coloured bricks – which provides an interesting challenge to visual reconstruction. This differentiates it from the other sites which are mostly built from limestone.
The Keble sequences were recorded outdoors in the college’s squares (Figure 8(c)), which include lawns and trees, as well as some interior and exterior parts.
Radcliffe Observatory Quarter
The ROQ site consists of the Faculty of Philosophy, the Mathematical Institute, and St Luke’s Chapel. This area is near the Oxford Robotics Institute, where the authors are affiliated. The two sequences recorded here contain squares with pavement, lawns, trees, narrow spaces between some buildings, and a fountain containing fine 3D details. In Figure 8(d), we show an example of a sequence through this site.
New College
New College is another constituent college of the University of Oxford and is located in the city’s historic centre. It contains squares, a hall, a church and a cloister. We recorded four sequences in New College, including an oval lawn area at the centre of the main quad surrounded by medieval buildings. The sequences combine outdoor and indoor parts with abrupt changes in lighting conditions. Most of sequence 03 was recorded walking through the college park, which contains a lawn and many trees; some parts are fully covered by tree canopies. This site corresponds to the earlier New College (Smith et al., 2009) and Newer College (Ramezani et al., 2020b) datasets.
Benchmarks and results
In this section, we describe three benchmarks we have created to demonstrate our dataset. The benchmarks compare state-of-the-art methods for localisation (Section “Localisation Benchmark”), 3D reconstruction (Section “3D Reconstruction Benchmark”), and novel view synthesis (Section “Novel View Synthesis”).
Localisation benchmark
In the localisation benchmark, we evaluate the trajectories estimated for each sequence using state-of-the-art LiDAR-inertial SLAM and SfM approaches:
• VILENS-SLAM: VILENS (Wisth et al., 2023) with pose graph optimisation (Ramezani et al., 2020a) (online).
• Fast-LIO-SLAM (Kim et al., 2022): Fast-LIO2 (Xu et al., 2022) with pose graph optimisation and Scan Context loop closures (Kim and Kim, 2018) (online).
• SC-LIO-SAM (Kim et al., 2022): LIO-SAM (Shan et al., 2020) with pose graph optimisation and Scan Context loop closures (Kim and Kim, 2018) (online).
• ImMesh (Lin et al., 2023): LiDAR meshing with Fast-LIO2 odometry (online).
• Fast-LIVO2 (Zheng et al., 2024): LiDAR-visual-inertial odometry (online).
• HBA (Liu et al., 2023): LiDAR bundle adjustment using the VILENS-SLAM result as input (offline).
• COLMAP (Schönberger and Frahm, 2016): Structure-from-Motion using only images (offline). For image matching, we used a sequential matcher with loop-closure detection.
We evaluate only the sequences whose trajectories lie completely within the ground truth TLS map. We exclude Keble College 01, Blenheim Palace 03 and 04, Christ Church College 04 and 06, and Bodleian Library 01 from this analysis.
Evaluation metrics
RMS ATE (m) for each method, computed using the provided ground truth. SC-LIO-SAM fails on some sequences; COLMAP gives incomplete results on some sequences.
To transform the trajectories estimated by the methods to the ground truth coordinate frame (Section “Ground Truth - Localisation”), we use the
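Once an estimated trajectory has been aligned to the ground truth frame, the RMS ATE reported in the table reduces to the root-mean-square of per-pose position errors. A minimal sketch (assuming trajectories are already timestamp-associated; `ate_rmse` is our own name):

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMS Absolute Trajectory Error over timestamp-associated Nx3 positions."""
    errors = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))
```

The alignment step matters: without it, a globally consistent trajectory expressed in a different frame would score an arbitrarily large error.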
Experimental results
A comparison between the listed methods using the provided ground truth (Section “Ground Truth - Localisation”) is presented in Table 3 for each sequence. The offline methods on the right side (HBA and COLMAP) can take advantage of all available data and provide the most accurate results. While one might expect HBA, as a LiDAR bundle adjustment approach, to provide better accuracy, we achieved the best performance with COLMAP. We believe this is due to the overlapping multi-camera configuration, which provides abundant view constraints and allows features to be seen from different perspectives, generally under suitable lighting conditions. Furthermore, the prior map we provide to HBA is imperfect, which means that the undistorted point clouds can be imperfect, in turn leading to inaccuracies in HBA. However, note also that COLMAP was unable to solve some sequences (with poor lighting) at all and instead produced multiple disconnected sub-models. This is usually due to insufficient visual feature matches in the area where two sub-models ought to connect, caused by insufficient visual overlap or challenging lighting conditions (e.g. the dining hall in Christ Church College is relatively dark). In comparison, the LiDAR SLAM systems are unaffected by lighting conditions and the availability of visual features.
The second most accurate method is HBA, which further refines VILENS-SLAM’s trajectory estimate with LiDAR bundle adjustment in post-processing. Of the online methods, VILENS-SLAM and Fast-LIVO2 performed best. Fast-LIVO2 performs better in short-loop sequences (Keble College and Radcliffe Observatory Quarter), as the previously built map is re-observed and its drift is low. In contrast, VILENS-SLAM performs better in large loops as it performs loop closures. ImMesh and Fast-LIO-SLAM use Fast-LIO2 (Xu et al., 2022) as their core odometry module, which achieved accurate trajectory estimation. Fast-LIO-SLAM performs loop closure correction using Scan Context (Kim and Kim, 2018), while ImMesh can recover from drift while generating the mesh map. SC-LIO-SAM produces satisfactory results on some sequences using Scan Context (Kim and Kim, 2018) as an appearance-based place recognition module. However, it adds incorrect loop closures in sequences with large loops and repeated building patterns, such as Christ Church College and Blenheim Palace. We note that these methods could potentially perform better with further parameter tuning.
In Figure 9, we show a representative example of the performance of the evaluated methods using Sequence 01 of Blenheim Palace. All of the methods produce an accurate trajectory except for SC-LIO-SAM, which incorporates an incorrect loop closure when closing the large loop.

A top-down view showing representative performance of the different systems for Sequence 01 at Blenheim Palace. The sequence starts and ends in the lower left. The environment where this sequence was collected can be seen in Figure 8 (b).
3D reconstruction benchmark
The reconstruction benchmark evaluates outputs from the systems that use vision or LiDAR. Specifically, we evaluate the following systems:
• VILENS-SLAM: merged LiDAR point clouds using poses obtained from the LiDAR SLAM system described in Section “Localisation Benchmark” (online).
• OpenMVS: an MVS system which uses input from COLMAP (Schönberger and Frahm, 2016) (offline).
• Nerfacto: the default and recommended method from Nerfstudio (Tancik et al., 2023) that combines features from MipNeRF-360 (Barron et al., 2022), Instant-NGP (Müller et al., 2022) and others. It uses input from COLMAP, as with OpenMVS (offline).
The outputs from each system are all in the form of 3D point clouds. For Nerfacto, the point cloud is generated from the trained model by calculating the expected depth and colour for the training rays, and projecting the depth points into 3D.
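As an illustration of that projection step, the sketch below unprojects a depth image into a world-frame point cloud. It assumes a pinhole model and treats the expected depth as z-depth; if the depth is instead a distance along the ray, the ray directions should be normalised before scaling. The function name and conventions (camera-to-world pose `T_wc`) are our own, not Nerfacto's API.

```python
import numpy as np

def depth_to_points(depth, K, T_wc):
    """Back-project a depth image into a world-frame point cloud.

    depth: (H, W) per-pixel z-depth in metres; 0 marks invalid pixels.
    K: (3, 3) pinhole intrinsics; T_wc: (4, 4) camera-to-world pose.
    Returns an (N, 3) array of world-frame points.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]                  # pixel grids (row, column)
    valid = depth > 0
    pix = np.stack([u[valid], v[valid], np.ones(valid.sum())])
    rays = np.linalg.inv(K) @ pix              # camera-frame directions, z = 1
    pts_c = rays * depth[valid]                # scale each ray by its depth
    pts_w = T_wc[:3, :3] @ pts_c + T_wc[:3, 3:4]
    return pts_w.T
```

In practice the colour rendered for each ray would be attached to the corresponding 3D point to produce a coloured cloud.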
We selected example trajectories that are completely within the ground truth reconstruction from Blenheim Palace, Christ Church College, Keble College and Radcliffe Observatory Quarter.
Evaluation metrics
We use the F-score as the primary metric for reconstruction. The F-score is calculated as the harmonic mean of precision and recall, thus it considers both aspects of the reconstruction: accuracy and completeness. To calculate precision and recall, we consider a point to be a true positive (TP) if the distance from it to the closest ground truth point is within a certain threshold. We report results using 5 cm and 10 cm thresholds. False positives (FP) are reconstructed points that are further from the ground truth and thus inaccurate. False negatives (FN) are regions in the ground truth that have no neighbouring points in the reconstruction, and are thus incomplete. Specifically, precision and recall are defined by

precision = TP / (TP + FP), recall = TP / (TP + FN).
The F-score is then calculated as

F = (2 × precision × recall) / (precision + recall).
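Under these definitions, the metric reduces to a nearest-neighbour query in each direction, e.g. with a k-d tree. The following is an illustrative implementation, not the benchmark's exact code:

```python
import numpy as np
from scipy.spatial import cKDTree

def f_score(rec, gt, tau):
    """Precision, recall and F-score of a reconstruction at threshold tau.

    rec, gt: (N, 3) and (M, 3) point clouds; tau: distance threshold (m).
    """
    d_rec_to_gt, _ = cKDTree(gt).query(rec)   # accuracy: rec -> nearest gt
    d_gt_to_rec, _ = cKDTree(rec).query(gt)   # completeness: gt -> nearest rec
    precision = np.mean(d_rec_to_gt <= tau)
    recall = np.mean(d_gt_to_rec <= tau)
    denom = precision + recall
    f = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f
```

For large clouds, both the reconstruction and the ground truth are typically voxel-downsampled first so that point density does not bias the scores.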
Reconstruction filtering
In practice, the reconstruction and the ground truth reference model will contain regions which were not mutually scanned. If unaccounted for, this would lead to erroneous false positives and false negatives, and in turn to precision and recall measures which do not reflect the true quality of the reconstruction. For a fairer comparison, we filter out points in the reconstruction that fall outside the region covered by the ground truth model, that is, regions that were not scanned when building the ground truth. In particular, for Nerfacto the sky must be specifically removed because, as a dense representation, it attempts to reconstruct the sky using available depth cues. We filter out these sky points to ensure the evaluation focuses on the reconstruction of the physical environment itself.
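One simple way to implement this filtering is to discard reconstruction points that have no ground truth point within a chosen radius. The sketch below is illustrative; the 0.5 m radius is an assumed value, not necessarily the one used in the benchmark:

```python
import numpy as np
from scipy.spatial import cKDTree

def crop_to_ground_truth(rec, gt, radius=0.5):
    """Keep only reconstruction points within `radius` of some ground truth point.

    Removes regions never scanned in the reference model (e.g. sky points in
    Nerfacto clouds) so they are not counted as false positives. The radius is
    an illustrative choice and should exceed the evaluation threshold.
    """
    d, _ = cKDTree(gt).query(rec)
    return rec[d <= radius]
```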
Experimental results
Quantitative evaluation of the 3D reconstructions from VILENS-SLAM, OpenMVS and Nerfacto.

Comparison between the reconstructions achieved by the different methods. The reconstructions in the first three columns are coloured by point-to-point distance to the ground truth model.
Reconstructions from OpenMVS are accurate in regions with abundant view constraints and distinct texture, but it is not able to reconstruct surfaces with uniform texture such as the ground in Blenheim Palace and the lawn in Christ Church College. The error distribution in the MVS cloud is not uniform and tends to appear at surface boundaries where occlusion is an issue.
Although both OpenMVS and Nerfacto are purely vision-based reconstruction methods, Nerfacto point clouds are generally less precise. This is because MVS filters uncertain points (by checking photo-consistency), but the NeRF approach instead optimises a continuous radiance field without an explicit notion of uncertainty. For regions with insufficient view constraints and uniform texture, Nerfacto estimates incorrect depth values which leads to uneven ground reconstructions. In comparison, OpenMVS filters some of the reconstruction there, which leads to better precision and accuracy.
The reconstruction quality is determined not only by the reconstruction method but also by the accuracy of the input trajectory. Both precision and recall can be affected by imperfect trajectory estimation. Clouds produced by VILENS-SLAM contain surfaces with high error that are the result of incorrectly registered LiDAR scans. Meanwhile, for Christ Church College, both the OpenMVS and Nerfacto reconstructions do not contain the dining hall (bottom left in the corresponding reconstructions in Figure 10). This is because the dining hall could not be registered with the outdoor square by COLMAP (partly due to the poor lighting conditions, as discussed in the localisation benchmark results).
Novel view synthesis
We evaluate the quality of novel-view synthesis using the radiance field methods. Specifically, we evaluate: • Nerfacto (Tancik et al., 2023) which is described in Section “3D Reconstruction Benchmark”. • Splatfacto (Ye et al., 2024), an implementation of 3D Gaussian Splatting (Kerbl et al., 2023) with quality comparable to the original implementation.
We also include results using the above methods with increased representation capability, namely Nerfacto-big (Nerfacto with larger hash grid size and proposal network size, and more ray samples) and Splatfacto-big (Splatfacto with lower thresholds for densifying and culling 3D Gaussians, which results in more Gaussians being used). All methods are trained for 5000 iterations. We select one in every 10 images as the in-sequence evaluation images.
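The in-sequence split described above can be sketched as follows (`split_train_eval` is a hypothetical helper name):

```python
def split_train_eval(images, every=10):
    """Hold out every `every`-th image for in-sequence evaluation.

    images: ordered list of image identifiers from one sequence.
    Returns (train_set, eval_set).
    """
    eval_set = images[::every]
    train_set = [im for i, im in enumerate(images) if i % every != 0]
    return train_set, eval_set
```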
Evaluation metrics
We measure the quality of the rendered images using the Peak Signal-to-noise Ratio (PSNR), Structural Similarity (SSIM) (Wang et al., 2004) and Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) metrics, as commonly used in the literature (Barron et al., 2022; Mildenhall et al., 2021; Tancik et al., 2023).
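Of the three, PSNR can be computed directly from pixel values, while SSIM and LPIPS are typically computed with library implementations (e.g. scikit-image and the `lpips` package). A minimal PSNR sketch, assuming images normalised to [0, max_val]:

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak Signal-to-Noise Ratio (dB) between two images in [0, max_val]."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```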
Out-of-sequence novel view synthesis
Evaluations of radiance field methods often use test poses that are close to the training poses, typically because the test and training poses are sampled from a common input trajectory. In downstream applications, however, the ability to render photorealistic images from viewpoints that differ substantially from the training poses is crucial. To facilitate research in this direction, we generate challenging test sets whose viewpoints are very different from the training sets. Specifically, we merged images from different sequences taken at the same site using COLMAP. Then, we manually selected training and test set images that are far apart or have very different view directions. We describe the images selected from the input trajectory as ‘in-sequence’ and images from a separate trajectory with different viewpoints as ‘out-of-sequence’.
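This selection criterion can be approximated programmatically by thresholding the distance and viewing-direction difference between a candidate test pose and the training poses. The sketch below is a hypothetical heuristic with illustrative thresholds, not the manual procedure used to build the benchmark:

```python
import numpy as np

def is_out_of_sequence(test_pos, test_dir, train_pos, train_dirs,
                       min_dist=5.0, min_angle_deg=45.0):
    """Heuristic check that a candidate test view differs enough from training.

    A view qualifies if it is far from every training position, or if its
    viewing direction differs strongly from all nearby training views.
    The 5 m / 45 degree thresholds are illustrative assumptions.
    """
    d = np.linalg.norm(train_pos - test_pos, axis=1)
    if d.min() >= min_dist:
        return True                       # far from the whole training set
    near = train_dirs[d < min_dist]       # directions of nearby training views
    cos = near @ test_dir / (np.linalg.norm(near, axis=1)
                             * np.linalg.norm(test_dir))
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return bool(angles.min() >= min_angle_deg)
```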
Experimental results
Quantitative evaluation of Novel View Synthesis. The test images are selected from the input trajectory (In-Sequence) as well as a separate trajectory with viewpoints far from the input trajectory (Out-of-Sequence).

Illustrative results of Splatfacto-big when evaluated using in-sequence (green) and out-of-sequence (red) trajectories. When the rendering viewpoint is quite different from the training trajectory, the rendered images exhibit many more artefacts. The in-sequence and out-of-sequence trajectories in Radcliffe Observatory Quarter and Blenheim Palace are in different directions, while the trajectories in Keble College have similar viewing directions but are from distant positions. In our tests, we found that Splatfacto-big generates more visual artefacts than Nerfacto-big.
For the methods we tested, we found that all are capable of generating reasonably photorealistic images when rendering from in-sequence poses. A key difference between Nerfacto and Splatfacto is the rendering speed at test time: both Splatfacto and Splatfacto-big render at 3.5 Hz on average, while Nerfacto renders at 1.25 Hz and Nerfacto-big at 0.57 Hz. When using the ‘big’ version of Nerfacto and Splatfacto, the rendering quality is generally better, with LPIPS improved by 9.6% and SSIM by 2% on average. This improvement is not always reflected in the PSNR measure, because PSNR is also affected by per-frame appearance differences (e.g. lighting) (Martin-Brualla et al., 2021), as illustrated in Figure 12. For this reason, we give more consideration to changes in LPIPS and SSIM, since they are more invariant to lighting differences. The appearance difference issue can potentially be addressed by techniques such as test-time appearance encoding optimisation (Martin-Brualla et al., 2021).

Comparison between an evaluation image and a rendered image from Nerfacto (Tancik et al., 2023). The PSNR metric is affected not only by the visual scene, but also by the lighting difference. Nerfacto uses per-frame appearance encodings (Martin-Brualla et al., 2021) which are optimised during training. When rendering at test time, Nerfacto uses the averaged appearance encoding of the training images.
Limitations and future work
The AlphaSense cameras have limited dynamic range. To be able to capture both dark and bright environments, we used the auto-exposure function of the cameras during the data collection. However, images captured using auto-exposure will have inconsistent pixel intensity when observing the same 3D structure from different viewpoints. Because of this, merging colourised LiDAR point clouds would lead to a mixture of different colours in the reconstruction. We believe this is an important research question, and there are several promising directions to address this issue, including estimating image exposure time (e.g. R3LIVE++ (Lin and Zhang, 2024)) and modelling image appearance as latent features (e.g. NeRF-W (Martin-Brualla et al., 2021), GS-W (Zhang et al., 2024)).
Our localisation and reconstruction benchmarks primarily evaluate classical SLAM systems. The benchmarks can be extended to include more recent learning-based SLAM systems such as DROID-SLAM (Teed and Deng, 2021), Gaussian Splatting SLAM (Matsuki et al., 2024b), MASt3R-SLAM (Murai et al., 2025) and PIN-SLAM (Pan et al., 2024). Evaluating these methods could provide more insights into their performance in large-scale outdoor data and facilitate the development of new approaches.
Conclusions
We present a large-scale dataset with colour images and LiDAR scans paired with high-quality ground truth 3D models and sensor trajectories. We demonstrate that the dataset is suitable for evaluating a variety of tasks in robotics and computer vision including LiDAR SLAM, Structure-from-Motion, Multi-View Stereo, Neural Radiance Fields and 3D Gaussian Splatting. The scale of the provided data sequences and the quality of the ground truth trajectories and reconstructions make the dataset suitable for evaluating large-scale localisation and 3D reconstruction methods in outdoor environments. In addition, the colour cameras used in our dataset make it suitable for evaluating radiance field approaches, and encourage the development of SLAM systems integrated with radiance field representations. In particular, we demonstrate that state-of-the-art radiance field methods require further development to be applicable in the robotics context, due to inaccurate 3D geometry and limited generalisation capability when tested with poses distant from the training sequence.
