Abstract
Keywords
Introduction
Highly turbulent, internal flows underpin crucial technology in a range of sectors from transport and power generation to chemical processing and biofluids. Defined as flows that are bounded by ducts or channels, internal flows are needed in situations where the direction and supply of a fluid needs to be controlled, 2 with applications as diverse as the flows through the heart and lungs,3,4 home appliances 5 and industrial furnaces. 6 In particular, in the automotive, marine and aerospace industries, these flows are used to power propulsion systems that are essential to the development of highly efficient and low-carbon transport solutions.7,8 Experimentally, these flows are often characterized using velocity measurements from particle image velocimetry (PIV). 9 The PIV method generates velocity vectors at discrete points in the flow, such that spatially-dependent flow behaviour can be easily observed and quantified. An example of a post-processed image from PIV can be seen in Figure 1. However, conducting PIV for internal flows presents a unique set of challenges, as it is often difficult to achieve a full field of view. Gaps in the data arise due to shadowing (occlusions due to walls or other components), laser alignment issues, irregular seeding density of the tracer particles, background reflections, light scatter and strong out-of-plane motion for 2D measurements.10,11

Example PIV image, showing a circular field of view. At each pixel the arrows show the direction of the turbulent flow and the colourmap shows the velocity magnitude.
Attempting to rectify gappy PIV data through experimental reruns may be costly or in some cases impossible, and as design work becomes increasingly digitalised, experimental data will need to be compared to or assimilated with typically clean simulation data for validation purposes12,13 or in the construction of digital twins. 14 The pursuit of accurate digital twins and simulations of complex geometries has the potential for monumental savings in the time and costs associated with mechanical engineering design, by lessening the need to build and test prototypes.15,16 Developing accurate methods for reconstructing full flow fields from gappy PIV data is therefore a topic of significant interest in the study of industrially-relevant flows and the development of related technologies. Furthermore, flow reconstruction models can provide valuable insight into turbulent flow behaviour more generally, by learning mappings from observable parts of the flow to obscured regions which could be leveraged towards a range of other flow reconstruction problems such as 2D–3D prediction 17 or full state reconstruction.18,19
The development of numerical methods that can fill gaps in spatio-temporal turbulent flow data has a history spanning several decades. Of particular note is the family of methods stemming from what came to be known as the gappy proper orthogonal decomposition (GPOD), introduced by Everson and Sirovich. 20 These methods employ the POD (akin to principal component analysis) to identify dominant flow structures in a dataset, which are used to inform the velocity predictions inside the gaps. The predictions are updated iteratively as the number of POD modes (principal components) considered for the reconstruction is incremented. GPOD became widely-used in the field of turbulent flow diagnostics and received several updates and improvements.21–24 Part of the reason for GPOD’s popularity in the fluid mechanics community is due to the emphasis that POD methods place on dominant low-rank features, which may be analogous to the concept of coherent structures in turbulent flows.25–27 Therefore, results from GPOD retain a degree of physical explainability. In addition, as fluid flow data are often negatively affected by noise, outliers and potentially less relevant small-scale turbulent structures,28,29 high levels of reconstruction accuracy can be achieved by mainly focusing on these low-rank structures.10,23
Beyond traditional fluid mechanics research, the restoration of missing or damaged regions of an image is a well-known task in the image processing and computer vision fields, referred to therein as inpainting.30,31 As such, data-driven methods developed in these domains can also be adapted to the problem of gappy PIV image reconstruction presented here. For example, autoencoder neural networks have been widely used in turbulent flow applications, partly because the dimensionality reduction through the latent space maintains the focus on dominant flow structures.18,32 Convolutional neural networks (CNNs) are also widely used due to their ability to utilise the local spatial relationships inherent in turbulent flow data on grids.33–35 In particular, the UNet architecture 36 has demonstrated success in a variety of tasks including flow field prediction and super resolution.37–41 However, UNets are thought to exhibit some difficulty in capturing the dependency relationships of global features, 42 due to the relatively slow expansion of the receptive field through the convolutional layers.
Higher levels of performance have been reported by combining UNets with transformer modules for turbulent flow field prediction and reconstruction from limited measurements,43–45 due to an enhanced ability to capture the multi-scale relationships present in turbulent flows more accurately. Regarding alternative architectures, generative adversarial networks (GANs) have been used successfully in preserving multi-scale statistics of turbulent flows for super-resolution46,47 and inpainting. 48 Physics-informed neural networks (PINNs) 49 also show promise for creating more generalizable ML models, especially for laminar and fully 3D flows.50–52 However, the suitability of PINNs for the present 2D PIV setup is currently unclear, as the only data available are two velocity components along a 2D slice of a 3D system, which stretches the validity of the conservation equations. In addition, the spatially correlated noise introduced by the cross-correlation algorithm in the PIV process has been shown to significantly degrade the results from a PINN. 53 Other developments such as graph neural networks54–56 and neural operators57–59 have demonstrated success in scientific ML applications, but have not seen significant use in inpainting to date.30,31
In the literature on inpainting for turbulent flows, there is a noticeable lack of discussion on how artificial gaps should be created for training and testing ML models. Common methods of adding gaps to clean data include random noise,60,61 clustered dropouts22,23 and block gaps.24,39,62 Rectifying random noise and clustered dropouts is often an easier task for ML models due to the large amount of spatially local information that can be interpolated, as demonstrated by Buzzicotti et al. 63 Conversely, due to the larger gap sizes, block gaps can be more challenging to handle. However, studies to date have only considered blocks of standard shapes and fixed orientations, which can be unrealistic and of limited use in practical scenarios where complex geometries can obscure the field of view in any number of ways. Furthermore, there is a large number of inpainting models available to choose from, and it is not straightforward to determine which scenarios will cause one model to perform better than another, and why. In addition, there is a lack of clarity in how different gap-handling approaches affect the results of an inpainting model. These points motivate the need for an objective benchmark to be established.
Benchmarks typically consist of an open-source dataset and a well-defined task that can be used to objectively compare model performances, and they can be essential for developing numerical methods. For example, the ImageNet Large Scale Visual Recognition Challenge benchmark is often credited with catalysing the deep learning explosion, having facilitated the development of the famous AlexNet model.64,65 Within the realm of turbulent flow research, several large flow physics datasets exist, including the Johns Hopkins Turbulence Database, 66 BLASTNet database 67 and the turbulence data from McConkey et al. 68 However, these datasets represent idealised flows over small domain sizes, which do not reflect the complex geometries and operating environments associated with physical machinery. Other databases address more practical geometries and domain sizes, such as the AirfRANS dataset for airfoil shape optimisation 69 and the Cambridge-Sandia burner for a variety of swirling stratified flows. 70 Regarding engine-specific flows, PIV datasets have been published by the Engine Combustion Network (ECN) 71 and the General Motors University of Michigan Automotive Cooperative Research Laboratory. 1 Data collected by the latter for the TCC-III combustion chamber are used in the benchmark established here, as the TCC-III setup was specifically designed to challenge the predictive capabilities of computational fluid dynamics (CFD) simulations, with the geometry promoting extremely complex fluidic motion via strong turbulence and high cycle-to-cycle variations. 1 In a development philosophy similar to that of CFD models, it is expected that ML models can be made more generalisable by being trained on highly challenging datasets. 72 These characteristics of the TCC data also proved valuable in the development of non-parametric dimensionality reduction approaches in the early 2010s.73,74 A schematic of the TCC–III setup is provided in Figure 2.

Schematic showing the TCC–III and associated PIV measurement planes.
This work aims to lower the barrier to entry into ML for engine researchers and makes the following contributions. A flow reconstruction target is proposed in the form of the edge gaps inpainting task, which challenges the ML models while emphasising practical relevance to engine research. A novel data augmentation method designed to add random edge gaps into the training data is introduced, which outperforms the other standard methods tested here. The performances of five neural network-based models are benchmarked against the GPOD method, providing engine researchers with an objective basis for comparison that can be used to inform future model selection. The overall process followed by this work is illustrated in Figure 3. All relevant code is published along with quick-start tutorials at the web link provided in the abstract in order to promote transparency. Overall, the intention of this work is to accelerate the development and adoption of inpainting models and related ML techniques within the engine research community.

Flow chart depicting the benchmark creation process followed here. Descriptions of the task, data and model implementations are provided in the Benchmark setup section.
Benchmark setup
Engine system
The TCC–III is a port-injected, spark-ignition, single-cylinder optical research engine with a single intake and exhaust valve each and a pancake-shaped combustion chamber consisting of a flat head and piston. 1 It has a bore × stroke of 92 × 86 mm and a geometric compression ratio of 10:1. Details of the operating conditions that produced the data used in this work are provided in Table 1. Optical access is provided via a full quartz cylinder and a 70 mm diameter flat quartz piston window. A dual-cavity Darwin Duo, Quantronix laser was used to illuminate silicone-oil seeder droplets approximately 1 µm in diameter and images were taken with a high-speed monochrome Phantom v1610 camera (Vision Research). A multi-pass algorithm was used to process the vectors, with a decreasing interrogation window size from 128 × 128 to 32 × 32 pixels with 50% overlap. The final window size produced vectors with a spatial resolution of 1.25–1.4 mm.
Key EngineBench dataset information.
The PIV data are released publicly and were created with funding from General Motors through the General Motors University of Michigan Automotive Cooperative Research Laboratory, Engine Systems Division. The TCC–III was intended to provide engine-relevant data for assessing and validating large-eddy simulation (LES) models, with a history dating back to the TCC-0 in the 1990s. 1 In particular, the pancake chamber design simplified meshing for CFD models and produced extremely large cyclic variations in order to challenge CFD predictive capabilities. More details on the experimental set-up are provided in Schiffman et al. 1
Dataset
Two databases are proposed in this work: EngineBench and EngineBench LSP small. The EngineBench database consists of PIV data from motored (i.e., unfuelled) operation of the TCC-III engine. 1 The full database contains over 400,000 PIV images, coming to a size of 31 GB, as listed in Table 1. The dataset is stored on Kaggle as a series of h5 files, as the natively hierarchical format simplifies the chunking of data so that train/validation/test splits can be separated by specific phase angles or test points. In addition, h5 files support lazy loading and the binary file format allows for efficient data storage. A diagram illustrating the hierarchical structure of each h5 file is given in Figure 4.

Generalised h5 file structure in EngineBench.
In order to accelerate training times across the numerous model configurations tested for the benchmark (44 in total), a subset of the EngineBench data, named EngineBench LSP small, was constructed and used to generate the results. The use of a subset also makes the benchmarking results more accessible to researchers with limited computer memory and informs practitioners on how the ML models perform with smaller datasets. The subset was constructed solely using data from the lower swirl plane (LSP), as the field of view remains constant with the changing crank angle position, simplifying the analysis. Five crank angles are extracted at phases of interest throughout the engine cycle at one operating point, as presented in Table 1. EngineBench LSP small therefore contains 5205 PIV snapshots in total and is also hosted on Kaggle, accompanied by tutorial notebooks to demonstrate how the data can be interacted with. Finally, the original spatial dimensions of each image are 50 × 49 pixels; zero padding is therefore added around the edges to bring the images to 128 × 128 for compatibility with standard ML models.
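To illustrate how these files might be accessed, the sketch below lazily loads a single snapshot with h5py and zero-pads it to 128 × 128. The group and dataset names (velocity components keyed by crank angle) are hypothetical placeholders rather than the actual EngineBench hierarchy, which is documented in the tutorial notebooks.

```python
import h5py
import numpy as np

def load_snapshot(h5_path, crank_angle, index, target_size=128):
    """Lazily load one PIV snapshot and zero-pad it for the ML models.

    The group/dataset names '<crank_angle>/u' and '<crank_angle>/v' are
    placeholders; the real EngineBench hierarchy may differ.
    """
    with h5py.File(h5_path, "r") as f:
        # h5py reads only the requested slice from disk (lazy loading).
        u = f[f"{crank_angle}/u"][index]  # first in-plane velocity component
        v = f[f"{crank_angle}/v"][index]  # second in-plane velocity component

    field = np.stack([u, v], axis=0)  # shape: (2, 50, 49) for the LSP data
    _, h, w = field.shape
    pad_top = (target_size - h) // 2
    pad_left = (target_size - w) // 2
    # Centre the original field inside a zero-padded 128 x 128 canvas.
    padded = np.zeros((2, target_size, target_size), dtype=field.dtype)
    padded[:, pad_top:pad_top + h, pad_left:pad_left + w] = field
    return padded
```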
Target
The goal of the benchmark was chosen to be the inpainting of so-called ‘edge gaps.’ In this work, edge gaps are defined as large blocks of missing data at the edges of the field of view. This type of gap was selected for a number of reasons. Firstly, edge gaps are more realistic than other types of gap, such as randomly-located blocks, as they commonly occur in PIV setups that have restricted optical access due to walls.75,76 In addition, it is especially challenging to predict the flow inside edge gaps, as there is a limited amount of local information that can inform the models. From the model’s perspective, predicting the flow inside edge gaps is therefore akin to extrapolation beyond the field of view. This difficult challenge is intended to push the boundaries of what is possible with flow field data reconstruction.
A consistent test case is therefore constructed using edge gaps to benchmark the performance of the inpainting models. Two masks of a fixed shape are constructed that each remove the data at a proportion of the pixels at the edges of the field of view. A vertical mask is applied to the first half of the test set and a horizontal mask to the latter half. In addition, two gap sizes are tested, consisting of 10% and 25% of the data missing. An example test flow field with 10% of the data missing is shown in Figure 5. Within EngineBench LSP small, all the data at 180 CAD aTDCf are held out for the test case in order to assess the generalisability of the models. The flow fields at 180 CAD are notoriously challenging to predict, as the piston is on the point of switching its direction of travel, causing the flow patterns to be highly variable. 72
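As an illustration of how such a fixed test mask could be constructed, the sketch below removes two bands of rows (or columns) at opposite edges of the valid field of view; the exact band widths and shapes of the benchmark masks may differ, so this should be read as a hypothetical construction rather than the released implementation.

```python
import numpy as np

def fixed_edge_mask(fov_mask, fraction, orientation="horizontal"):
    """Flag roughly `fraction` of the valid field-of-view pixels as a gap,
    as two bands at opposite edges (illustrative construction only).

    fov_mask : boolean array marking valid (non-padded) pixels.
    """
    gap = np.zeros_like(fov_mask, dtype=bool)
    rows = np.where(fov_mask.any(axis=1))[0]
    cols = np.where(fov_mask.any(axis=0))[0]
    if orientation == "horizontal":  # bands at the top and bottom
        n = max(1, int(round(fraction * len(rows) / 2)))
        gap[rows[:n], :] = True
        gap[rows[-n:], :] = True
    else:                            # bands at the left and right
        n = max(1, int(round(fraction * len(cols) / 2)))
        gap[:, cols[:n]] = True
        gap[:, cols[-n:]] = True
    return gap & fov_mask            # only remove pixels inside the FOV
```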

Example PIV image from EngineBench LSP small, with the horizontal edge gaps added at the top and bottom of the field of view.
Models and training
The performance of four different model architectures is benchmarked in this study. Firstly, adaptive median filter GPOD (GPOD-MF) is chosen as a best-in-class non-parametric approach, known to outperform interpolation and other GPOD methods. 23 Secondly, the UNet model 36 is chosen due to its wide usage in turbulent flow research. Three loss functions are tested with the UNet: a mean square error (MSE) loss, a Huber loss in order to test the effect of outliers in the PIV data and a physics-based gradient loss. Further details of the loss functions are given later. Thirdly, the UNet transformer (UNETR) model 77 with an MSE loss is chosen due to the performance enhancements reported from its transformer module, using the Project MONAI implementation. 78 Finally, an adapted version of the context encoder generative adversarial network (CE-GAN) 79 with MSE and adversarial losses is implemented due to its high performance in standard inpainting tasks30,80 and recent usage in turbulent flows.48,63 As the original context encoder was designed for inpainting gaps of a fixed size and location, the network architecture is modified in a similar fashion to the changes made by Li et al. 48 For the generator, an additional de-convolutional layer is included at the output to return a prediction of the same spatial dimensions as the input, forming a symmetrical autoencoder architecture. To correspond with the generator modifications, an extra convolutional layer is added at the beginning of the discriminator to handle inputs of the same size as the original data. A dropout layer with a probability of 50% is also added at the output of the discriminator, following Li et al. 48 Summaries of the models implemented here are provided in Table 2 for reference.
ML sizes in millions of parameters and loss functions tested.
During the ML model training, the losses between the predictions and the labels are calculated across the entire image, not just inside the gap. This approach provides a number of benefits: it simplifies the random gaps training process, retains the context of the broader turbulent flow and field of view, and provides practitioners with a visual representation of how the network relates the prediction inside the gap to the rest of the field, avoiding edge effects in the output. Performance metrics on the test set predictions are then reported for the central regions as well as the edge gap regions. Training hyperparameters are chosen to reflect other turbulent flow ML studies.48,67 Training in all cases was run over 300 epochs. For the UNet and UNETR models, the learning rate was 1e–3 and multiplied by a factor of 0.5 every 50 epochs via a step scheduler. For the CE-GAN, the learning rate was 1e–4 and multiplied by 0.75 every 50 epochs. All architectural hyperparameters were retained from their original studies; however, the configurations are not guaranteed to be optimal for this specific case, as the focus of this work is on the development of an objective benchmark rather than the optimisation of the underlying ML models at this stage.
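A minimal sketch of this training configuration for the UNet/UNETR case is given below. The learning rate, step schedule, epoch count and full-image loss follow the values quoted above, while the optimiser choice (Adam), the stand-in model and the random data are illustrative assumptions included only to make the snippet self-contained.

```python
import torch

# Stand-in model: a single conv layer in place of the UNet/UNETR (assumption).
model = torch.nn.Conv2d(2, 2, kernel_size=3, padding=1)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)          # optimiser is assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=50, gamma=0.5)
criterion = torch.nn.MSELoss()  # computed over the whole image, not just the gap

num_epochs = 300
for epoch in range(num_epochs):
    # One illustrative batch of gappy inputs and clean targets per epoch.
    gappy_input = torch.randn(8, 2, 128, 128)
    clean_target = torch.randn(8, 2, 128, 128)

    optimiser.zero_grad()
    prediction = model(gappy_input)
    loss = criterion(prediction, clean_target)  # full-image loss
    loss.backward()
    optimiser.step()
    scheduler.step()  # halve the learning rate every 50 epochs
```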
Finally, the training, validation and testing datasets were split by crank angle. As previously mentioned, 180 CAD was held out for the test set, while different permutations of the other four phase angles are then used to construct the training and validation sets, with three phases for training and one for validation. The training process for each model was run three times with different permutations of training and validation phase angles, which tests the sensitivity of the model performances to the specific phases chosen for the analysis and provides the error bars for the results. The crank angle permutations are defined in Table 3. Only three permutations are considered out of the possible four as the spread across permutations was found to be acceptably low on all metrics. The resultant numbers of images in the train, validation and test sets are 3123, 1041 and 1041, respectively.
Definitions of phase angle permutations that comprise the training, validation and hold-out test sets. The different permutations are denoted as A, B and C, and the corresponding phase angles are given in crank angle degrees (CAD).
Metrics
A variety of metrics are used to evaluate the model performances, in order to quantify pixelwise accuracy, vector similarity and multi-scale phenomena. The relative L2 error is used to quantify pixelwise accuracy and is calculated between the true and predicted velocity fields, normalised by the magnitude of the true field. Vector similarity is quantified using the relevance index (RI), which measures the alignment between the true and predicted velocity vectors, and the magnitude index (MI), with the MI varying between 1 for vectors of identical magnitude and 0 for totally disparate vector magnitudes. Finally, in order to capture the multi-scale turbulent flow features, the energy spectrum of each prediction is compared against that of the ground truth using the Kullback–Leibler (KL) divergence, ranging from 0 for identical distributions to infinity for a complete divergence.
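For reference, the sketch below gives these metrics in a commonly used form, with $\mathbf{u}$ the ground truth field, $\hat{\mathbf{u}}$ the prediction and $E(k)$, $\hat{E}(k)$ their normalised energy spectra. These are assumed conventional definitions, and the exact normalisations in the released benchmark code may differ.

```latex
% Assumed conventional definitions (not taken from the benchmark code):
\begin{align}
  L_2 &= \frac{\lVert \hat{\mathbf{u}} - \mathbf{u} \rVert_2}{\lVert \mathbf{u} \rVert_2} ,\\
  \mathrm{RI} &= \frac{\langle \mathbf{u}, \hat{\mathbf{u}} \rangle}
                     {\lVert \mathbf{u} \rVert_2 \, \lVert \hat{\mathbf{u}} \rVert_2} ,\\
  \mathrm{MI} &= 1 - \left| \frac{\lVert \mathbf{u} \rVert_2 - \lVert \hat{\mathbf{u}} \rVert_2}
                                 {\lVert \mathbf{u} \rVert_2 + \lVert \hat{\mathbf{u}} \rVert_2} \right| ,\\
  D_{\mathrm{KL}}\!\left(E \,\Vert\, \hat{E}\right) &= \sum_{k} E(k)\, \log \frac{E(k)}{\hat{E}(k)} .
\end{align}
```

Under these assumed forms, the RI is the cosine similarity between the flattened velocity fields (hence its sinusoidal dependence on vector alignment), while the MI compares field magnitudes linearly, consistent with the behaviour described in the Results.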
Loss functions
The widely-used mean-squared-error (MSE) loss penalises the squared element-wise difference between the predicted and true velocity fields, averaged over all pixels and velocity components.
The Huber loss is a hybrid loss function that reduces sensitivity to outliers by applying an L1 loss to element-wise errors above a certain threshold (delta) and a quadratic loss otherwise to aid convergence. It is defined per pixel and then averaged over all pixels in the image pairing. A smaller value of the delta threshold means that a greater proportion of the element-wise errors fall under the L1 term, making the Huber loss more robust to outliers.
The combined generator loss for the CE-GAN is a weighted sum of the MSE reconstruction loss and the adversarial loss. The adversarial ratio sets the relative weighting of the adversarial term and was tuned via a grid search, with the results reported in the accompanying table.
Adversarial loss lambda tuning results for the 180 CAD test case with 10% edge gaps. One result for each setup using permutation A is reported.
The gradient ratio plays an analogous role for the physics-based gradient loss, which penalises differences between the spatial velocity gradients of the prediction and the ground truth and is added as a weighted term to the MSE loss. Here, a larger ratio places greater emphasis on regions of the flow with strong velocity gradients.
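A minimal sketch of these loss terms in standard form is given below. The symbols $\lambda_{\mathrm{adv}}$, $\lambda_{\mathrm{grad}}$ and $\delta$ are illustrative names for the adversarial ratio, gradient ratio and Huber threshold, and the exact formulations should be taken from the released code.

```latex
% Illustrative forms; symbol names are assumptions, not taken from the code:
\begin{align}
  \mathcal{L}_{\mathrm{MSE}}  &= \frac{1}{N}\sum_{i=1}^{N}\left(\hat{u}_i - u_i\right)^2 ,\\
  \mathcal{L}_{\delta}(e_i)   &= \begin{cases}
                                   \tfrac{1}{2}\, e_i^2 , & |e_i| \le \delta ,\\[2pt]
                                   \delta\left(|e_i| - \tfrac{1}{2}\delta\right) , & |e_i| > \delta ,
                                 \end{cases}
                                 \qquad e_i = \hat{u}_i - u_i ,\\
  \mathcal{L}_{\mathrm{gen}}  &= \mathcal{L}_{\mathrm{MSE}} + \lambda_{\mathrm{adv}}\, \mathcal{L}_{\mathrm{adv}} ,\\
  \mathcal{L}_{\mathrm{grad}} &= \mathcal{L}_{\mathrm{MSE}} + \lambda_{\mathrm{grad}}\,
                                 \frac{1}{N}\sum_{i=1}^{N} \lVert \nabla \hat{u}_i - \nabla u_i \rVert_2^2 .
\end{align}
```

Here $\mathcal{L}_{\mathrm{adv}}$ denotes the generator's standard adversarial objective against the discriminator.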
Data augmentation
One of the key considerations of this work is in how artificial gaps should be introduced into the data to train the models. This can be handled via data augmentation at training time. Three different techniques were investigated in this work: introducing fixed horizontal and vertical edge gaps like the test case (fixed edge), blocks of various sizes and locations (random blocks), and edge gaps of random size and orientation (random edge gaps). Some example random block gaps are shown in Figure 6, where the yellow regions indicate areas where data were removed from the snapshots.

Samples of the randomly-generated block and edge gaps used to train the models in this study, for gap sizes of 10% and 25% of the total area.
The random edge gaps are constructed by taking four random points along the input image borders, drawing a polygon between the points and masking out pixels that lie outside of the polygon. There can be a maximum of two points on any one edge. This approach ensures that edge gaps are created with random sizes and orientations, to prevent the models from overfitting to specific gap shapes and locations. A maximum percentage of the pixels are allowed to be removed by the mask; the mask is discarded if it removes more pixels than this, and a replacement mask is generated. This upper threshold for the gap sizes is needed to constrain the training process, prevent the inpainting task from becoming overly challenging and reflect more realistic physical scenarios. A histogram showing the proportion of pixels removed for each snapshot in one pass of the training set for 10% gaps is shown in Figure 7. A regular PIV snapshot is shown alongside two snapshots with random edge gaps added in Figure 8.
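A sketch of this augmentation procedure is given below. Details such as the polygon point ordering, the point-in-polygon test and the threshold handling are illustrative assumptions rather than the exact released implementation, and the gap fraction is measured over the full padded image here for simplicity.

```python
import numpy as np
from matplotlib.path import Path

def random_edge_mask(height=128, width=128, max_fraction=0.10, rng=None):
    """Generate a random edge-gap mask: four points on the image borders
    (at most two per edge) define a polygon, and pixels outside it form the
    gap. Masks removing more than max_fraction of the pixels are rejected
    and regenerated. Illustrative sketch only.
    """
    rng = np.random.default_rng() if rng is None else rng
    while True:
        # Assign each of the 4 points to an edge, with at most two per edge.
        edges = rng.choice(4, size=4, replace=True)
        while np.max(np.bincount(edges, minlength=4)) > 2:
            edges = rng.choice(4, size=4, replace=True)
        pts = []
        for e in edges:
            t = rng.uniform()
            if e == 0:   pts.append((t * (width - 1), 0))            # top edge
            elif e == 1: pts.append((width - 1, t * (height - 1)))   # right edge
            elif e == 2: pts.append((t * (width - 1), height - 1))   # bottom edge
            else:        pts.append((0, t * (height - 1)))           # left edge
        pts = np.array(pts)

        # Order points by angle around their centroid to form a simple polygon.
        centre = pts.mean(axis=0)
        order = np.argsort(np.arctan2(pts[:, 1] - centre[1], pts[:, 0] - centre[0]))
        poly = Path(pts[order])

        xs, ys = np.meshgrid(np.arange(width), np.arange(height))
        inside = poly.contains_points(np.column_stack([xs.ravel(), ys.ravel()]))
        gap = ~inside.reshape(height, width)   # gap = pixels outside the polygon
        if gap.mean() <= max_fraction:         # reject overly large gaps
            return gap
```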

Histogram showing the proportion of pixels removed by the random edge masks in one pass through the training set. Seven percent of the total pixels in the field of view were removed on average.

Example random edge gap creation. From left to right: (a) original image, (b) image with a random edge gap polygon superimposed in red and (c) edge gaps added to regions outside of the random polygon.
Results
Training gaps
Firstly, the different artificial gap handling strategies previously described were tested with the UNet model, in order to establish the optimal training pipeline. The results for the four metrics tested are given in Table 6, with separate reports for the central image regions and the gap regions. Overall, the accuracy inside the central regions is very high in all cases, with RI = 0.999 and a pixelwise L2 error of ≈3%. This shows that the UNets are able to preserve the information provided to them at the input to a very high degree, despite compressing the data through the bottleneck. As expected, the accuracy inside the gap regions is worse, as the UNet is required to extrapolate beyond the field of view that was supplied at the input. However, the results still appear to be passable, with RI up to 0.88 at the 10% gap size and 0.82 at 25%.
Impact of training a UNet, MSE model on the fixed edges used as the testing gaps, random edge gaps, random block gaps and a combination of the latter two. One result for each setup using permutation A is reported. The final figure represents the average over the 1041 test images.
Training the UNet on fixed edge gaps, which have the same form as the test gaps, generally produced the highest accuracies in the image centres. This simpler training strategy allowed the model to focus more on the global flow patterns provided at the input, as the location of the gaps did not change from image to image. This emphasis on general flow patterns also helped to yield the best KL divergences within the edges at both 10% and 25% gap sizes. However, the weaker RI and L2 scores at the edges indicate that over-fitting the model to the fixed mask shape prevented it from generalising as well to the more specific localised flow behaviour at the unseen crank angle. Conversely, for the random edge gaps, the addition of significant variability to the process made it more challenging for the UNet to learn the global flow distributions as precisely. However, the random edge training did improve the predictions of local details and vector orientations, with the strongest RI and L2 scores at the edges.
The random blocks and combination of random blocks and edges were used to test whether a broader inpainting training process would help the model to generalise further. However, neither of these strategies produced higher accuracies than the fixed or random edge gaps in isolation. This shows that for this problem, the best performance can be achieved by providing the model with training and testing gaps that are of the same general shape and location; however, some randomisation within these general parameters via the random edge gaps did provide the strongest RI and L2 metrics inside the edge regions for both gap sizes. In addition, it is expected that models trained on random edge gaps will be able to handle test cases with edge gaps at any orientation, unlike models trained on fixed gap positions. Due to this improved flexibility, combined with strong scores across all four metrics, the random edge gaps method was deemed to have the most practical utility among the data augmentation methods tested here. Therefore, the random edge gaps technique was used in the training pipeline to benchmark the other model configurations investigated in this work.
Loss functions
Use of the Huber, adversarial and gradient loss functions requires the tuning of parameters in order to determine suitable configurations. A grid search was performed in each case, following best practices laid out in previous studies.48,67 The Huber and gradient loss parameters were tuned in this way, with the best-performing values carried forward to the benchmark.
For the adversarial ratio, the tuning results for the 180 CAD test case are reported in the corresponding tables, and the selected value was likewise used for the benchmark results that follow.
Gradient loss lambda tuning results for the 180 CAD test case with 25% edge gaps. One result for each setup using permutation A is reported.
GPOD convergence
As the final step before each of the model configurations can be benchmarked against one another, the number of modes to be retained by the GPOD prediction is determined using a convergence criterion. Plots showing the GPOD convergence curves for 10% and 25% gaps in permutation A are shown in Figure 9. In both cases, the relative L2 error calculated in the convergence-checking (CC) gaps gives an optimal number of modes that is relatively close to the true optimum given by the L2 error in the true gaps. The numbers of modes retained in the final GPOD reconstructions benchmarked here were based on the true lowest L2 errors: 26 for the 10% gap size and 9 for 25%. The lower number of modes in the latter case indicates that the GPOD algorithm relies on a reconstruction that contains more general flow patterns in order to optimise the error across the larger gaps in the different snapshots.
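To make the procedure concrete, the sketch below shows a single-pass gappy POD fill and notes how convergence-checking gaps can drive the choice of mode count. It is an illustrative simplification that omits the adaptive median filtering and iterative refinement of the GPOD-MF algorithm used in the benchmark.

```python
import numpy as np

def gpod_fill(snapshots, gap_mask, n_modes):
    """Minimal gappy POD sketch: fill masked entries of a snapshot matrix.

    snapshots : (n_snapshots, n_pixels) array containing gap entries to fill
    gap_mask  : boolean array of the same shape, True where data is missing
    Single-pass illustration only, not a reproduction of GPOD-MF.
    """
    filled = snapshots.copy()
    known = ~gap_mask

    # Initialise gaps with the ensemble mean of the known values at each pixel.
    col_mean = np.where(known, filled, 0).sum(0) / np.maximum(known.sum(0), 1)
    filled[gap_mask] = np.broadcast_to(col_mean, filled.shape)[gap_mask]

    mean = filled.mean(axis=0)
    fluct = filled - mean
    # POD modes from an SVD of the (gap-initialised) fluctuation matrix.
    _, _, vt = np.linalg.svd(fluct, full_matrices=False)
    modes = vt[:n_modes]                          # (n_modes, n_pixels)

    for i in range(filled.shape[0]):
        k = known[i]
        # Least-squares POD coefficients using only the known pixels.
        coeffs, *_ = np.linalg.lstsq(modes[:, k].T, fluct[i, k], rcond=None)
        recon = mean + coeffs @ modes
        filled[i, ~k] = recon[~k]                 # update only the gap pixels
    return filled

# Mode-number selection via convergence-checking (CC) gaps: withhold a set of
# known pixels, fill with increasing n_modes and keep the value that minimises
# the relative L2 error at the withheld locations.
```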

GPOD convergence curves for 10% and 25% gaps (left and right respectively) at permutation A. The minimum errors for both the true L2 error in the edge gaps and the L2 error in the convergence checking (CC) gaps are marked as filled circles.
Main benchmark results
The results for the benchmark performance metrics are given in Tables 8 and 9, with the best result for each metric presented in bold. Loss curves for each model configuration are provided in Figure A1 in Supplemental Material. The UNet and UNETR models exhibit similar performances across all metrics, with the UNet models slightly outperforming UNETR for predictions inside the edge gaps. As shown in Table 2, the number of parameters in the UNet architecture is eight times smaller than that of the UNETR model, so the UNet exhibits a better accuracy-complexity trade-off. This indicates that detailed local features and textures may be more predictive of the target outputs than global context in this situation, which runs counter to where UNETR models typically see performance gains.43–45
Results for the 180 CAD test case at 10% gaps. The mean and standard deviations are reported from the three permutations of training data defined in Table 3.
Results for the 180 CAD test case at 25% gaps. The mean and standard deviations are reported from the three permutations of training data defined in Table 3.
The UNet variants each exhibit similar predictive performances in the edge gaps at the 10% gap size, although the gradient loss function demonstrates the best RI, MI and L2 metrics at 25% gaps. This is in line with the results of Chung et al. 67 who showed that the gradient loss provided persisting benefits for a super-resolution task of increasing difficulty from 8× to 32× magnification. On the other hand, in the present work, the UNet, gradient model yields higher KL divergences in the edges, especially at 25% gaps, as shown in Table 9. This shows that the gradient loss function emphasises local regions with large velocity gradients at the expense of the overall energy distribution in the flow. As with the investigation on data augmentation strategies and loss function parameters, this is another example of how the models seem to face something of a trade-off between accurate KL divergences, and RI and L2 errors.
The accuracy of all UNet-based models is very high in the image centres, with KL divergences that round to zero at a three decimal place tolerance, showing that the original flow structures across all scales are being well-preserved. Ensemble averaged energy spectra for the UNet, MSE model predictions at 10% gaps are shown in Figure 10 and there is a near line-on-line match between the true and predicted spectra in the image centres. Note that the energy spectra are challenging to compute in the gappy regions in isolation, as sharp edges and discontinuities are prevalent, contributing to the Gibbs phenomenon observed in the edge spectra in Figure 10. However, overall trends can still be seen and the UNet edge prediction follows a downward trend that is similar to the ground truth.

Energy spectra comparing the ground truth test set images to the UNet, MSE predictions at a 10% test gap size. Ensemble mean spectra are given by solid or dashed lines. Shaded areas represent one standard deviation from the mean.
Regarding the other metrics in the edge gaps, the L2 errors of both UNet and UNETR models are relatively high at between 45% and 47%. This is within the range of values reported by Li et al. 48 for large gap sizes, but about twice as high as other results reported by Morimoto et al. 35 for the reconstruction of a turbulent flow in a fixed gap shape. The reasoning behind this is explained in the Discussion section. For the RI and MI, values of between 0.90 and 0.95 are commonly taken to represent self-similarity between vector fields. 83 The average RIs for the UNet and UNETR predictions at 10% gap sizes approach this criterion in the edge gaps and meet it in the central regions. The MI values are systematically lower, which is consistent with other reports that the MI is a stricter metric to satisfy, as it follows a linear relationship rather than the sinusoidal relationship of the RI.83–85
Example flow field predictions from the UNet, MSE model at 10% gaps are shown in Figure 11. In the top row of the figure, the regions masked out by the horizontal mask are relatively uniform and easy to predict with no large variations in velocity magnitude. This allows the UNet to predict the flow inside the gaps to a fair degree of accuracy. On the other hand, for the bottom row, turbulent motion inside the edge gap regions is more complex, with the flow directions switching to point outwards just inside the edge gap regions. There are few obvious indicators for this motion in the centre of the image and the UNet struggles to fully predict this complexity. The scarcity of spatially local information due to the edge gaps highlights the challenge presented by this inpainting task; it is likely that more knowledge of the out-of-plane motion would be needed in order to predict such complex behaviour. To provide a clearer picture of the differences between these two flow fields, the point-wise L2 errors are shown in Figure 12. Plots showing example outputs from each of the models at 10% and 25% gap sizes are provided in Figures 13 and 14.

Sample flow fields from the 10% gaps test set. Top row: best UNet, MSE prediction (L2 = 0.225); bottom row: worst UNet, MSE prediction (L2 = 1.026). (a and d) Original snapshot with the test mask shown as horizontal red lines, (b and e) gappy input and (c and f) prediction.


Comparison of different model predictions for a single test snapshot at 10% gaps. Gappy images formed by removing data outside of the red lines in the original image are fed into the models.

Comparison of different model predictions for a single test snapshot at 25% gaps. Gappy images formed by removing data outside of the red lines in the original image are fed into the models.
The CE-GAN demonstrates relatively poorer performance across the board. Li et al. 48 also reported relatively low pixel-wise accuracies for the CE-GAN in an inpainting task on PIV data, but better performance than GPOD in terms of predicting multi-scale properties. These findings are supported in the present study; however, Tables 8 and 9 show that the CE-GAN results are worse than the UNet-based models across all metrics, especially in the central image regions. Finally, the GPOD-MF method yields the lowest performance, as GPOD-MF is sensitive to the limited amount of spatial information available near the gap regions. Difficulties arise because the algorithm initialises the gaps with ensemble mean vectors calculated from the training set, then iterates on these guesses using the dominant flow features from POD-based reconstructions. However, for highly variational flows, the mean can be a poor approximation of the full dataset. 84 GPOD-MF can typically overcome this by utilising the local spatial information to update the guesses, but this is not so effective for large block gaps and errors can be compounded instead. Figure 9 shows that the algorithm converges at a relatively small number of modes, representing the dominant flow structures. While these dominant structures do not fare as badly on the global vector-based metrics, they are overly smoothed and differ vastly in terms of smaller-scale flow structures, as shown by the large KL divergences between the GPOD-MF predictions and the true vectors in the edge gaps.
Discussion
These results have shown that UNet-based models are capable of extrapolating beyond the field of view by reconstructing the flow inside edge gaps to a reasonable degree of accuracy, significantly out-performing GPOD. At the 10% gap size, all three UNets achieved an RI of at least 0.89 on average for the unseen crank angle, showing that vector alignments can be well-predicted in general. However, the 25% gap size presents a much harder challenge, with RIs falling to ∼0.82. The raw metrics at 10% and 25% gap sizes might not seem too disparate at first, but a visual inspection of Figures 13 and 14 reveals that the differences between these scores have significant consequences for the predicted flow fields. While the predicted flow fields at the 10% gap size appear to be reasonable in general, the predictions at the 25% gap size are unreliable, with large inaccuracies in the predicted flow motion.
This poses an interesting question as to what level of accuracy should be expected from an inpainting model and whether it is possible to accurately predict the flow inside edge gaps as large as 25% of the total pixels. The process of inpainting in this case relies on there being a strong correlation between the flow at the centre of the image and the flow inside the edge gaps. For the larger gap size, it is less likely for such a correlation to exist, as the distance between the outer gap regions and the nearest data-containing pixel is increased. If there is no strong correlation between these different regions of the flow, then this becomes an ill-posed problem, with many possible options for the flow behaviour inside the edge gaps. 79 Future work should investigate how the internal correlations within the flow relate to an inpainting model’s performance, to further inform what is and is not possible within this task.
Comparisons between the neural network models here show that UNet-based models exhibit similar performances, while the CE-GAN accuracies are markedly worse. In particular, the low accuracies inside the central regions indicate that the CE-GAN is not retaining as much of the information in the image centres as the UNet-based models are. Indeed, although both models incorporate autoencoder-like structures, the CE-GAN generator has an AlexNet-like architecture, 64 whereas UNet-based models utilise skip connections that are designed to preserve contextual information at each stage of the autoencoder. This allows the UNets to simultaneously preserve information in the image centre to a very high degree of accuracy and yield gap predictions that seamlessly integrate with the rest of the image. This has also correlated with better edge gap predictions for the UNet-based models in this case. It should be noted that this particular application pushes the CE-GAN beyond its initial design intention of solely predicting inside the gap region, rather than also reproducing the entire image. The other key difference between the UNets and the CE-GAN is the adversarial training dynamic employed by the latter, which, if unstable, could also contribute to lower performances.
A note is needed regarding the relatively high L2 errors, of between 45% and 47% for the UNet-based models. It is hypothesised that the main reason for the higher L2 errors reported in this benchmark is the significant out-of-plane motion present in the TCC-III engine, which is not accounted for currently. As shown in the bottom row of Figure 11, the flow at the image centres reveals few indicators of the sudden change in vector directions inside the edge gaps. However, only two velocity components along a single PIV plane are observed here. It is possible that accounting for out-of-plane motion by gaining access to the third velocity component using techniques such as tomographic PIV 86 or assimilation with CFD data 87 will be required to significantly reduce the L2 errors in this situation and such investigations will constitute future work.
With the present results as they are, it is recommended that UNet-based models be used to reconstruct the flow in large block gaps with as many as 10% of the total number of pixels missing, to a reasonable degree of accuracy. Such reconstructed flow fields could be used to improve understanding of general flow patterns and replace ensemble mean-filled or interpolated flow fields as inputs into other data analysis methods like modal decomposition. However, despite the well-predicted vector alignments, care should be taken when using the vector magnitudes from the predicted flow patterns, as these were found to under-predict the ground-truth values. Note that the performance of the UNets is expected to improve for easier inpainting tasks such as smaller and more centrally-located blocks of missing data, in which case the prediction of these vector magnitudes would likely improve. Testing the sensitivity of inpainting models to a range of gap types is also recommended for future work. Finally, reliably inpainting edge gaps for the 25% gap size appears to be out of reach at present and, as previously discussed, an investigation into how feasible it would be for any model to accurately reconstruct the flow in such scenarios is recommended future work.
Conclusion
This work has introduced the EngineBench database and used it to establish the first inpainting benchmark for an industrially-relevant turbulent flow, in order to address the limited availability of practical benchmarks on experimental data. The models were tasked with inpainting large edge gaps, a highly challenging problem that pushes the models to the limits of their capabilities. This benchmark was used to provide objective insight into how a range of widely-used models behave in these challenging conditions, identify success and failure modes and provide recommendations for future work.
Firstly, a number of data augmentation strategies that introduce artificial gaps of different forms into the data were tested to find the optimum strategy for inpainting edge gaps. A novel strategy, named random edge gaps, was created to introduce edge gaps of random sizes, locations and orientations into the training data. Although fixed gap training yielded the best results in terms of MI and KL divergence, random edge gap training was shown to result in the most accurate predictions of vector orientations and pixel-wise errors due to the improved generalisability. Random edge gap training is therefore recommended for the creation of flexible and generalisable inpainting models in this case.
Overall, UNet-based models demonstrated the best general performance across the four metrics and two gap sizes tested. Indeed, the UNet-based model predictions in the edge gaps approached self-similarity at the 10% gap size according to the vector-based metrics. This suggests that even gaps as challenging as large edge gaps can be reconstructed to a reasonable degree of accuracy with the use of a UNet, which exhibits significant performance improvements over the GPOD method for this type of gap. However, pixel-wise L2 errors remained relatively high for all model predictions. A visual inspection of the reconstructed flow fields revealed that sudden changes in the flow direction without any obvious indicators in the rest of the flow was a cause of lower reconstruction accuracies. It is therefore hypothesised that acquiring information on the out-of-plane motion, such as through stereo-PIV or data assimilation with CFD, would be needed in order to rectify this issue.
A comparison between the UNet-based models and the CE-GAN showed that the former were capable of preserving the central flow information provided at the input to a very high degree of accuracy and then leveraging this information to produce better predictions inside the edge gaps. This suggests that skip-connections, contained within UNet-based models but not the CE-GAN generator, are important architectural components that facilitate high accuracies for neural networks trained to reconstruct turbulent flow data. This characteristic should be considered in future model development.
In summary, the main practical findings arising from this article are that deep learning methods can outperform GPOD for challenging flow reconstruction tasks; the performance of UNet-based models for up to 10% relative gap sizes is likely sufficient for informing general flow patterns and contributing to data analysis approaches such as modal decomposition methods, but not predicting detailed physical phenomena; the random edge gaps data augmentation technique is effective in helping the deep learning models to generalise and reach higher performances. Recommended future work consists of investigating methods of incorporating information on the out-of-plane motion into the inpainting task, exploring how the spatial correlations within the flow impact expected inpainting performance and testing inpainting models on a variety of different gap types including random noise and other forms of block gaps.
