Abstract
1. Introduction
Record linkage is the process of joining records in different data sources that belong to the same population unit. A record pair, namely a record from each file, is a match if the records in the pair belong to the same population unit and is a non-match if they belong to different population units. Often in practice, match status is unknown because a unique identifier for all population units is not available. Instead, available linking variables (e.g., date of birth, address, name) may provide only partially identifying information about a unit. A record linkage process aims to identify all matches from all record pairs using available linking variables.
An early probabilistic record linkage model was developed by Fellegi and Sunter (1969) and discussed in detail by Scheuren and Winkler (1993). Under the model, an agreement pattern is observed for each record pair across all linking variables (e.g., a pair may agree on name but disagree on age). Given the observed agreements, record pairs are ranked according to a so-called “match weight.” Depending on the value of the match weight, a linking algorithm classifies a record pair as a match, a non-match, or a potential match. Potential matches can be resolved through a clerical review process to determine match status. Fellegi and Sunter rely on a latent class model with two classes: a match class and a non-match class. Their model allows for the calculation of the probability that a record pair is a match (see Chipperfield and Chambers 2015; Chipperfield et al. 2018; Lahiri and Larsen 2005). The Fellegi-Sunter model has been improved in subsequent studies, for example, by DuVall et al. (2010), Sadinle and Fienberg (2013), Vo et al. (2021), and Moretti and Shlomo (2023). Other models have been proposed to address limitations of the Fellegi-Sunter model, for example, by Lee et al. (2022), Tancredi and Liseo (2011), and Goldstein et al. (2017).
A deterministic linkage algorithm, in its pure form, identifies matches using a high-quality identifier, such as a driver’s license number. A record pair is linked if the two records agree on the identifier. When such a high-quality identifier is not available, a new identifier, called a linking key, can be created by concatenating linking variables. A traditional linkage algorithm links a record pair if the pair agrees on the value of the linking key and the value of the linking key is unique on each file. The relative performance of deterministic and probabilistic linking was studied by Clark and Hahn (1995) in the context of the US statewide trauma registry. A study by Zhu et al. (2015) showed that probabilistic linkage uniformly outperformed deterministic linkage across all simulation scenarios. However, deterministic linkage requires less computation and is faster than probabilistic linkage. Moreover, deterministic linkage can produce very high-quality links if discriminating and accurately reported linking variables exist (Gomatam et al. 2002). Doidge and Harron (2018) highlighted that the distinction between probabilistic linking and deterministic linking can be very small if linkage procedures are designed well. In practice, a deterministic linkage process often proceeds in multiple passes, where each pass applies a specific linking key to identify matching record pairs.
Precision is the proportion of correct links created through a linkage algorithm. Estimating precision is of practical importance for the following reasons (Chipperfield et al. 2018): (1) Precision estimates can help data linkers to decide if two files should be linked at all; (2) Precision can be used to choose the set of linking variables that are used for the linking process, and help to refine how linkage is carried out; (3) Precision can be used to reduce bias of various parameter estimates such as coefficients of a regression model (Chambers 2009; Chipperfield 2019; Kim and Chambers 2010). In the context of multi-pass linkages, a pass precision is defined as the proportion of correct links formed within a particular pass.
Chipperfield et al. (2018) proposed estimating expected pass precision using a bootstrap approach. Their approach simulates plausible agreement patterns for all record pairs and then performs linkages using the simulated agreement patterns to estimate pass precision. It assumes that a record in the first file can be falsely linked to any non-matching record in the second file with the same probability, which does not hold in general. In this paper, we propose two new methods for estimating precision in deterministic record linkage, both of which are motivated by the Fellegi and Sunter (1969) model. There are several key differences between our proposed method and the bootstrap approach proposed by Chipperfield et al. (2018): (1) Alternative Simulation Approach: Instead of simulating plausible agreement patterns, our method simulates plausible record values for the second file based on the observed record values of the first file. This provides a more realistic replication of how linking variables are distributed. (2) Relaxed Assumption on False Linkage Probability: Chipperfield et al. (2018) assume that any non-matching record in the second file can be linked with equal probability to a record in the first file. Our method does not impose this assumption, making it more flexible for real-world applications. (3) Robustness to Skewed Linking Variables: Our method demonstrates greater robustness to skewed distributions of linking variables.
This paper is organized as follows: Section 2 introduces the workflow of deterministic linkage and defines the associated measure of precision. Section 3 introduces and discusses our methods for precision estimation. Section 4 illustrates the performance of our estimators through simulation. Section 5 concludes the paper.
2. Workflow of Deterministic Linkage with Multi-Passes
This section introduces the workflow of a deterministic linkage algorithm with multiple passes. We then describe the linkage process of D-MAC, a multi-pass deterministic linkage method used at the Australian Bureau of Statistics.
2.1. Linking Key and Uniqueness Ratio
A linking key is formed by concatenating a set of linking variables, such as name, sex, and postcode. The Uniqueness Ratio (UR) of a linking key is defined as:
It is important to clarify that the UR for each linking key is calculated independently. This ensures that the UR reflects the discriminative power of the linking key itself and is not influenced by the order in which linking keys are used in the linkage process. The ranking of linking keys based on UR is performed before the linkage process begins. A high UR suggests that the linking key is highly discriminative—that is, agreement on this key is more likely to indicate a true match rather than a non-match. Conversely, a low UR suggests that the linking key may lead to a significant number of incorrect links.
2.2. Multi-Passes and Pass Precision
In practice, using one linking key to link two files often results in a large number of matching record pairs being left unlinked. To address this issue, a multi-pass linkage algorithm can be used. A multi-pass linkage algorithm typically involves: (1) ranking linking keys using some metric, such as the UR; (2) linking records using a key, excluding records that were linked in a previous (or higher ranked) pass.
If two files are linked using a single linking key, a deterministic linkage algorithm assigns a match to a record pair if the following two conditions hold:
(a) The record pair agrees on the linking key.
(b) The value of the linking key is unique within each file.
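Conditions (a) and (b) can be illustrated with a minimal Python sketch. The function name `link_single_key`, the record layout (one dictionary per record) and the toy records are hypothetical choices made for illustration; they are not part of D-MAC.

```python
from collections import Counter


def link_single_key(file_a, file_b, key_vars):
    """Link records that agree on a linking key whose value is unique in each file.

    file_a, file_b: lists of dicts, one dict per record (hypothetical layout).
    key_vars: names of the linking variables concatenated into the linking key.
    Returns a list of (index_in_a, index_in_b) links.
    """
    def make_key(rec):
        # The linking key is the concatenation (here, a tuple) of the variables.
        return tuple(rec[v] for v in key_vars)

    keys_a = [make_key(r) for r in file_a]
    keys_b = [make_key(r) for r in file_b]
    count_a, count_b = Counter(keys_a), Counter(keys_b)

    # Condition (b) for FileB: keep only key values that occur exactly once.
    unique_b = {k: j for j, k in enumerate(keys_b) if count_b[k] == 1}

    links = []
    for i, k in enumerate(keys_a):
        # Condition (b) for FileA, and condition (a): the same key value in FileB.
        if count_a[k] == 1 and k in unique_b:
            links.append((i, unique_b[k]))
    return links


file_a = [
    {"name": "ann", "sex": "F", "postcode": "2600"},
    {"name": "bob", "sex": "M", "postcode": "2601"},
    {"name": "bob", "sex": "M", "postcode": "2601"},  # key value duplicated in FileA
    {"name": "cat", "sex": "F", "postcode": "2602"},  # no agreeing record in FileB
]
file_b = [
    {"name": "ann", "sex": "F", "postcode": "2600"},
    {"name": "bob", "sex": "M", "postcode": "2601"},
    {"name": "dan", "sex": "M", "postcode": "2603"},
]
links = link_single_key(file_a, file_b, ("name", "sex", "postcode"))
```

In this toy example only the first record of each file is linked: the key value for "bob" is not unique in FileA, so condition (b) fails even though the pair agrees on the key.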
When multiple passes are used, some algorithms, such as D-MAC, introduce a third condition for establishing links:
(c) The linking key is the best pass for both records in the pair.
The best pass for a record means that the record does not agree with any other records in an earlier pass. The concept of “best pass” assumes a linkage key is less discriminative or “less trustworthy” than the keys used in earlier passes. This reflects a natural trade-off in the design of deterministic linkage algorithms, where the most accurate keys are applied first to minimize false positives. Let
2.3. D-MAC Linkage Process
In this subsection we introduce the D-MAC linkage process. D-MAC is a SAS macro for deterministic linkage. It offers several useful features, such as support for different comparators. For instance, when exact comparators are used, two records must match exactly on the linking key for a link to be established. When approximate comparators are used, some discrepancies are allowed.
Suppose D-MAC uses
D-MAC ranks the linking keys in descending order of their URs.
In Pass 1, D-MAC links a record pair if (a) and (b) are satisfied.
Records not linked in Pass 1 will be assessed in Pass 2. D-MAC links a record pair in Pass 2 if (a), (b), and (c) are satisfied.
Records not linked in Pass 1 or 2 may be linked in Pass 3. D-MAC links a record pair in Pass 3 if (a), (b), and (c) are satisfied. The same process is repeated for all subsequent passes (i.e., Pass 4, …, Pass
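The pass logic described in this subsection can be sketched in Python as follows. This is an illustration under stated assumptions, not the D-MAC implementation: it assumes that uniqueness in condition (b) is checked among records not yet linked, and it reads condition (c) as "the record does not agree with any record in the other file on the key of an earlier pass". The function name `multi_pass_link` and the toy data are hypothetical.

```python
from collections import Counter


def multi_pass_link(file_a, file_b, pass_keys):
    """Return {pass_number: [(i, j), ...]} for an ordered list of linking keys.

    Assumptions of this sketch (not taken from D-MAC itself):
      * condition (b) is checked among records not linked in an earlier pass;
      * condition (c) fails for a record if it agrees with any record in the
        other file on the key of an earlier pass.
    """
    def key_of(rec, key_vars):
        return tuple(rec[v] for v in key_vars)

    linked_a, linked_b = set(), set()
    links_by_pass = {}

    for p, key_vars in enumerate(pass_keys, start=1):
        def is_best_pass(rec, other_file):
            # Condition (c): no agreement on any earlier key (none exist in Pass 1).
            for earlier in pass_keys[:p - 1]:
                k = key_of(rec, earlier)
                if any(key_of(o, earlier) == k for o in other_file):
                    return False
            return True

        # Keys of the records still unlinked at the start of this pass.
        keys_a = {i: key_of(r, key_vars)
                  for i, r in enumerate(file_a) if i not in linked_a}
        keys_b = {j: key_of(r, key_vars)
                  for j, r in enumerate(file_b) if j not in linked_b}
        count_a, count_b = Counter(keys_a.values()), Counter(keys_b.values())
        unique_b = {k: j for j, k in keys_b.items() if count_b[k] == 1}

        new_links = []
        for i, k in keys_a.items():
            if count_a[k] != 1 or k not in unique_b:
                continue  # conditions (a) or (b) fail
            j = unique_b[k]
            if is_best_pass(file_a[i], file_b) and is_best_pass(file_b[j], file_a):
                new_links.append((i, j))

        for i, j in new_links:
            linked_a.add(i)
            linked_b.add(j)
        links_by_pass[p] = new_links
    return links_by_pass


file_a = [
    {"name": "ann", "yob": 1980, "postcode": "2600"},
    {"name": "bob", "yob": 1975, "postcode": "2601"},
    {"name": "bob", "yob": 1975, "postcode": "2602"},
    {"name": "cat", "yob": 1990, "postcode": "2603"},
]
file_b = [
    {"name": "ann", "yob": 1980, "postcode": "2600"},
    {"name": "bob", "yob": 1975, "postcode": "2601"},
    {"name": "cat", "yob": 1991, "postcode": "2603"},
]
result = multi_pass_link(file_a, file_b, [("name", "yob"), ("name", "postcode")])
```

In this toy example, Pass 1 links only the "ann" records. In Pass 2 the "cat" records are linked, while the "bob" pair that agrees on name and postcode is not: both records agree with another record on the Pass 1 key, so Pass 2 is not their best pass under the reading of condition (c) assumed here.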
3. Pass Precision Estimation
In this section, we introduce two methods for estimating pass precision in multi-pass deterministic record linkage. Although we illustrate these methods using D-MAC, the approach is applicable to any deterministic linkage framework where records are linked in multiple passes based on predefined rules. For the first method, we assume all linking variables are uniformly distributed and show that a pass precision estimator can be derived analytically. For the second method, we introduce a replication approach for estimating pass precision. The replication approach requires a match set but makes no assumption about the distributions of the linking variables. A match set is a sample of record pairs with known match status and is used for estimating key parameters of a linkage process.
We suppose FileA contains
3.1. Method 1: An Analytic Pass Precision Estimator
For a particular linking key, we let
We assume all linking variables are uniformly distributed. While the assumption of uniformity is not universally valid for all linking variables, certain variables such as Sex, Year of Birth, and State may exhibit distributions that are close to uniform. For a linking key assigned to Pass 1, a record pair is linked if the records agree on the linking key and the value of the linking key is unique on each file. Denote the precision of the linking key if it is assigned to the first pass, or First Pass Precision, as
It can be shown that,
where
To estimate the pass precision of Pass
where
We note that Equation (1) can serve as an alternative criterion for ranking and selecting linking keys. Future research could compare the effectiveness of key ranking using Equation (1) against using the UR.
3.2. Method 2: Replication Approach
In this section we introduce a replication approach for estimating pass precision. Chipperfield and Chambers (2015) introduced a bootstrap approach for estimating precision of probabilistically linked data (corresponding to the
Our replication approach follows a similar idea. However, we replicate plausible linking variable values of the match set rather than the agreement patterns. In addition, we do not assume
Unlike their bootstrap approach, in which the replication process is repeated multiple times and the average proportion of correct links is used as a precision estimate, our approach replicates the whole process only once. This is because the files in a typical linkage project can be very large, and it can be impractical to repeat the replication process multiple times if the estimates need to be produced in a timely manner. The pass precision estimate can have a low standard error if the number of links made in the pass is large.
Suppose we want to estimate the pass precision of linking FileA and FileB with
There are
The matching status between FileA and SimB is known.
The distribution of agreement patterns for record pairs between FileA and SimB and the distribution of agreement patterns for record pairs between FileA and FileB follow the same latent model.
If SimB can be replicated, then we can link FileA and SimB using D-MAC and obtain the pass precision of each pass. This pass precision can be used to estimate the pass precision of linking FileA and FileB. As the agreement patterns for (FileA, FileB) and (FileA, SimB) follow the same latent model, linking (FileA, FileB) and (FileA, SimB) should lead to the same expected pass precision.
Assuming all the parameters of the latent model and the underlying distributions of linking variables are known, the replication approach works as follows:
Step 1: We randomly choose
Step 2: Simulate SimB using FileA. Denote records in SimB as
where
Step 3: For
Step 4: Link FileA and SimB using D-MAC. The proportion of correct links in each pass is the pass precision estimate of linking FileA and FileB.
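Step 4 reduces to counting correct links within each pass, which is possible because the match status between FileA and SimB is known. The following minimal Python sketch illustrates that calculation. The data structures (`links_by_pass`, `true_match_of_a`) and the toy values are hypothetical, and returning `None` for a pass with no links is a choice made for this sketch.

```python
def pass_precision(links_by_pass, true_match_of_a):
    """Proportion of correct links within each pass.

    links_by_pass: {pass_number: [(i, j), ...]}, links between FileA record i
        and SimB record j, grouped by the pass in which they were made.
    true_match_of_a: {i: j}, the known true match j in SimB of FileA record i.
    """
    precision = {}
    for p, links in links_by_pass.items():
        if not links:
            precision[p] = None  # no links in this pass, precision is undefined
            continue
        correct = sum(1 for i, j in links if true_match_of_a.get(i) == j)
        precision[p] = correct / len(links)
    return precision


# Made-up links and truth, for illustration only.
links_by_pass = {1: [(0, 0), (1, 1), (2, 5)], 2: [(3, 3), (4, 6)], 3: []}
true_match_of_a = {0: 0, 1: 1, 2: 2, 3: 3}
estimates = pass_precision(links_by_pass, true_match_of_a)
```

Here two of the three Pass 1 links and one of the two Pass 2 links are correct, so the estimates are 2/3 and 1/2. The link (4, 6) counts as incorrect because FileA record 4 has no true match in the toy truth table.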
It is easy to see that the parameters
The conditional probabilities
We note that the replication pass precision estimator is applicable beyond D-MAC. It should be applicable to other deterministic linkage processes where:
The deterministic linkage process contains multiple passes.
Training data are available to estimate match and non-match probabilities.
Agreement probabilities are estimated reliably.
3.3. Reliability Discussion
Our approach does not aim to directly recover the exact values of linking variables from agreement patterns. Instead, we estimate the probabilities of agreement for match pairs
Accuracy of
Replication Consistency: When simulating plausible linking variable values, we assume that the simulated value of a linking variable is only dependent on the corresponding value of the match record. This assumption preserves the
Reliability of the Approach: We conducted sensitivity tests by varying the assumed linking variable distributions and examined their impact on pass precision estimates. Our findings show that:
As long as
Small deviations in assumed variable distributions do not significantly affect the estimated precision.
In cases where
This analysis confirms that our approach does not rely on the exact reconstruction of variable values, but rather on the accuracy of agreement probabilities. Future research could explore adaptive estimation techniques to further refine the transition from agreement patterns to variable values. In addition, given the increasing reliance on automated linkage methods in practice, evaluating the performance of our estimator in situations where manual clerical review is limited would be a valuable extension.
3.4. Considerations for Training Set Size
The match set is an essential component of our replication approach, as it provides the empirical basis for estimating the key linkage parameters
3.4.1. Determining the Match Set Size
The size of the match set should be large enough to capture the variability in agreement patterns while remaining computationally feasible. The required size depends on two primary factors:
Dataset Size: In large datasets (e.g., millions of records), a relatively small proportion of the dataset may be sufficient for accurate parameter estimation. Alternatively the
Complexity of Linking Variables: If the space of possible linking variable values is large (e.g., categorical variables with high cardinality, such as names), a larger match set may be necessary to capture sufficient diversity in agreement patterns.
3.4.2. Efficiency of Random Sampling for Match Set Construction
In our study, the match set is constructed using a random sample of match record pairs between FileA and FileB. The match record pairs are identified by clerical review. Alternative designs may improve efficiency. For example:
Stratified Sampling: Ensuring proportional representation of different linking variable values in the match set.
Adaptive Sampling: Prioritizing uncertain cases for clerical review to optimize match set quality.
We acknowledge that exploring these designs in simulation studies would be valuable and suggest this as an area for future research.
3.4.3. Match Set Size Requirements for Direct Pass Precision Estimation
An alternative approach to estimating pass precision would be to derive it directly from the match set rather than estimating
The match set would need to contain a sufficiently large number of matches covering all agreement patterns across multiple linkage passes.
Many record pairs that contribute to pass precision estimates are not matches, meaning that a larger sample would be needed to achieve stable estimates.
By contrast, our approach leverages a smaller match set to estimate key parameters, making it computationally feasible while maintaining accuracy.
4. Simulation
In this section, we illustrate the performance of our pass precision estimators using synthetic datasets. We designed a set of simulation scenarios to assess the performance of our estimators under different distributions of linking variables. The performance of the analytic estimator was found to be highly sensitive to the distribution of linking variables. In particular, the degree of bias increased with the level of skewness in the linking variable distributions. The replication approach demonstrated robustness across different levels of skewness.
Suppose we were linking two files, FileA and FileB, and each file contained five linking variables
For instance, for Country of Birth, the value Australia would occur with much higher frequency than other values. To model this, we employed a truncated normal distribution to generate continuous values, which were rounded and mapped to categorical labels. This approach allows us to control the skewness of categorical distributions in a flexible, parameterized way, rather than assuming a uniform or predefined discrete probability distribution. While an alternative method would be to define a discrete probability distribution explicitly, our method provides a systematic way to adjust category probabilities while preserving a continuous-to-discrete transformation that captures real-world distributions.
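The continuous-to-discrete construction described above can be sketched with NumPy. This is an illustration only: the function name `skewed_categories`, the mean and standard deviation, the number of categories, the labels, and the use of rejection sampling to truncate the normal distribution are all assumptions of this sketch rather than the settings used in the simulation.

```python
import numpy as np


def skewed_categories(n, n_levels, mean, sd, rng):
    """Draw n category codes in {0, ..., n_levels - 1} from a rounded truncated normal.

    The normal distribution is truncated to (-0.5, n_levels - 0.5) by rejection
    sampling, so that rounding to the nearest integer always gives a valid code.
    """
    lo, hi = -0.5, n_levels - 0.5
    kept = np.empty(0)
    while kept.size < n:
        draw = rng.normal(mean, sd, size=2 * n)
        kept = np.concatenate([kept, draw[(draw > lo) & (draw < hi)]])
    return np.rint(kept[:n]).astype(int)


rng = np.random.default_rng(0)
# Hypothetical labels: code 0 plays the role of the dominant value.
labels = ["Australia"] + [f"Country {k}" for k in range(1, 10)]
codes = skewed_categories(10_000, n_levels=len(labels), mean=0.0, sd=1.5, rng=rng)
values = [labels[c] for c in codes]
counts = np.bincount(codes, minlength=len(labels))
```

Placing the mean at the first category and using a small standard deviation concentrates probability on that category. Moving the mean towards the centre of the range, or increasing the standard deviation, makes the distribution closer to uniform, which is how the skewness is controlled through two parameters.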
Let the number of possible values for
We simulate different linkage scenarios with different
4.1. The Simulation Process
We simulated the linking variable values of FileA and FileB in the following way:
We first simulated linking variable values for FileA. Let
Let
We considered the following six linkage scenarios:
Scenario 1:
Scenario 2:
Scenario 3:
Scenario 4:
Scenario 5:
Scenario 6:
For linkage scenarios 1 and 2, all linking variables are uniformly or nearly uniformly distributed. For linkage scenarios 3 and 4,
Linking key of Pass 1:
Linking key of Pass 2:
Linking key of Pass 3:
Linking key of Pass 4:
Linking key of Pass 5:
For each linkage scenario we repeated the simulation process 500 times. In each iteration we simulated FileA and FileB and then linked them using D-MAC. We used exact comparators to define agreement for all linking variables, that is, two values must match exactly to be an agreement. As the match status between FileA and FileB was known, we calculated the true pass precision of each pass. Then we estimated the pass precision using the two estimators. We recorded the true pass precision and the pass precision estimates. We calculated three averages: the average of the true pass precision, the average of the pass precision estimates using the analytic estimator (Equations (1) and (2)), and the average of pass precision estimates using the replication approach.
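The summaries for one pass and one estimator can be computed as in the following Python sketch, which also includes the RMSE reported in Section 4.2. It assumes that the RMSE is taken over iterations between each estimate and the true pass precision of the same iteration; the function name `summarise` and the numbers are made up for illustration.

```python
import math


def summarise(true_values, estimates):
    """Average true precision, average estimate, and RMSE over iterations.

    true_values[k] and estimates[k] are the true and estimated pass precision
    of one pass in iteration k of the simulation.
    """
    n = len(true_values)
    mean_true = sum(true_values) / n
    mean_est = sum(estimates) / n
    # Assumption: squared errors pair each estimate with the truth of its iteration.
    rmse = math.sqrt(
        sum((e - t) ** 2 for t, e in zip(true_values, estimates)) / n
    )
    return mean_true, mean_est, rmse


# Three made-up iterations for a single pass.
true_values = [0.98, 0.97, 0.99]
estimates = [0.97, 0.98, 0.99]
mean_true, mean_est, rmse = summarise(true_values, estimates)
```

In this toy example the two averages coincide even though individual estimates differ from the truth, which is why the RMSE is a useful complement to the averages.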
4.2. Simulation Results
The results are presented in Figures 1 and 2, while Tables 1 and 2 summarize the Root Mean Squared Errors (RMSE) for each pass, where RMSE1 refers to the analytic estimator and RMSE2 to the replication estimator.

Figure 1. Simulation results for linkage scenarios 1 to 3, comparing the true pass precision with the precision estimates from the analytic formula and the replication approach.

Figure 2. Simulation results for linkage scenarios 4 to 6, comparing the true pass precision with the precision estimates from the analytic formula and the replication approach.
Table 1. RMSEs of the Pass Precision Estimators for Linkage Scenarios 1 to 3.
Table 2. RMSEs of the Pass Precision Estimators for Linkage Scenarios 4 to 6.
The performance of the analytic estimator was highly dependent on the distribution of linking variables. When linking variables were uniformly distributed (Scenarios 1 and 2), the analytic estimator provided accurate estimates of pass precision in the first three passes. In Scenario 1, for instance, the estimated pass precision for Pass 1 was 0.981, closely matching the true value of 0.983. However, in later passes, the accuracy of the analytic estimator deteriorated. In Pass 5, the true precision was 0.632, but the analytic estimate was 0.488. This underestimation suggests that the assumptions underlying the analytic estimator become less valid in later passes.
When linking variables were not uniformly distributed (Scenarios 3–6), the analytic estimator exhibited significant biases. For example, in Scenario 3, the true pass precision for Pass 4 was 0.812, while the analytic estimate was 0.659. The degree of bias increased with the level of skewness in linking variable distributions, highlighting the limitations of the analytic estimator under more realistic data conditions.
In contrast, the replication estimator provided consistently accurate estimates across all scenarios and passes. Unlike the analytic estimator, its performance remained stable even in later passes, where precision estimation is typically more challenging. For instance, in Scenario 5, where linking variables were highly skewed, the estimated precisions for Passes 1 to 5 using the replication approach were very close to the true values, demonstrating that the replication approach remains robust even when agreement probabilities vary significantly across linking variables.
An analysis of Root Mean Squared Errors (RMSE) further supports the superiority of the replication approach. The variation of the replication estimator has two sources: (i) the variability due to using a single replication, assuming the linkage parameters
Examining the trends across passes, Pass 1 consistently exhibited the highest precision across all scenarios, with values exceeding 0.95 in most cases. As expected, pass precision declined in later passes as the linking keys became weaker and agreement on these keys became less reliable. In particular, the impact of skewed linking variables on precision decline was evident in Scenario 6, where precision for Pass 5 dropped to 0.512, compared to 0.682 in Scenario 1. This suggests that the effectiveness of a given linking strategy is strongly influenced by the distribution of linking variable values.
Overall, the results demonstrate that while the analytic estimator is useful in early passes when linking variables are uniformly distributed, it becomes unreliable in later passes or when linking variables exhibit skewed distributions. In contrast, the replication approach consistently provides accurate pass precision estimates across all passes and scenarios, making it a more robust alternative for real-world deterministic linkage applications.
We note that this simulation considered only exact comparators. Using approximate comparators can increase the RMSEs of both estimators, although we believe the replication method would still perform reasonably well. It should also be noted that using a single replicate may yield higher variability in the precision estimates than averaging over multiple replicates. However, the trade-off between computational cost and accuracy makes a single replicate a practical choice for this study.
While our simulations focused on datasets with approximately 2,000 records per file, the D-MAC algorithm is designed to efficiently scale to larger datasets containing hundreds of thousands or millions of records. Its deterministic nature reduces the need for computationally expensive iterative processes inherent in probabilistic approaches, making it a practical choice for large-scale linkage projects. The replication estimator’s uncertainty increases as the number of passes grows because later passes contain fewer links, amplifying variance. Similarly, for larger datasets, the estimator remains stable if match-set estimation is accurate, but computational feasibility may require adjustments such as stratified match-set selection. Future work could extend our analysis to include simulations on larger datasets to empirically validate the scalability of D-MAC.
5. Conclusion and Future Work
Multi-pass deterministic linkage is commonly used by statistical agencies to improve linkage coverage when a single linking key is insufficient. To estimate the precision of links made in each pass, we introduced two estimators. The first, an analytic estimator, assumes that all linking variables are uniformly distributed. The second works by replicating the entire linkage process, including the values of linking variables on one of the files. The simulation showed that the replication estimator worked well for all passes in all linkage scenarios, whereas the analytic estimator worked well mainly in early passes when the linking variables were uniformly distributed.
An important direction for future work is to evaluate the performance of our estimators on real-world datasets. While the current study is based on simulations designed to reflect realistic linkage conditions, applying the estimators to D-MAC linkages of real data would provide more insight into their practical utility. In addition, a formal cost-effectiveness comparison between our method and traditional probabilistic linkage approaches with clerical review would be valuable, particularly given that our method relies less on manual review resources and supports more efficient estimation of precision.
Future work could also investigate the impact of multiple replicates on the root mean squared error (RMSE) of the precision estimates, which would provide a more comprehensive understanding of the trade-offs between computational cost and estimator accuracy.
Acknowledgements
The authors would like to thank the associate editor and the anonymous referees for their valuable comments on an earlier version of this paper. We also wish to thank Daniel Elazar, Aymon Wuolanne, Luke Hendrickson, and Anders Holmberg for their helpful feedback, and Noel Hansen for his support with D-MAC.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Disclaimer
Views expressed in this paper are those of the authors and do not necessarily represent those of the Australian Bureau of Statistics. Where quoted or used, they should be attributed clearly to the author.
Received: May 31, 2024
Accepted: September 15, 2025
