
Abstract

Deterministic record linkage is widely used when unique identifiers are unavailable. Multi-pass deterministic linkage increases the link rate, but evaluating the precision of each pass (that is, the proportion of correct links) is crucial for guiding linkage design. This paper proposes two methods for estimating pass-level precision. The first is an analytic estimator assuming uniformly distributed linking variables; the second is a replication-based estimator that avoids strong distributional assumptions. We demonstrate both methods through simulations using D-MAC, a deterministic linkage macro developed at the Australian Bureau of Statistics. Results show that while the analytic estimator performs well under uniformity, the replication approach provides robust estimates across a range of scenarios, including those where linking variables have skewed distributions.

Keywords

record linkage, data integration, deterministic linkage, linkage quality assessment, bootstrapping

1. Introduction

Record linkage is the process of joining records in different data sources that belong to the same population unit. A record pair, namely a record from each file, is a match if the records in the pair belong to the same population unit and is a non-match if they belong to different population units. Often in practice, match status is unknown because a unique identifier for all population units is not available. Instead, available linking variables (e.g., date of birth, address, name) may provide only partially identifying information about a unit. A record linkage process aims to identify all matches from all record pairs using available linking variables.

An early probabilistic record linkage model was developed by Fellegi and Sunter (1969) and discussed in detail by Scheuren and Winkler (1993). Under the model, an agreement pattern is observed for each record pair across all linking variables (e.g., a pair may agree on name but disagree on age). Given the observed agreements, record pairs are ranked according to a so-called “match weight.” Depending on the value of the match weight, a linking algorithm classifies a record pair as a match, a non-match, or a potential match. Potential matches can be resolved through a clerical review process to determine match status. Fellegi and Sunter rely on a latent class model with two classes: a match class and a non-match class. Their model allows for the calculation of the probability that a record pair is a match (see Chipperfield and Chambers 2015; Chipperfield et al. 2018; Lahiri and Larsen 2005). The Fellegi-Sunter model has been improved in subsequent studies, for example, DuVall et al. (2010), Sadinle and Fienberg (2013), Vo et al. (2021), and Moretti and Shlomo (2023). Other models have been proposed to address limitations of the Fellegi-Sunter model, for example, Lee et al. (2022), Tancredi and Liseo (2011), and Goldstein et al. (2017).

A deterministic linkage algorithm, in its pure form, identifies matches using a high-quality identifier, such as a driver’s license number. A record pair is linked if the two records agree on the identifier. When such a high-quality identifier is not available, a new identifier, called a linking key, can be created by concatenating linking variables. A traditional linkage algorithm links a record pair if the pair agrees on the value of the linking key and the value of the linking key is unique on each file. Clark and Hahn (1995) compared the performance of deterministic and probabilistic linking in the context of the US statewide trauma registry. A study by Zhu et al. (2015) showed that probabilistic linkage consistently outperformed deterministic linkage across all simulation scenarios. However, deterministic linkage requires less computation and is faster than probabilistic linkage. Moreover, deterministic linkage can produce very high-quality links if discriminating and accurately reported linking variables exist (Gomatam et al. 2002). Doidge and Harron (2018) highlighted that the distinction between probabilistic linking and deterministic linking can be very small if linkage procedures are designed well. In practice, a deterministic linkage process often proceeds in multiple passes, where each pass applies a specific linking key to identify matching record pairs.

Precision is the proportion of correct links created through a linkage algorithm. Estimating precision is of practical importance for the following reasons (Chipperfield et al. 2018): (1) Precision estimates can help data linkers to decide if two files should be linked at all; (2) Precision can be used to choose the set of linking variables that are used for the linking process, and help to refine how linkage is carried out; (3) Precision can be used to reduce bias of various parameter estimates such as coefficients of a regression model (Chambers 2009; Chipperfield 2019; Kim and Chambers 2010). In the context of multi-pass linkages, a pass precision is defined as the proportion of correct links formed within a particular pass.

Chipperfield et al. (2018) proposed to estimate expected pass precision using a bootstrap approach. Their approach works by simulating plausible agreement patterns for all record pairs and then performing linkages on the simulated agreement patterns to estimate pass precision. It assumes that a record in the first file can be falsely linked to any non-matching record in the second file with the same probability, which does not hold in general. In this paper, we propose two new methods for estimating precision in deterministic record linkage, both of which are motivated by the Fellegi and Sunter (1969) model. There are several key differences between our proposed approach and the bootstrap approach of Chipperfield et al. (2018): (1) Alternative Simulation Approach: Instead of simulating plausible agreement patterns, our method simulates plausible record values for the second file based on the observed record values of the first file. This provides a more realistic replication of how linking variables are distributed. (2) Relaxed Assumption on False Linkage Probability: Chipperfield et al. (2018) assume that any non-matching record in the second file can be linked with equal probability to a record in the first file. Our method does not impose this assumption, making it more flexible for real-world applications. (3) Robustness to Skewed Linking Variables: Our method demonstrates greater robustness to skewed distributions of linking variables.

This paper is organized as follows: Section 2 introduces the workflow of deterministic linkage and defines the associated measure of precision. Section 3 introduces and discusses our methods for precision estimation. Section 4 illustrates the performance of our estimators through simulation. Section 5 concludes the paper.

2. Workflow of Deterministic Linkage with Multi-Passes

This section introduces the workflow of a deterministic linkage algorithm with multiple passes. We then describe the linkage process of D-MAC, which is a multi-pass deterministic linkage method used at the Australian Bureau of Statistics.

2.1. Linking Key and Uniqueness Ratio

A linking key is formed by concatenating a set of linking variables, such as name, sex, and postcode. The Uniqueness Ratio (UR) of a linking key is defined as:

$UR = \frac{\text{Number of Unique Agreements}}{\text{Total Number of Agreements}}.$

It is important to clarify that the UR for each linking key is calculated independently. This ensures that the UR reflects the discriminative power of the linking key itself and is not influenced by the order in which linking keys are used in the linkage process. The ranking of linking keys based on UR is performed before the linkage process begins. A high UR suggests that the linking key is highly discriminative; that is, agreement on this key is more likely to indicate a true match than a non-match. Conversely, a low UR suggests that the linking key may lead to a significant number of incorrect links.
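As a rough illustration of the UR, the sketch below computes it for one linking key in Python. The text does not spell out how agreements are counted, so the counting rule here is an assumption: an agreement is a cross-file record pair with equal key values, and it is unique when that key value occurs exactly once on each file. The function name `uniqueness_ratio` and the sample records are hypothetical.

```python
from collections import Counter

def uniqueness_ratio(keys_a, keys_b):
    """Uniqueness Ratio of one linking key under an ASSUMED counting rule.

    Assumption (not stated in the text): an agreement is a cross-file record
    pair whose linking-key values are equal; it is unique when that key value
    occurs exactly once on each file.
    """
    count_a = Counter(keys_a)
    count_b = Counter(keys_b)
    total = 0
    unique = 0
    for value, n_a in count_a.items():
        n_b = count_b.get(value, 0)
        total += n_a * n_b              # every cross-file pair sharing this value
        if n_a == 1 and n_b == 1:
            unique += 1
    return unique / total if total else 0.0

# Linking key = (name, sex, postcode); the values are illustrative only.
file_a_keys = [("ann", "F", "2600"), ("bob", "M", "2601"), ("bob", "M", "2601")]
file_b_keys = [("ann", "F", "2600"), ("bob", "M", "2601"), ("cat", "F", "2602")]
ur = uniqueness_ratio(file_a_keys, file_b_keys)
```

Because the ratio is computed from the key values alone, it can be evaluated for every candidate key before any pass is run, which is what allows the ranking to be fixed in advance.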

2.2. Multi-Passes and Pass Precision

In practice, using one linking key to link two files often results in a large number of matching record pairs being left unlinked. To address this issue, a multi-pass linkage algorithm can be used. A multi-pass linkage algorithm typically involves: (1) ranking linking keys using some metric, such as the UR; (2) linking records using a key, excluding records that were linked in a previous (or higher ranked) pass.

If two files are linked using a single linking key, a deterministic linkage algorithm assigns a match to a record pair if the following two conditions hold:

(a) The record pair agrees on the linking key.

(b) The value of the linking key is unique within each file.

When multiple passes are used, some algorithms, such as D-MAC, introduce a third condition for establishing links:

(c) The current pass is the best pass for both records in the pair.

The best pass for a record means that the record does not agree with any other record in an earlier pass. The concept of “best pass” assumes that a linking key is less discriminative, or “less trustworthy,” than the keys used in earlier passes. This reflects a natural trade-off in the design of deterministic linkage algorithms, where the most accurate keys are applied first to minimize false positives. Let $A_{i}$ and $B_{j}$ be the $i$ -th and $j$ -th record on FileA and FileB. When $A_{1}$ and $A_{2}$ both agree with $B_{1}$ in Pass 1 and only $A_{2}$ agrees with $B_{1}$ in Pass 2, neither link will be created: in Pass 1 the linking key value shared by $A_{1}$ and $A_{2}$ is not unique on FileA, and Pass 2 is not the best pass for $A_{2}$ and $B_{1}$ .

2.3. D-MAC Linkage Process

In this subsection we introduce the D-MAC linkage process. D-MAC is a deterministic linkage SAS macro. It supports several useful features, such as a choice of comparators for data linkage. For instance, when exact comparators are used, two records must agree exactly in order to establish a link. When approximate comparators are used, some discrepancies are tolerated when establishing a link.

Suppose D-MAC uses $k$ linking keys (or $k$ passes) to link two files. The linkage process is described below:

D-MAC ranks the linking keys in descending order of their URs.

In Pass 1, D-MAC links a record pair if (a) and (b) are satisfied.

Records not linked in Pass 1 will be assessed in Pass 2. D-MAC links a record pair in Pass 2 if (a), (b), (c) are satisfied.

Records not linked in Pass 1 or 2 may be linked in Pass 3. D-MAC links a record pair in Pass 3 if (a), (b), (c) are satisfied. The same process is repeated for all subsequent passes (i.e., Pass 4, …, Pass $k$ ).
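The pass structure described above can be sketched in Python as follows. This is not D-MAC itself (which is a SAS macro) but a minimal illustration under stated assumptions: exact comparators only, uniqueness checked among records not yet linked, and a record that agreed with any record in an earlier pass treated as having passed its best pass. The function name `multi_pass_link` and the toy records are hypothetical.

```python
from collections import defaultdict

def multi_pass_link(file_a, file_b, passes):
    """Sketch of multi-pass deterministic linkage with conditions (a), (b), (c).

    file_a, file_b: lists of dicts, one dict per record.
    passes: list of tuples of variable names, already ranked (e.g. by UR).
    Assumptions: uniqueness is checked among records not yet linked, and a
    record that agreed with any record in an earlier pass cannot be linked
    in a later pass (the "best pass" rule).
    Returns a list of (index_a, index_b, pass_number).
    """
    links = []
    linked_a, linked_b = set(), set()
    agreed_a, agreed_b = set(), set()     # records that agreed in an earlier pass

    for pass_no, key_vars in enumerate(passes, start=1):
        groups_a, groups_b = defaultdict(list), defaultdict(list)
        for i, rec in enumerate(file_a):
            if i not in linked_a:
                groups_a[tuple(rec[v] for v in key_vars)].append(i)
        for j, rec in enumerate(file_b):
            if j not in linked_b:
                groups_b[tuple(rec[v] for v in key_vars)].append(j)

        new_agreed_a, new_agreed_b = set(), set()
        for value, idx_a in groups_a.items():
            idx_b = groups_b.get(value)
            if not idx_b:
                continue                  # (a) fails: no agreement on the key
            new_agreed_a.update(idx_a)
            new_agreed_b.update(idx_b)
            unique = len(idx_a) == 1 and len(idx_b) == 1                    # (b)
            best = idx_a[0] not in agreed_a and idx_b[0] not in agreed_b    # (c)
            if unique and best:
                links.append((idx_a[0], idx_b[0], pass_no))
                linked_a.add(idx_a[0])
                linked_b.add(idx_b[0])
        agreed_a |= new_agreed_a
        agreed_b |= new_agreed_b
    return links

# The example from Section 2.2: A1 and A2 both agree with B1 in Pass 1,
# and only A2 agrees with B1 in Pass 2, so no link is created.
file_a = [{"v1": 1, "v2": 1}, {"v1": 1, "v2": 2}]
file_b = [{"v1": 1, "v2": 2}]
example_links = multi_pass_link(file_a, file_b, [("v1",), ("v2",)])
```

In Pass 1 the set of earlier agreements is empty, so condition (c) holds trivially and only (a) and (b) matter, matching the description of Pass 1 above.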

3. Pass Precision Estimation

In this section, we introduce two methods for estimating pass precision in multi-pass deterministic record linkage. Although we illustrate these methods using D-MAC, the approach is applicable to any deterministic linkage framework where records are linked in multiple passes based on predefined rules. For the first method, we assume all linking variables are uniformly distributed and show that a pass precision estimator can be derived analytically. For the second method, we introduce a replication approach for estimating pass precision. The replication approach requires a match set but does not assume particular distributions for the linking variables. A match set is a sample of record pairs with known match status and is used for estimating key parameters of a linkage process.

We suppose FileA contains $n_{A}$ records, and FileB contains $n_{B}$ records. Each record in FileA has at most one matching record in FileB, and vice versa. There are $n_{0}$ true matches between FileA and FileB and $k$ linking variables between the two files. Let $A_{i}$ be the $i$ -th record in FileA, and $B_{j}$ be the $j$ -th record in FileB.

3.1. Method 1: An Analytic Pass Precision Estimator

For a particular linking key, we let $M$ be the probability that a match record pair agrees on the linking key and $U$ be the probability that a non-match record pair agrees on the linking key. Here we adopt a conditional independence assumption, which is typical in the probabilistic linkage approach. That is, we assume $M = \prod_{v \in S} m_{v}$ and $U = \prod_{v \in S} u_{v}$ , where $S$ is the set of linking variables used in a given linking key, $m_{v}$ is the probability that a match record pair agrees on linking variable $v$ , and $u_{v}$ is the probability that a non-match record pair agrees on linking variable $v$ .

We assume all linking variables are uniformly distributed. While the assumption of uniformity is not universally valid for all linking variables, certain variables such as Sex, Year of Birth, and State may exhibit distributions that are close to uniform. For a linking key assigned to Pass 1, a record pair is linked if the pair agrees on the linking key and the value of the linking key is unique on each file. Denote the precision of the linking key if it is assigned to the first pass, or First Pass Precision, as $π_{1}$ . Following Bayes’ rule, the probability that a linked pair is a true match can be expressed as:

$π_{1} = P (\text{Match} \mid \text{Link}) = \frac{P (\text{Link} \mid \text{Match}) P (\text{Match})}{P (\text{Link} \mid \text{Match}) P (\text{Match}) + P (\text{Link}, \text{Non-Match})}$

It can be shown that $π_{1}$ has the following expression:

$π_{1} = P (\text{Match} \mid \text{Link}) = \frac{T_{1}}{T_{1} + T_{2} + T_{3} + T_{4} + T_{5}},$ (1)

where

$\begin{aligned}
T_{1} &= n_{0} M (1 - U) \{(1 - U) - U (1 - M)\} / (n_{A} n_{B}) \\
T_{2} &= (n_{0}^{2} - n_{0}) (1 - M)^{2} (1 - U) U / (n_{A} n_{B}) \\
T_{3} &= n_{0} (n_{B} - n_{0}) (1 - M) U \{(1 - U) - U (1 - M)\} / (n_{A} n_{B}) \\
T_{4} &= (n_{A} - n_{0}) (n_{B} - n_{0}) \{(1 - U) - U (1 - M)\}^{2} U / \{(1 - U) n_{A} n_{B}\} \\
T_{5} &= (n_{A} - n_{0}) n_{0} \{(1 - U) - U (1 - M)\} (U - UM) / (n_{A} n_{B})
\end{aligned}$
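To make Equation (1) concrete, the following Python sketch evaluates $T_{1}$ to $T_{5}$ for one linking key. It assumes one particular reading of the printed terms: every term is divided by $n_{A} n_{B}$ , and $T_{4}$ is additionally divided by $(1 - U)$ . The function name `first_pass_precision` is hypothetical, and $M$ and $U$ are formed as products under the conditional independence assumption.

```python
from math import prod

def first_pass_precision(m, u, n0, n_a, n_b):
    """Evaluate Equation (1) for one linking key.

    m, u: per-variable agreement probabilities for match / non-match pairs.
    Assumed reading of the printed formula: each term is divided by
    n_a * n_b, and T4 is additionally divided by (1 - U).
    """
    M = prod(m)                    # M = product of m_v over the key's variables
    U = prod(u)                    # U = product of u_v over the key's variables
    g = (1 - U) - U * (1 - M)      # the bracketed factor shared by T1, T3, T4, T5
    scale = n_a * n_b
    t1 = n0 * M * (1 - U) * g / scale
    t2 = (n0**2 - n0) * (1 - M) ** 2 * (1 - U) * U / scale
    t3 = n0 * (n_b - n0) * (1 - M) * U * g / scale
    t4 = (n_a - n0) * (n_b - n0) * g**2 * U / ((1 - U) * scale)
    t5 = (n_a - n0) * n0 * g * (U - U * M) / scale
    return t1 / (t1 + t2 + t3 + t4 + t5)
```

One sanity check on this reading: when $U = 0$ , non-match pairs never agree, the terms $T_{2}$ to $T_{5}$ vanish, and the function returns a precision of 1.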

To estimate the pass precision of Pass $t$ , denoted as $\hat{P P_{t}}$ , we use the following formula from Chipperfield et al. (2018):

$\hat{P P_{t}} = \frac{N_{t} π_{1} - C_{t - 1}}{L_{t}},$ (2)

where $N_{t}$ is the number of links that could be made by the linking key of Pass $t$ if that key were assigned to the first pass, $π_{1}$ is the First Pass Precision of that key from Equation (1), $C_{t - 1} = \sum_{k = 1}^{t - 1} L_{kt} \cdot (\hat{P P_{k}})$ , $L_{kt}$ is the number of links that could have been made in the first pass but are instead linked in the $k$ -th pass, and $L_{t}$ is the number of links made in Pass $t$ . The formula assumes that $L_{kt} \cdot (\hat{P P_{k}})$ is the expected number of correct links in Pass $k$ which would have been linked in the first pass. We found that this assumption works well in earlier passes, but less well in later passes, as $L_{kt} \cdot (\hat{P P_{k}})$ tends to underestimate the number of correct links made in later passes.
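Equation (2) is a simple recursion over passes, which the sketch below evaluates in Python. The function name `pass_precision`, its argument names, and the numbers in the usage lines are illustrative assumptions, not values from the paper.

```python
def pass_precision(n_t, pi_1, links_by_earlier_pass, earlier_estimates, l_t):
    """Equation (2): estimate the precision of Pass t.

    n_t: links the Pass t key would make if it were assigned to the first pass.
    pi_1: First Pass Precision of that key, e.g. from Equation (1).
    links_by_earlier_pass: [L_1t, ..., L_(t-1)t].
    earlier_estimates: [PP_1, ..., PP_(t-1)].
    l_t: links actually made in Pass t.
    """
    c_prev = sum(
        l_kt * pp_k
        for l_kt, pp_k in zip(links_by_earlier_pass, earlier_estimates)
    )
    return (n_t * pi_1 - c_prev) / l_t

# Illustrative numbers only, not taken from the paper.
pp1 = pass_precision(n_t=100, pi_1=0.9, links_by_earlier_pass=[],
                     earlier_estimates=[], l_t=100)
pp2 = pass_precision(n_t=150, pi_1=0.8, links_by_earlier_pass=[90],
                     earlier_estimates=[pp1], l_t=60)
```

For the first pass there are no earlier passes and $N_{1} = L_{1}$ , so the estimate reduces to $π_{1}$ . In the second call, 120 expected correct links minus the 81 already attributed to Pass 1 leaves 39 over 60 links, giving 0.65.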

We note that Equation (1) can serve as an alternative criterion for ranking and selecting linking keys. Future research could compare the effectiveness of key ranking using Equation (1) against using the UR.

3.2. Method 2: Replication Approach

In this section we introduce a replication approach for estimating pass precision. Chipperfield and Chambers (2015) introduced a bootstrap approach for estimating precision of probabilistically linked data (corresponding to the $λ$ parameter of the exchangeable linkage model in their paper). The same approach was then applied to deterministically linked data in Chipperfield et al. (2018) where linkage is imperfect. Their approach works by replicating agreement patterns using the latent model between the two files to be linked. The authors assumed that $n_{0} = n_{A} = n_{B}$ , that the conditional independence assumption of the Fellegi-Sunter model holds, and that $m$ and $u$ can be estimated without bias. The bootstrap approach works by simulating $n_{0}$ agreement patterns for match pairs and $n_{0}^{2} - n_{0}$ agreement patterns for non-match pairs using estimated $m$ and $u$ probabilities. The probabilistic linkage process is then applied to these simulated agreement patterns and the proportion of correct links is recorded. The process is repeated multiple times, and the average proportion of correct links is the precision estimate.

Our replication approach follows a similar idea. However, we replicate plausible linking variable values of the match set rather than the agreement patterns. In addition, we do not assume $n_{0} = n_{A} = n_{B}$ . We do assume $m$ , $u$ , and $n_{0}$ can be estimated unbiasedly. We note that the assumption of unbiased estimation for parameters such as $m$ , $u$ , and $n_{0}$ is a strong one and may not hold in many linkage scenarios.

Unlike their bootstrap approach where the replication process is repeated multiple times and the average proportion of correct links is used as a Precision estimate, our approach only replicates the whole process once. This is because the sizes of files for a typical linkage project can be huge, and it can be impractical to repeat the replication process multiple times if the estimates need to be produced in a timely manner. The pass precision estimate can have a low standard error if the number of links made in the pass is large.

Suppose we want to estimate the pass precision of linking FileA and FileB with $k$ linking variables. The idea is to replicate a match file SimB using FileA with the following requirements:

There are $n_{B}$ records on SimB and $n_{0}$ matches between FileA and SimB.

The matching status between FileA and SimB is known.

The distribution of agreement patterns for record pairs between FileA and SimB and the distribution of agreement patterns for record pairs between FileA and FileB follow the same latent model.

If SimB can be replicated, then we can link FileA and SimB using D-MAC and obtain the precision of each pass. These values can be used to estimate the pass precision of linking FileA and FileB. As the agreement patterns for (FileA, FileB) and (FileA, SimB) follow the same latent model, linking (FileA, FileB) and (FileA, SimB) should lead to the same expected pass precision.

Assuming all the parameters of the latent model and the underlying distributions of linking variables are known, the replication approach works as follows:

Step 1: We randomly choose $n_{0}$ records out of $n_{A}$ records on FileA and denote these records as $({\tilde{A}}_{1}, {\tilde{A}}_{2}, \dots, {\tilde{A}}_{n_{0}})$ . These records will have matching records on SimB. Denote the remaining records on FileA as $({\tilde{A}}_{n_{0} + 1}, \dots, {\tilde{A}}_{n_{A}})$ .

Step 2: Simulate SimB using FileA. Denote records in SimB as $({\tilde{B}}_{1}, {\tilde{B}}_{2}, \dots, {\tilde{B}}_{n_{0}}, \dots, {\tilde{B}}_{n_{B}})$ , where $({\tilde{A}}_{i}, {\tilde{B}}_{i})$ is a match for $i \leq n_{0}$ . Let ${\tilde{b}}_{il}$ be the value of ${\tilde{B}}_{i}$ on the $l$ -th linking variable and ${\tilde{a}}_{il}$ be the value of ${\tilde{A}}_{i}$ on the $l$ -th linking variable. Let $C_{l}$ be the number of possible values that the $l$ -th linking variable can take. For $i \leq n_{0}$ and ${\tilde{a}}_{il} = g$ , simulate a value of ${\tilde{b}}_{il}$ according to the following distribution:

${\tilde{b}}_{il} = \begin{cases} 1 & \text{with probability } q_{1 g} \\ 2 & \text{with probability } q_{2 g} \\ \vdots & \\ C_{l} & \text{with probability } q_{C_{l} g} \end{cases}$

where $q_{hg} = P ({\tilde{b}}_{il} = h | {\tilde{a}}_{il} = g)$ , $h, g = 1, 2, \dots, C_{l}$ .

Step 3: For $n_{0} < i \leq n_{B}$ , simulate ${\tilde{b}}_{il}$ by randomly drawing a value from the underlying distribution of the $l$ -th linking variable on FileB.

Step 4: Link FileA and SimB using D-MAC. The proportion of correct links in each pass is the pass precision estimate of linking FileA and FileB.
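Steps 1 to 3 can be sketched in Python as below. This is a minimal illustration, assuming integer-coded linking variables and assuming the conditional probabilities and the FileB marginal distributions are already available as dictionaries; the function name `simulate_sim_b` and its argument layout are hypothetical. Step 4, the linkage itself, is left to whichever deterministic linker is in use.

```python
import random

def simulate_sim_b(file_a, n0, n_b, q, dist_b, seed=0):
    """Steps 1-3 of the replication approach (illustrative sketch).

    file_a: list of records, each a tuple of integer-coded linking variables.
    q[l][g]: dict mapping value h to P(b = h | a = g) for variable l.
    dist_b[l]: dict mapping value to its probability on FileB for variable l.
    Returns (order_a, sim_b): order_a[i] is the FileA index matched to
    sim_b[i] for i < n0; the records sim_b[n0:] have no match on FileA.
    """
    rng = random.Random(seed)
    n_vars = len(file_a[0])

    def draw(dist):
        values = list(dist)
        return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]

    # Step 1: choose which FileA records receive a match on SimB.
    order_a = rng.sample(range(len(file_a)), n0)

    # Step 2: matched records, each value drawn from q given the FileA value.
    sim_b = [
        tuple(draw(q[l][file_a[i][l]]) for l in range(n_vars))
        for i in order_a
    ]
    # Step 3: unmatched records, drawn from the FileB marginal distributions.
    sim_b += [
        tuple(draw(dist_b[l]) for l in range(n_vars))
        for _ in range(n_b - n0)
    ]
    return order_a, sim_b

# Toy usage: two binary variables, matched values always copied unchanged.
file_a = [(1, 2), (2, 1), (1, 1)]
identity = {1: {1: 1.0}, 2: {2: 1.0}}
uniform = {1: 0.5, 2: 0.5}
order_a, sim_b = simulate_sim_b(file_a, n0=2, n_b=4,
                                q=[identity, identity],
                                dist_b=[uniform, uniform])
```

Because the match status between FileA and SimB is known by construction (through `order_a`), the proportion of correct links in each pass can be computed directly after linking.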

It is easy to see that the parameters $(n_{0}, n_{A}, n_{B})$ are the same for (FileA, FileB) and (FileA, SimB). For $m_{l}$ , $l = 1, 2, \dots, k$ , we have $m_{l} = \sum_{i = 1}^{C_{l}} {\hat{p}}_{i} q_{ii}$ , where ${\hat{p}}_{i}$ is the proportion of records having the linking variable value $i$ on FileA; therefore the $m_{l}$ values are the same for (FileA, FileB) and (FileA, SimB). For $u_{l}$ , $l = 1, 2, \dots, k$ , note that $u_{l}$ is the probability that a value drawn from $D_{lA}$ agrees with an independently drawn value from $D_{lB}$ , where $D_{lA}$ and $D_{lB}$ are the distributions of the $l$ -th linking variable on FileA and FileB, respectively. As $D_{lB}$ is also the distribution of the $l$ -th linking variable on SimB, the $u_{l}$ values are the same for (FileA, FileB) and (FileA, SimB). As a result, the agreement patterns for (FileA, FileB) and (FileA, SimB) follow the same underlying probabilistic model. This ensures that both settings yield the same expected pass precision.

The conditional probabilities $(q_{1 g}, q_{2 g}, \dots, q_{C_{l} g})$ can be estimated using a good match set. To create a match set, we draw a sample of records from FileA (denoted as SampleA) and then find their matches in FileB through clerical review. Then, $q_{hg}$ is estimated as the number of match record pairs in the match set having linking variable value $g$ on SampleA and linking variable value $h$ on FileB, divided by the number of match record pairs in the match set having linking variable value $g$ on SampleA. We can also use the match set to estimate $m$ , $u$ , and $n_{0}$ . Specifically, for the $l$ -th linking variable, $m_{l}$ can be estimated by the proportion of match pairs agreeing on the linking variable, and $u_{l}$ by the proportion of non-match pairs agreeing on the linking variable. $n_{0}$ can be estimated by multiplying $n_{A}$ by the proportion of SampleA records that have a match in FileB, that is, the number of matching record pairs in the match set divided by the number of records in SampleA. The underlying distribution of the linking variable on FileB can be estimated using the empirical distribution of the linking variable. Note that we use a single bootstrap replicate to estimate precision for reasons of computational efficiency. While multiple bootstrap replicates (e.g., 10 or more) would reduce the variance of the precision estimates, the additional computational cost can be prohibitive, particularly for large datasets.
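The counting estimator for the conditional probabilities can be sketched in Python as follows, using the definition $q_{hg} = P ({\tilde{b}}_{il} = h | {\tilde{a}}_{il} = g)$ from Step 2. The function names `estimate_q` and `implied_m` and the four toy pairs are hypothetical; the second function simply evaluates $m_{l} = \sum_{i} {\hat{p}}_{i} q_{ii}$ .

```python
from collections import Counter, defaultdict

def estimate_q(match_pairs):
    """Estimate q_hg = P(b = h | a = g) for one linking variable.

    match_pairs: iterable of (a_value, b_value) taken from clerically
    confirmed matches in the match set.
    Returns a nested dict q[g][h].
    """
    joint = defaultdict(Counter)
    for a_value, b_value in match_pairs:
        joint[a_value][b_value] += 1
    return {
        g: {h: n / sum(row.values()) for h, n in row.items()}
        for g, row in joint.items()
    }

def implied_m(q, p_a):
    """m_l = sum_i p_i * q_ii, where p_a[i] is the share of records with value i."""
    return sum(p * q.get(i, {}).get(i, 0.0) for i, p in p_a.items())

# Toy match set for one variable: three pairs agree, one disagrees.
pairs = [(1, 1), (1, 1), (1, 2), (2, 2)]
q_hat = estimate_q(pairs)
m_hat = implied_m(q_hat, {1: 0.75, 2: 0.25})
```

In this toy example the implied value of $m_{l}$ is 0.75, which equals the raw proportion of agreeing match pairs (3 of 4) because the value shares passed to `implied_m` are those of the match set itself.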

We note that the replication pass precision estimator is applicable beyond D-MAC. It should be applicable to other deterministic linkage processes where:

The deterministic linkage process contains multiple passes.

Training data are available to estimate match and non-match agreement probabilities.

The agreement probabilities can be estimated reliably.

3.3. Reliability Discussion

Our approach does not aim to directly recover the exact values of linking variables from agreement patterns. Instead, we estimate the probabilities of agreement for match pairs $m$ and non-match pairs $u$ , and use these probabilities to simulate plausible linking variable values in the replication approach. The key insight is that as long as $m$ and $u$ are accurately estimated, the estimated precision remains valid, even if the exact values of the linking variables are not recovered. The reliability of the replication approach is based on the following assumptions:

Accuracy of $m$ and $u$ Estimations: The correctness of our approach depends primarily on accurately estimating the match and non-match agreement probabilities. If these probabilities are well-estimated either directly (Fellegi and Sunter 1969), or from a representative training set, our precision estimates remain robust.

Replication Consistency: When simulating plausible linking variable values, we assume that the simulated value of a linking variable depends only on the corresponding value of the match record. This assumption makes it straightforward to preserve the $m$ and $u$ probabilities in the simulated files.

Reliability of the Approach: We conducted sensitivity tests by varying the assumed linking variable distributions and examined their impact on pass precision estimates. Our findings show that:

As long as $m$ and $u$ estimates are good, the estimated precision is highly robust.

Small deviations in assumed variable distributions do not significantly affect the estimated precision.

In cases where $m$ and $u$ are misestimated (e.g., due to a biased training set), precision estimates may be affected.

This analysis confirms that our approach does not rely on the exact reconstruction of variable values, but rather on the accuracy of agreement probabilities. Future research could explore adaptive estimation techniques to further refine the transition from agreement patterns to variable values. In addition, given the increasing reliance on automated linkage methods in practice, evaluating the performance of our estimator in situations where manual clerical review is limited would be a valuable extension.

3.4. Considerations for Training Set Size

The match set is an essential component of our replication approach, as it provides the empirical basis for estimating the key linkage parameters $m$ , $u$ , and $n_{0}$ . The selection of the match set size and design has important implications for the reliability of these estimates.

3.4.1. Determining the Match Set Size

The size of the match set should be large enough to capture the variability in agreement patterns while remaining computationally feasible. The required size depends on two primary factors:

Dataset Size: In large datasets (e.g., millions of records), a relatively small proportion of the dataset may be sufficient for accurate parameter estimation. Alternatively the $m$ and $u$ parameters can be estimated without a match set (Fellegi and Sunter 1969).

Complexity of Linking Variables: If the space of possible linking variable values is large (e.g., categorical variables with high cardinality, such as names), a larger match set may be necessary to capture sufficient diversity in agreement patterns.

3.4.2. Efficiency of Random Sampling for Match Set Construction

In our study, the match set is constructed using a random sample of match record pairs between FileA and FileB. The match record pairs are identified by clerical review. Alternative designs may improve efficiency. For example:

Stratified Sampling: Ensuring proportional representation of different linking variable values in the match set.

Adaptive Sampling: Prioritizing uncertain cases for clerical review to optimize match set quality.

We acknowledge that exploring these designs in simulation studies would be valuable and suggest this as an area for future research.

3.4.3. Match Set Size Requirements for Direct Pass Precision Estimation

An alternative approach to estimating pass precision would be to derive it directly from the match set rather than estimating $m$ , $u$ , and $n_{0}$ separately. However, this would require a much larger match set, because:

The match set would need to contain a sufficiently large number of matches covering all agreement patterns across multiple linkage passes.

Many record pairs that contribute to pass precision estimates are not matches, meaning that a larger sample would be needed to achieve stable estimates.

By contrast, our approach leverages a smaller match set to estimate key parameters, making it computationally feasible while maintaining accuracy.

4. Simulation

In this section, we illustrate the performance of our pass precision estimators using synthetic datasets. We designed a set of simulation scenarios to assess the performance of our estimators under different distributions of linking variables. The performance of the analytic estimator was found to be highly sensitive to the distribution of linking variables. In particular, the degree of bias increased with the level of skewness in the linking variable distributions. The replication approach demonstrated robustness across different levels of skewness.

Suppose we were linking two files, FileA and FileB, and each file contained five linking variables $(V_{1}, V_{2}, V_{3}, V_{4}, V_{5})$ . For FileA, we assumed $V_{1}$ , $V_{2}$ , and $V_{3}$ follow different truncated normal distributions, while $V_{4}$ and $V_{5}$ follow different uniform distributions. These distributions were chosen to mimic the distributions of the following linking variables: Country of Birth, Marital Status, Level of Qualification, Sex, and Day and Month of Birth, which were used for the Australian Census Longitudinal Dataset 2006-2016 (Australian Bureau of Statistics 2019). For Sex and Day and Month of Birth, it is reasonable to assume they are uniformly distributed, justifying the distributions of $V_{4}$ and $V_{5}$ .

For instance, for Country of Birth, the value Australia would occur with much higher frequency than other values. To model this, we employed a truncated normal distribution to generate continuous values, which were rounded and mapped to categorical labels. This approach allows us to control the skewness of categorical distributions in a flexible, parameterized way, rather than assuming a uniform or predefined discrete probability distribution. While an alternative method would be to define a discrete probability distribution explicitly, our method provides a systematic way to adjust category probabilities while preserving a continuous-to-discrete transformation that captures real-world distributions.

Let the number of possible values for $V_{l}$ be $C_{l}$ . We have $\{C_{l}\}_{l = 1}^{5} = (195, 5, 15, 2, 365)$ . We assumed the linking variables are independent, and the $l$ -th linking variable can take any positive integer between 1 and $C_{l}$ . The size of FileA is 2,100 and the size of FileB is 2,200, and the number of match pairs between the two files is 2,000, that is, $n_{A} = 2100$ , $n_{B} = 2200$ , and $n_{0} = 2000$ . Denote the number of records used to create a match set on FileA as $s_{1}$ . Throughout the simulation we let $s_{1} = 420$ . The match set is then used for estimating $m$ , $n_{0}$ and all the conditional probabilities for the replication approach.

We simulate different linkage scenarios with different $m$ and $σ$ values. The value of $σ$ determines the skewness and the $u$ value of each linking variable. We will show how the performance of our estimators depends on these values.

4.1. The Simulation Process

We simulated the linking variable values of FileA and FileB in the following way:

We first simulated linking variable values for FileA. Let $A_{i}$ be the $i$ -th record on FileA, $o_{il}^{A}$ be the linking variable value of $V_{l}$ for $A_{i}$ , $i = 1, 2, \dots, n_{A}$ and $l = 1, 2, \dots, 5$ . For $l = 4, 5$ , the value of $o_{il}^{A}$ was simulated by randomly drawing an observation from $U (0, C_{l})$ , and then rounded up to the nearest integer. For $l = 1, 2, 3$ , the value of $o_{il}^{A}$ was simulated by randomly drawing an observation from $N (2, σ^{2})$ and then rounded up to the nearest integer. Denote the underlying distribution of $V_{l}$ in FileA as $D_{l}$ .

Let $B_{j}$ be the $j$ -th record on FileB and $o_{jl}^{B}$ be the $l$ -th linking variable value for $B_{j}$ . We assumed $(A_{j}, B_{j})$ is a match for $j \leq n_{0}$ . For $j \leq n_{0}$ , the value of $o_{jl}^{B}$ was set equal to $o_{jl}^{A}$ with probability $q_{l}$ ; otherwise, $o_{jl}^{B}$ was simulated by randomly drawing an observation from $D_{l}$ . For $n_{0} < j \leq n_{B}$ , the value of $o_{jl}^{B}$ was simulated by randomly drawing a value from $D_{l}$ .
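The data-generating process above can be sketched in Python as follows. Two points are assumptions rather than statements from the text: the truncated normal is implemented by rejection sampling on the interval from 0 to $C_{l}$ , and a matched FileB value that is not copied from FileA is drawn from $D_{l}$ . The names `draw_value` and `simulate_files` are hypothetical, and `sigma` is the standard deviation (the square root of $σ^{2}$ ).

```python
import math
import random

C = (195, 5, 15, 2, 365)       # number of categories for V1..V5

def draw_value(l, sigma, rng):
    """Draw one value of the (l+1)-th linking variable from D_l (0-based l)."""
    if l >= 3:                                   # V4, V5: uniform on (0, C_l)
        x = rng.uniform(0, C[l])
    else:                                        # V1-V3: truncated normal
        # Assumption: truncate N(2, sigma^2) to (0, C_l] by rejection sampling.
        while True:
            x = rng.gauss(2, sigma)
            if 0 < x <= C[l]:
                break
    return max(1, math.ceil(x))                  # round up to an integer in 1..C_l

def simulate_files(n_a, n_b, n0, q, sigma, seed=0):
    """Simulate FileA and FileB; (A_j, B_j) is a match for j < n0."""
    rng = random.Random(seed)
    file_a = [tuple(draw_value(l, sigma, rng) for l in range(5))
              for _ in range(n_a)]
    file_b = []
    for j in range(n_b):
        if j < n0:   # match: copy each value with probability q_l, else redraw
            rec = tuple(
                file_a[j][l] if rng.random() < q[l] else draw_value(l, sigma, rng)
                for l in range(5)
            )
        else:        # no match on FileA: draw every value from D_l
            rec = tuple(draw_value(l, sigma, rng) for l in range(5))
        file_b.append(rec)
    return file_a, file_b
```

A call such as `simulate_files(2100, 2200, 2000, q=(0.7, 0.7, 0.8, 0.9, 0.8), sigma=math.sqrt(20))` would mimic the file sizes used here with a mildly skewed setting. Rejection sampling becomes slow when $σ^{2}$ is very large relative to $C_{l}$ , so a direct inverse-CDF sampler may be preferable in that case.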

We considered the following six linkage scenarios:

Scenario 1: $\sigma^{2} = 100000$ , $\{q_{l}\}_{l = 1}^{5} = (0.7, 0.7, 0.8, 0.9, 0.8)$ , $\{m_{l}\}_{l = 1}^{5} = (0.701, 0.760, 0.813, 0.950, 0.8)$ , $\{u_{l}\}_{l = 1}^{5} = (0.005, 0.202, 0.067, 0.500, 0.0028)$ .

Scenario 2: $\sigma^{2} = 100000$ , $\{q_{l}\}_{l = 1}^{5} = (0.5, 0.4, 0.4, 0.6, 0.5)$ , $\{m_{l}\}_{l = 1}^{5} = (0.503, 0.519, 0.441, 0.798, 0.501)$ , $\{u_{l}\}_{l = 1}^{5} = (0.005, 0.201, 0.066, 0.501, 0.0027)$ .

Scenario 3: $\sigma^{2} = 20$ , $\{q_{l}\}_{l = 1}^{5} = (0.7, 0.7, 0.8, 0.9, 0.8)$ , $\{m_{l}\}_{l = 1}^{5} = (0.731, 0.760, 0.820, 0.950, 0.801)$ , $\{u_{l}\}_{l = 1}^{5} = (0.103, 0.199, 0.103, 0.498, 0.0027)$ .

Scenario 4: $\sigma^{2} = 20$ , $\{q_{l}\}_{l = 1}^{5} = (0.5, 0.4, 0.4, 0.6, 0.5)$ , $\{m_{l}\}_{l = 1}^{5} = (0.551, 0.519, 0.460, 0.802, 0.501)$ , $\{u_{l}\}_{l = 1}^{5} = (0.103, 0.200, 0.102, 0.502, 0.0028)$ .

Scenario 5: $\sigma^{2} = 3$ , $\{q_{l}\}_{l = 1}^{5} = (0.7, 0.7, 0.8, 0.9, 0.8)$ , $\{m_{l}\}_{l = 1}^{5} = (0.760, 0.765, 0.841, 0.950, 0.800)$ , $\{u_{l}\}_{l = 1}^{5} = (0.199, 0.218, 0.198, 0.501, 0.0028)$ .

Scenario 6: $\sigma^{2} = 3$ , $\{q_{l}\}_{l = 1}^{5} = (0.5, 0.4, 0.4, 0.6, 0.5)$ , $\{m_{l}\}_{l = 1}^{5} = (0.599, 0.531, 0.519, 0.800, 0.500)$ , $\{u_{l}\}_{l = 1}^{5} = (0.200, 0.218, 0.198, 0.501, 0.0027)$ .
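The three parameter vectors in each scenario are linked by the simulation design. Assuming $m_{l}$ and $u_{l}$ denote the probabilities of agreement on $V_{l}$ among matches and non-matches respectively, a matched pair agrees either because the FileA value was retained (probability $q_{l}$) or because the redrawn value coincides by chance (probability roughly $u_{l}$), which gives $m_{l} \approx q_{l} + (1 - q_{l}) u_{l}$. The short check below is a sketch under that assumption, not a formula taken from the paper; it compares the implied values with the reported ones, reading the second $u$ entry of Scenario 1 as 0.202.

```python
# Reported values per scenario as (q, m, u), each ordered V1..V5.
scenarios = {
    1: ((0.7, 0.7, 0.8, 0.9, 0.8), (0.701, 0.760, 0.813, 0.950, 0.8),
        (0.005, 0.202, 0.067, 0.500, 0.0028)),
    2: ((0.5, 0.4, 0.4, 0.6, 0.5), (0.503, 0.519, 0.441, 0.798, 0.501),
        (0.005, 0.201, 0.066, 0.501, 0.0027)),
    3: ((0.7, 0.7, 0.8, 0.9, 0.8), (0.731, 0.760, 0.820, 0.950, 0.801),
        (0.103, 0.199, 0.103, 0.498, 0.0027)),
    4: ((0.5, 0.4, 0.4, 0.6, 0.5), (0.551, 0.519, 0.460, 0.802, 0.501),
        (0.103, 0.200, 0.102, 0.502, 0.0028)),
    5: ((0.7, 0.7, 0.8, 0.9, 0.8), (0.760, 0.765, 0.841, 0.950, 0.800),
        (0.199, 0.218, 0.198, 0.501, 0.0028)),
    6: ((0.5, 0.4, 0.4, 0.6, 0.5), (0.599, 0.531, 0.519, 0.800, 0.500),
        (0.200, 0.218, 0.198, 0.501, 0.0027)),
}


def implied_m(q, u):
    """Agreement probability among matches implied by q_l and u_l."""
    return [ql + (1 - ql) * ul for ql, ul in zip(q, u)]


def max_gap(q, m, u):
    """Largest absolute difference between implied and reported m_l."""
    return max(abs(a - b) for a, b in zip(implied_m(q, u), m))


# Largest discrepancy in each scenario between implied and reported m_l.
gaps = {s: max_gap(q, m, u) for s, (q, m, u) in scenarios.items()}
```

Under this reading, the implied and reported $m_{l}$ values agree to within about 0.003 in every scenario, which is the size of discrepancy one would expect from rounding and simulation noise.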

For linkage scenarios 1 and 2, all linking variables are uniformly or nearly uniformly distributed. For linkage scenarios 3 and 4, $V_{1}$ , $V_{2}$ , and $V_{3}$ are mildly skewed. For linkage scenarios 5 and 6, $V_{1}$ , $V_{2}$ , and $V_{3}$ are heavily skewed. We used five passes with the following linking keys to link FileA and FileB:

Linking key of Pass 1: $V_{1}$ , $V_{2}$ , $V_{3}$ , $V_{4}$ , $V_{5}$ .

Linking key of Pass 2: $V_{1}$ , $V_{2}$ , $V_{3}$ , $V_{5}$ .

Linking key of Pass 3: $V_{1}$ , $V_{3}$ , $V_{4}$ , $V_{5}$ .

Linking key of Pass 4: $V_{1}$ , $V_{2}$ , $V_{4}$ , $V_{5}$ .

Linking key of Pass 5: $V_{2}$ , $V_{3}$ , $V_{4}$ , $V_{5}$ .

For each linkage scenario we repeated the simulation process 500 times. In each iteration we simulated FileA and FileB and then linked them using D-MAC. We used exact comparators to define agreement for all linking variables, that is, two values must match exactly to be an agreement. As the match status between FileA and FileB was known, we calculated the true pass precision of each pass. Then we estimated the pass precision using the two estimators. We recorded the true pass precision and the pass precision estimates. We calculated three averages: the average of the true pass precision, the average of the pass precision estimates using the analytic estimator (Equations (1) and (2)), and the average of pass precision estimates using the replication approach.
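To make the multi-pass procedure and the true pass precision concrete, the sketch below implements a generic exact-match linker. It is not D-MAC: the rule that a pair is linked only when its key value is unique among the unlinked records on both files is an assumption made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Linking keys of the five passes, as 0-based column indices of V1..V5.
PASS_KEYS = [(0, 1, 2, 3, 4), (0, 1, 2, 4), (0, 2, 3, 4), (0, 1, 3, 4), (1, 2, 3, 4)]


def index_by_key(records, unlinked, key):
    """Group still-unlinked record indices by their value on the linking key."""
    groups = defaultdict(list)
    for idx in unlinked:
        groups[tuple(records[idx][l] for l in key)].append(idx)
    return groups


def multipass_link(file_a, file_b, pass_keys=PASS_KEYS):
    """Return links as (pass_number, i, j) tuples.

    Assumed rule (not necessarily D-MAC's): link a pair in a pass only if
    its key value is unique among unlinked records on both files.
    """
    free_a, free_b = set(range(len(file_a))), set(range(len(file_b)))
    links = []
    for p, key in enumerate(pass_keys, start=1):
        groups_a = index_by_key(file_a, free_a, key)
        groups_b = index_by_key(file_b, free_b, key)
        new_links = []
        for value, idx_a in groups_a.items():
            idx_b = groups_b.get(value, [])
            if len(idx_a) == 1 and len(idx_b) == 1:
                new_links.append((p, idx_a[0], idx_b[0]))
        # Linked records are removed before the next pass.
        for _, i, j in new_links:
            free_a.discard(i)
            free_b.discard(j)
        links.extend(new_links)
    return links


def true_pass_precision(links, n0):
    """Share of links in each pass that are matches, i.e. i == j and i < n0."""
    totals, correct = defaultdict(int), defaultdict(int)
    for p, i, j in links:
        totals[p] += 1
        correct[p] += int(i == j and i < n0)
    return {p: correct[p] / totals[p] for p in totals}


# Toy files: the first two records of each file are matches (n0 = 2).
toy_a = [(1, 1, 1, 1, 1), (2, 2, 2, 2, 2), (3, 3, 3, 3, 3)]
toy_b = [(1, 1, 1, 1, 1), (2, 2, 2, 9, 2), (7, 7, 7, 7, 7), (9, 3, 3, 3, 3)]
toy_links = multipass_link(toy_a, toy_b)
toy_precision = true_pass_precision(toy_links, n0=2)
```

In the toy example the second matched pair disagrees on $V_{4}$ and is therefore recovered in Pass 2, while the unmatched third record of FileA is falsely linked in Pass 5, where $V_{1}$ is dropped from the key.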

4.2. Simulation Results

The results are presented in Figures 1 and 2, while Tables 1 and 2 summarize the Root Mean Squared Errors (RMSE) for each pass, where RMSE1 refers to the analytic estimator and RMSE2 to the replication estimator.

Figure 1.

Simulation results for linkage scenarios 1 to 3, comparing the true pass precision and the precision estimates using the formula and the replication approach.

Figure 2.

Simulation results for linkage scenarios 4 to 6, comparing the true pass precision and the precision estimates using the formula and the replication approach.

Table 1.

RMSEs of the Pass Precision Estimators for Linkage Scenarios 1 to 3.

	Scenario 1		Scenario 2		Scenario 3
Pass	RMSE1	RMSE2	RMSE1	RMSE2	RMSE1	RMSE2
Pass 1	0.001	0.007	0.009	0.013	0.009	0.010
Pass 2	0.015	0.021	0.035	0.041	0.142	0.092
Pass 3	0.0058	0.008	0.027	0.018	0.066	0.021
Pass 4	0.023	0.013	0.052	0.022	0.186	0.048
Pass 5	0.099	0.016	0.295	0.050	0.114	0.022

Table 2.

RMSEs of the Pass Precision Estimators for Linkage Scenarios 4 to 6.

	Scenario 4		Scenario 5		Scenario 6
Pass	RMSE1	RMSE2	RMSE1	RMSE2	RMSE1	RMSE2
Pass 1	0.092	0.034	0.029	0.011	0.186	0.051
Pass 2	0.238	0.086	0.298	0.098	0.359	0.073
Pass 3	0.218	0.050	0.152	0.044	0.335	0.052
Pass 4	0.276	0.056	0.284	0.050	0.330	0.072
Pass 5	0.325	0.078	0.183	0.075	0.368	0.065

The performance of the analytic estimator was highly dependent on the distribution of linking variables. When linking variables were uniformly or nearly uniformly distributed (Scenarios 1 and 2), the analytic estimator provided accurate estimates of pass precision in the first three passes. In Scenario 1, for instance, the estimated pass precision for Pass 1 was 0.981, closely matching the true value of 0.983. In later passes, however, the accuracy of the analytic estimator deteriorated. In Pass 5, the true precision was 0.632, but the analytic estimate was 0.488. This underestimation suggests that the assumptions underlying the analytic estimator become less valid in later passes.

When linking variables were not uniformly distributed (Scenarios 3–6), the analytic estimator exhibited substantial bias. For example, in Scenario 3, the true pass precision for Pass 4 was 0.812, while the analytic estimate was 0.659. The degree of bias increased with the level of skewness in linking variable distributions, highlighting the limitations of the analytic estimator under more realistic data conditions.

In contrast, the replication estimator provided consistently accurate estimates across all scenarios and passes. Unlike the analytic estimator, its performance remained stable even in later passes, where precision estimation is typically more challenging. For instance, in Scenario 5, where linking variables were heavily skewed, the estimated precisions for Passes 1 to 5 using the replication approach were very close to the true values, demonstrating that the replication approach remains robust even when agreement probabilities vary significantly across linking variables.

An analysis of the Root Mean Squared Errors (RMSE) further supports the superiority of the replication approach. The variation of the replication estimator has two sources: (i) the variability due to using a single replication, assuming the linkage parameters $(m, u, n_{0})$ are known, and (ii) the variability due to estimating these parameters from a match set rather than assuming them to be known. The RMSE values, presented in Tables 1 and 2, show that the errors of the analytic estimator tend to increase with the pass number, whereas those of the replication approach remain comparatively stable. In Scenario 3, the RMSE for the analytic estimator in Pass 4 was 0.186, compared to 0.048 for the replication approach, a nearly four-fold reduction in error. The replication approach has the lower RMSE in every pass of Scenarios 4 to 6 and in the later passes of Scenarios 1 to 3, reinforcing the conclusion that the replication approach is the preferred method for real-world linkage applications.
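For readers who wish to reproduce this kind of comparison, the following sketch shows one way an RMSE of this type could be computed. It assumes, as an illustration only, that the error in each simulation iteration is the difference between that iteration's estimate and its true pass precision; the function name `rmse` is hypothetical.

```python
import math


def rmse(estimates, truths):
    """Root mean squared error of per-iteration estimates against true values."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("estimates and truths must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


# Ratio of the two reported RMSEs for Scenario 3, Pass 4 (Table 1).
ratio = 0.186 / 0.048
```

The ratio of the two reported values is a little under four, which is the basis for the "nearly four-fold" comparison in the text.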

Examining the trends across passes, Pass 1 consistently exhibited the highest precision across all scenarios, with values exceeding 0.95 in most cases. As expected, pass precision declined in later passes as the linking keys became weaker and agreement on these keys became less reliable. In particular, the impact of skewed linking variables on precision decline was evident in Scenario 6, where precision for Pass 5 dropped to 0.512, compared to 0.682 in Scenario 1. This suggests that the effectiveness of a given linking strategy is strongly influenced by the distribution of linking variable values.
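One way to see why later linking keys are weaker is to compute the probability that a non-matching pair agrees on every variable in a key. The sketch below does this for Scenario 1, assuming agreement is independent across linking variables (which holds in this simulation design because the variables are generated independently) and reading the second $u$ entry as 0.202. It is an illustration of key strength only, not the analytic estimator of Equations (1) and (2).

```python
from math import prod

# Scenario 1 values of u_l for V1..V5 (second entry read as 0.202).
U_SCENARIO_1 = (0.005, 0.202, 0.067, 0.500, 0.0028)

# Linking keys of the five passes, as 1-based variable numbers.
PASS_KEYS = {
    1: (1, 2, 3, 4, 5),
    2: (1, 2, 3, 5),
    3: (1, 3, 4, 5),
    4: (1, 2, 4, 5),
    5: (2, 3, 4, 5),
}


def chance_agreement(u, key):
    """Probability a non-matching pair agrees on all key variables, assuming independence."""
    return prod(u[l - 1] for l in key)


chance = {p: chance_agreement(U_SCENARIO_1, key) for p, key in PASS_KEYS.items()}
```

Under these assumptions the chance-agreement probability rises with each pass, and the Pass 5 key, which omits $V_{1}$, is roughly 200 times more likely to agree by chance than the Pass 1 key.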

Overall, the results demonstrate that while the analytic estimator is useful in early passes when linking variables are uniformly distributed, it becomes unreliable in later passes or when linking variables exhibit skewed distributions. In contrast, the replication approach consistently provides accurate pass precision estimates across all passes and scenarios, making it a more robust alternative for real-world deterministic linkage applications.

We note that this simulation considered only exact comparators. Using approximate comparators can increase the RMSEs of both estimators, although we believe the replication method would still perform reasonably well. It should also be noted that using a single bootstrap replicate may yield more variable precision estimates than averaging over several replicates. However, the trade-off between computational cost and accuracy makes a single replicate a practical choice for this study.

While our simulations focused on datasets with approximately 2,000 records per file, the D-MAC algorithm is designed to efficiently scale to larger datasets containing hundreds of thousands or millions of records. Its deterministic nature reduces the need for computationally expensive iterative processes inherent in probabilistic approaches, making it a practical choice for large-scale linkage projects. The replication estimator’s uncertainty increases as the number of passes grows because later passes contain fewer links, amplifying variance. Similarly, for larger datasets, the estimator remains stable if match-set estimation is accurate, but computational feasibility may require adjustments such as stratified match-set selection. Future work could extend our analysis to include simulations on larger datasets to empirically validate the scalability of D-MAC.

5. Conclusion and Future Work

Multi-pass deterministic linkage is commonly used by statistical agencies to improve linkage coverage when a single linking key is insufficient. To estimate the precision of links made in each pass, we introduced two estimators. The first, an analytic estimator, assumes all linking variables are uniformly distributed. The second works by replicating the entire linkage process, including the values of linking variables on one of the files. Simulations showed that the replication estimator worked well for all passes in all linkage scenarios, and that the analytic estimator worked well when the linking variables were uniformly distributed.

An important direction for future work is to evaluate the performance of our estimators on real-world datasets. While the current study is based on simulations designed to reflect realistic linkage conditions, applying D-MAC to real-life data would provide more insights into its practical utility. In addition, a formal cost-effectiveness comparison between our method and traditional probabilistic linkage approaches with clerical review would be valuable, particularly given that our method relies less on manual review resources and supports more efficient estimation of precision.

Future work could also investigate the impact of multiple replicates on the root mean squared error (RMSE) of the precision estimates, which would provide a more comprehensive understanding of the trade-offs between computational cost and estimator accuracy.

Acknowledgements

The authors would like to thank the associate editor and the anonymous referees for their valuable comments on an earlier version of this paper. We also wish to thank Daniel Elazar, Aymon Wuolanne, Luke Hendrickson, and Anders Holmberg for their helpful feedback, and Noel Hansen for his support with D-MAC.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Disclaimer

Views expressed in this paper are those of the authors and do not necessarily represent those of the Australian Bureau of Statistics. Where quoted or used, they should be attributed clearly to the author.

ORCID iDs

Yue Ma

James Chipperfield

Received: May 31, 2024

Accepted: September 15, 2025

References

Australian Bureau of Statistics. 2019. “Information Paper: Australian Census Longitudinal Dataset, Methodology and Quality Assessment, 2006-2016.” ABS Catalogue No. 2080.5.

Chambers. 2009. “Regression Analysis of Probability-Linked Data.” Statisphere Official Statistics Research Series 4. https://statsnz.contentdm.oclc.org/digital/api/collection/p20045coll4/id/288/download.

Chipperfield, J. O. 2019. “A Weighting Approach to Making Inference with Probabilistically Linked Data.” Statistica Neerlandica 73 (3): 333–50. DOI: https://doi.org/10.1111/stan.12172.

Chipperfield, J. O., and Chambers, R. L. 2015. “Using the Bootstrap to Account for Linkage Errors When Analysing Probabilistically Linked Categorical Data.” Journal of Official Statistics 31 (3): 397–414. DOI: https://doi.org/10.1515/jos-2015-0024.

Chipperfield, J. O., Hansen, and Rossiter. 2018. “Estimating Precision and Recall for Deterministic and Probabilistic Record Linkage.” International Statistical Review 2: 219–36. DOI: https://doi.org/10.1111/insr.12246.

Clark, D. E., and Hahn, D. R. 1995. “Comparison of Probabilistic and Deterministic Record Linkage in the Development of a Statewide Trauma Registry.” Proceedings of the Annual Symposium on Computer Application in Medical Care.

Doidge, J. C., and Harron. 2018. “Demystifying Probabilistic Linkage: Common Myths and Misconceptions.” International Journal of Population Data Science 3 (1): 410. DOI: https://doi.org/10.23889/ijpds.v3i1.410.

DuVall, S. L., Kerber, R. A., and Thomas. 2010. “Extending the Fellegi-Sunter Probabilistic Record Linkage Method for Approximate Field Comparators.” Journal of Biomedical Informatics 43: 24–30. DOI: https://doi.org/10.1016/j.jbi.2009.08.004.

Fellegi, I. P., and Sunter, A. B. 1969. “A Theory for Record Linkage.” Journal of the American Statistical Association 64: 1183–210. DOI: https://doi.org/10.1080/01621459.1969.10501049.

Goldstein, Harron, and Cortina-Borja. 2017. “A Scaling Approach to Record Linkage.” Statistics in Medicine 36: 2514–21. DOI: https://doi.org/10.1002/sim.7287.

Gomatam, Carter, Ariet, and Mitchell. 2002. “An Empirical Comparison of Record Linkage Procedures.” Statistics in Medicine 21: 1485–96. DOI: https://doi.org/10.1002/sim.1147.

Kim and Chambers. 2010. “Regression Analysis for Longitudinally Linked Data.” Working Paper 22-10, Centre for Statistical and Survey Methodology, University of Wollongong.

Lahiri and Larsen, M. D. 2005. “Regression Analysis with Linked Data.” Journal of the American Statistical Association 100: 222–30. DOI: https://doi.org/10.1198/016214504000001277.

Lee, Zhang, L.-C., and Kim, J. K. 2022. “Maximum Entropy Classification for Record Linkage.” Survey Methodology 48: 1–23. https://www150.statcan.gc.ca/n1/pub/12-001-x/2022001/article/00007-eng.htm.

Moretti and Shlomo. 2023. “Improving Probabilistic Record Linkage Using Statistical Prediction Models.” International Statistical Review 91 (3): 368–94. DOI: https://doi.org/10.1111/insr.12535.

Sadinle and Fienberg, S. E. 2013. “A Generalized Fellegi–Sunter Framework for Multiple Record Linkage with Application to Homicide Record Systems.” Journal of the American Statistical Association 108 (502): 385–97. DOI: https://doi.org/10.1080/01621459.2012.757231.

Scheuren and Winkler, W. E. 1993. “Regression Analysis of Data Files That Are Computer Matched.” Survey Methodology 19: 39–58. https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X199300114476.

Tancredi and Liseo. 2011. “A Hierarchical Bayesian Approach to Record Linkage and Population Size Problems.” Annals of Applied Statistics 5: 1553–85. DOI: https://doi.org/10.1214/10-AOAS447.

Vo, T. H., Chauvet, Happe, Oger, Paquelet, and Garès. 2021. “Extending the Fellegi-Sunter Record Linkage Model for Mixed-Type Data with Application to the French National Health Data System.” Hal-03290773. https://hal.archives-ouvertes.fr/hal-03290773.

Zhu, Matsuyama, Ohashi, and Setoguchi. 2015. “When to Conduct Probabilistic Linkage vs. Deterministic Linkage? A Simulation Study.” Journal of Biomedical Informatics 56: 80–6. DOI: https://doi.org/10.1016/j.jbi.2015.05.012.