Models for Merging Passes and Creating Linked Datasets

LinkSolv Statistical Models for Merging Passes and Creating Linked Datasets

Limited Bayesian Model

Set Cutoff Probability = 0.90 or some other high value in the Specify Match dialog. Set Linked Datasets = 1 on the Merge Passes tab. LinkSolv uses your estimates of total matches, error probabilities, and frequencies of data values in all probability calculations. All candidate pairs over the specified cutoff probability are accepted as linked pairs and assigned Keep Status = LP.

Take All Pairs

Use this method if you expect most true links to be many-to-many rather than one-to-one. Set Cutoff Probability = 0.01 or some other low value so that almost all true links are over the cutoff. Set Pairs to Analyze = Take All Pairs on the Merge Passes tab. Set Linked Datasets to a value greater than 1; 3 to 5 imputations are usually enough to capture uncertainty about linked pairs when you analyze linked datasets.

LinkSolv starts with your prior estimates of total matches, error probabilities, and frequencies of data values, then uses Markov Chain Monte Carlo (MCMC) iterative methods to calculate posterior estimates that take observed values into account. Set Number of Iterations high enough that the MCMC posterior estimates converge to stable values; 5 to 10 iterations are usually enough, and you can always add iterations after reviewing the Bayesian Model Check report. LinkSolv actually runs twice as many iterations as you specify and then discards the first half as “burn in,” because those early iterations are likely to be too dependent on prior estimates.

Specify Standard Deviation and Standard Error to quantify uncertainty about your Total Matches estimate. These parameters can be calculated from the sample of values you used to estimate Total Matches, where Standard Error = Standard Deviation / Square Root of the Number of Values. Use Standard Deviation = 10% of Total Matches if the estimate is based on only one value.
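As an illustration of the Standard Deviation and Standard Error guidance, here is a minimal Python sketch. The function name `total_matches_uncertainty` is hypothetical and not part of LinkSolv, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which form is intended.

```python
import math
import statistics

def total_matches_uncertainty(estimates):
    """Return (mean, standard deviation, standard error) for a list of
    Total Matches estimates. Hypothetical helper, not a LinkSolv function."""
    mean = statistics.mean(estimates)
    if len(estimates) == 1:
        # Rule of thumb from the text: 10% of Total Matches for a single value.
        sd = 0.10 * mean
    else:
        # Assumption: sample standard deviation (n - 1 in the denominator).
        sd = statistics.stdev(estimates)
    # Standard Error = Standard Deviation / Square Root of the Number of Values.
    se = sd / math.sqrt(len(estimates))
    return mean, sd, se

if __name__ == "__main__":
    print(total_matches_uncertainty([900, 1000, 1100]))
    print(total_matches_uncertainty([1000]))
```

With a single estimate, the standard error equals the standard deviation because the square root of 1 is 1.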

Take 1-1 Pairs, Take Max Pairs (new label for Take 1-1 Pairs), Draw 1-1 Pairs, or Take LSAP Pairs

Use one of these methods if you expect most true links to be one-to-one rather than many-to-many. Set Cutoff Probability = 0.01 or some other low value so that almost all true links are over the cutoff. Set Pairs to Analyze to one of these values on the Merge Passes tab. Set other merge parameters following the guidelines for Take All Pairs. All of these methods group many-to-many pairs into sets based on having a common record: if two pairs have the same record from table A or from table B, they are assigned to the same set. The methods differ in how one-to-one pairs are selected from each set, but the selected one-to-one pairs are always assigned Keep Status = LP and all others get Keep Status = IP.
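The grouping rule described above can be sketched in Python with a small union-find structure. This is only an illustration: the function name `group_pairs_into_sets` is hypothetical, and it assumes that set membership is transitive (pairs chained together through shared records all land in one set), which the text implies but does not state outright.

```python
from collections import defaultdict

def group_pairs_into_sets(pairs):
    """Group (a_id, b_id) pairs so that pairs sharing a table A record or a
    table B record end up in the same set. Hypothetical helper."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Tag ids with their table so an A id never collides with a B id.
    for a, b in pairs:
        union(("A", a), ("B", b))

    sets = defaultdict(list)
    for a, b in pairs:
        sets[find(("A", a))].append((a, b))
    return list(sets.values())

if __name__ == "__main__":
    example = [("A1", "B1"), ("A1", "B2"), ("A2", "B2"), ("A3", "B3")]
    for group in group_pairs_into_sets(example):
        print(group)
```

In the example, the first three pairs form one set because they are chained through A1 and B2, while (A3, B3) forms a set of its own.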

Take Max Pairs includes the pair with maximum probability from each set in the one-to-one linkage and excludes competing pairs (those that share a record with it) with lower probabilities. The process repeats on the remaining pairs until all pairs in all sets have Keep Status = LP or IP.
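A greedy selection of this kind might look like the following Python sketch. The function name `take_max_pairs` and the input layout are assumptions for illustration, and the handling of tied probabilities (first in sorted order wins) is a guess, since the text does not describe tie-breaking.

```python
def take_max_pairs(pairs):
    """pairs: list of (a_id, b_id, probability).
    Returns {(a_id, b_id): "LP" or "IP"}. Hypothetical helper."""
    status = {}
    used_a, used_b = set(), set()
    # Visit pairs from highest to lowest probability. Because sets do not
    # share records, one global pass is equivalent to a pass per set.
    for a, b, _prob in sorted(pairs, key=lambda t: t[2], reverse=True):
        if a in used_a or b in used_b:
            status[(a, b)] = "IP"  # competes with an already accepted pair
        else:
            status[(a, b)] = "LP"  # best remaining pair for both records
            used_a.add(a)
            used_b.add(b)
    return status

if __name__ == "__main__":
    example = [("A1", "B1", 0.9), ("A1", "B2", 0.8),
               ("A2", "B2", 0.7), ("A2", "B1", 0.6)]
    print(take_max_pairs(example))
```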

Draw 1-1 Pairs extends the original Bayesian model so that one-to-one pairs are drawn as part of the overall probability model. This method is preferred by theorists, particularly if you plan to analyze multiple imputations using IVEWARE or SAS PROC MIANALYZE. LinkSolv calculates the probability of each one-to-one permutation of records in each set. For example, if a set includes records A1 and A2 from table A and records B1 and B2 from table B then the one-to-one permutations are (A1, B1; A2, B2) and (A1, B2; A2, B1), either of which might be drawn in each iteration.
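To make the permutation example concrete, the sketch below enumerates the one-to-one permutations of a set and draws one of them. The function names are hypothetical, the set is assumed to have equal numbers of records from each table, and the permutation probabilities are supplied by the caller because the text does not say how LinkSolv computes them.

```python
import itertools
import random

def one_to_one_permutations(a_records, b_records):
    """List every complete one-to-one pairing of a_records with b_records.
    Assumes both lists have the same length. Hypothetical helper."""
    return [tuple(zip(a_records, perm))
            for perm in itertools.permutations(b_records)]

def draw_permutation(permutations, probabilities, rng=random):
    """Draw one permutation according to caller-supplied probabilities."""
    return rng.choices(permutations, weights=probabilities, k=1)[0]

if __name__ == "__main__":
    perms = one_to_one_permutations(["A1", "A2"], ["B1", "B2"])
    for perm in perms:
        print(perm)
    # Illustrative probabilities only; a different draw may occur each iteration.
    print(draw_permutation(perms, [0.7, 0.3]))
```

The number of permutations grows factorially with set size, so full enumeration like this is only practical for small sets.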

Take LSAP Pairs treats selecting linked pairs as a Linear Sum Assignment Problem (LSAP). Given the probabilities of each one-to-one permutation calculated for Draw 1-1 Pairs, Take LSAP Pairs takes the one-to-one permutation which maximizes the sum of match weights, which is the same as the permutation with maximum probability.
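The assignment idea can be illustrated with a brute-force Python sketch that tries every one-to-one permutation and keeps the one with the largest sum of match weights. The function name `lsap_best_assignment` and the square weight matrix are assumptions for illustration; this is not LinkSolv's solver, and brute force is only practical for small sets.

```python
import itertools

def lsap_best_assignment(weights):
    """weights[i][j] is the match weight for table A record i and table B
    record j (square matrix assumed). Returns (assignment, total), where
    assignment[i] is the B index chosen for A index i. Hypothetical helper."""
    n = len(weights)
    best_perm, best_total = None, float("-inf")
    for perm in itertools.permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total

if __name__ == "__main__":
    # A greedy choice would take the 5.0 pair first and end with a sum of 6.0;
    # the assignment that maximizes the total picks the two off-diagonal pairs.
    example = [[5.0, 4.5],
               [4.8, 1.0]]
    print(lsap_best_assignment(example))
```

The example shows how this can differ from a greedy selection: the single heaviest pair is left out because the other two pairs together have a larger total weight.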

Best Pairs

LinkSolv ranks pairs in sets by probability from greatest to least and then selects as many pairs from the top of the list as possible given your specified False Positive Rate. Remember that Match Probability = 0.9 means that 9 out of 10 such links are true and 1 out of 10 is false. So, each 0.9 link in the list contributes 0.9 links toward Expected True Positives and 0.1 links toward Expected False Positives, and similarly for other probabilities. For a given sample of pairs from the top, LinkSolv estimates False Positive Rate = Expected False Positives / (Expected False Positives + Expected True Positives).
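The False Positive Rate arithmetic above can be worked through in a short Python sketch. The function name `select_best_pairs` is hypothetical, and the stopping rule (keep the longest top-of-list prefix whose estimated rate does not exceed the target) is one reading of "as many pairs from the top of the list as possible".

```python
def select_best_pairs(probabilities, max_false_positive_rate):
    """Return (selected_probabilities, estimated_rate) for the longest prefix
    of the ranked list whose estimated False Positive Rate stays within the
    target. Hypothetical helper."""
    ranked = sorted(probabilities, reverse=True)
    expected_tp = 0.0
    expected_fp = 0.0
    n_selected = 0
    selected_rate = 0.0
    for count, p in enumerate(ranked, start=1):
        expected_tp += p          # each link adds p expected true positives
        expected_fp += 1.0 - p    # and (1 - p) expected false positives
        rate = expected_fp / (expected_fp + expected_tp)
        if rate > max_false_positive_rate:
            break  # lower-ranked pairs can only raise the rate further
        n_selected = count
        selected_rate = rate
    return ranked[:n_selected], selected_rate

if __name__ == "__main__":
    probs = [0.99, 0.95, 0.9, 0.6, 0.3]
    print(select_best_pairs(probs, max_false_positive_rate=0.06))
```

For the sample list, the top three pairs give about 0.16 expected false positives out of 3 links, a rate of roughly 0.053; adding the 0.6 pair would push the rate to 0.14, so selection stops at three.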

Maximum Likelihood Linkage

LinkSolv calculates the likelihood of the Keep Status of all merged pairs for each iteration after burn-in. Each iteration produces an unbiased draw from a stable posterior distribution. Each merged pair contributes one of two possible values to the likelihood. If Keep Status = LP then the contribution is the Match Probability for the pair. Otherwise, the contribution is (1 – Match Probability). LinkSolv compares the likelihoods for all iterations and saves the one draw with the maximum likelihood as the Maximum Likelihood Estimate (MLE). The MLE may not be one of the final imputations of linked pairs.
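The comparison of draws can be sketched in Python as follows. The function names and the data layout are hypothetical, and the sketch assumes that the likelihood of a draw is the product of the per-pair contributions, computed here as a sum of logarithms for numerical stability; the text does not state how contributions are combined.

```python
import math

def log_likelihood(draw):
    """draw: list of (keep_status, match_probability), one entry per merged
    pair. Hypothetical helper; assumes contributions are multiplied."""
    total = 0.0
    for status, p in draw:
        contribution = p if status == "LP" else 1.0 - p
        if contribution <= 0.0:
            return float("-inf")  # an impossible draw under these probabilities
        total += math.log(contribution)
    return total

def maximum_likelihood_draw(draws):
    """Return the index of the post-burn-in draw with the greatest likelihood."""
    return max(range(len(draws)), key=lambda i: log_likelihood(draws[i]))

if __name__ == "__main__":
    draws = [
        [("LP", 0.9), ("LP", 0.4)],  # likelihood 0.9 * 0.4 = 0.36
        [("LP", 0.9), ("IP", 0.4)],  # likelihood 0.9 * 0.6 = 0.54
    ]
    print(maximum_likelihood_draw(draws))
```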
