LinkSolv counts the number of records in each file that agree within a specified tolerance by testing possible combinations of values: all values from the first file against as many of the most common values from the second file as practical. When a data field has a small number of values, all possible combinations are tested (365 dates × 365 dates = 133,225 combinations). Testing all combinations is not practical for fields with large numbers of values (30,000 birth dates × 30,000 birth dates = 900,000,000 combinations; 100,000 first names × 100,000 first names = 10,000,000,000 combinations). In these cases, LinkSolv tests only a sample of all possible pairs. Sometimes an alternative linkage model can reduce the number of combinations to test. Birth dates could be separated into year, month, and day, each of which has a small number of values. First names could be standardized as SOUNDEX codes: 100,000 different first names might produce 2,500 different SOUNDEX codes (2,500 codes × 2,500 codes = 6,250,000 combinations).
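The arithmetic behind these combination counts can be checked with a short sketch. This is only an illustration, not LinkSolv code: the helper `pair_count` is a hypothetical name, and the figure of 83 distinct birth years is an assumption (30,000 distinct dates at 365 per year is roughly 82 years).

```python
def pair_count(n_first: int, n_second: int) -> int:
    """Combinations when every distinct value in one file is tested
    against every distinct value in the other file."""
    return n_first * n_second


# Figures quoted in the text.
day_of_year = pair_count(365, 365)          # 133,225
birth_dates = pair_count(30_000, 30_000)    # 900,000,000
first_names = pair_count(100_000, 100_000)  # 10,000,000,000
soundex_codes = pair_count(2_500, 2_500)    # 6,250,000

# Assumption for illustration: 30,000 distinct birth dates span about 83 years.
date_parts = {"year": 83, "month": 12, "day": 31}

# Each date part is compared separately, so the counts add rather than multiply.
split_dates = sum(pair_count(n, n) for n in date_parts.values())  # 6,889 + 144 + 961 = 7,994

for label, n in [
    ("days of the year", day_of_year),
    ("full birth dates", birth_dates),
    ("birth dates split in parts", split_dates),
    ("first names", first_names),
    ("SOUNDEX codes", soundex_codes),
]:
    print(f"{label:<28}{n:>16,}")
```

Under these assumptions, comparing year, month, and day separately needs a few thousand tests instead of 900 million, which is why splitting a field can make exhaustive testing practical again.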
LinkSolv calculates match probabilities by comparing data values on candidate record pairs and applying Bayes' Rule: agreements increase probabilities and disagreements decrease them. Agreements on rare values increase probabilities much more than agreements on common values because agreement by chance on a rare value is much less likely. Accepting close values as agreements is often a good strategy for finding many more true links. For example, many true links might have dates that differ by one day or names that differ by one typo. However, a tolerance also increases the number of false links that show agreements. LinkSolv determines, by counting, just how rare or common chance agreements are. For example, the first names WALTER, WALSER, WALDER, WALKER, and WALLER all agree within one typo, but exact agreement by chance on WALTER might be about 1,000,000 times more likely than exact agreement by chance on WALLER. Agreement by chance within one typo on WALLER would be much more likely than exact agreement because WALLER–WALTER pairs are much more common than WALLER–WALLER pairs. On the other hand, agreement by chance within one typo on WALTER would not be much more likely than exact agreement because WALTER–WALLER pairs are much less common than WALTER–WALTER pairs.
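The WALTER and WALLER comparison can be reproduced by counting pairs, as in the minimal sketch below. It is not LinkSolv's implementation. The name counts are invented for illustration, "one typo" is assumed to mean a single substituted, inserted, or deleted character, and the prior (0.01) and the probability that a true link agrees (0.95) are assumed values. The function names `within_one_typo`, `chance_pairs`, and `posterior` are hypothetical.

```python
from collections import Counter


def within_one_typo(a: str, b: str) -> bool:
    """True if a and b are equal or differ by one substitution,
    insertion, or deletion (an assumed definition of 'one typo')."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def chance_pairs(value, first, second, agree) -> int:
    """Count record pairs (one record from each file) whose first-file
    name is `value` and whose second-file name agrees with it."""
    return first[value] * sum(n for other, n in second.items() if agree(value, other))


def posterior(prior: float, m: float, u: float) -> float:
    """Bayes' Rule for one observed agreement.
    m = P(agree | true link), u = P(agree | chance)."""
    return prior * m / (prior * m + (1 - prior) * u)


def exact(a: str, b: str) -> bool:
    return a == b


# Invented counts, 10,000 records per file, same distribution in both files.
names = Counter({"WALTER": 1000, "WALKER": 50, "WALLER": 1,
                 "WALSER": 1, "WALDER": 1, "JOHN": 8947})

exact_walter = chance_pairs("WALTER", names, names, exact)            # 1,000,000 pairs
exact_waller = chance_pairs("WALLER", names, names, exact)            # 1 pair
typo_walter = chance_pairs("WALTER", names, names, within_one_typo)   # 1,053,000 pairs
typo_waller = chance_pairs("WALLER", names, names, within_one_typo)   # 1,053 pairs

print(exact_walter // exact_waller)  # 1000000
print(typo_waller // exact_waller)   # 1053
print(typo_walter / exact_walter)    # 1.053

# Chance agreement probability for one record, then the Bayes update.
total = sum(names.values())
u_walter = names["WALTER"] / total
u_waller = names["WALLER"] / total
print(posterior(0.01, 0.95, u_walter), posterior(0.01, 0.95, u_waller))
```

With these invented counts, the tolerance multiplies chance agreements on WALLER by about 1,000 but raises those on WALTER by only about 5 percent, and an exact agreement on the rare name moves the match probability far more than one on the common name.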