LinkSolv 8.3.0855 has an improved algorithm for agreements within a tolerance. Ideally, we would compare every value in a frequency table for Source A with every value in the corresponding frequency table for Source B to find which values agree, given a specified comparison method and tolerance. This is called a CROSS JOIN between the two tables. Probabilities for all agreeing values in frequency table B are summed (‘Sum Prob B’) for each value in frequency table A for match probability calculations. However, CROSS JOINs can take very long to run even if the database software doesn’t complain: 365 Dates X 365 Dates = 133,225 comparisons; 1,440 Times X 1,440 Times = 2,073,600 comparisons; 36,500 Birth Dates X 36,500 Birth Dates = 1,332,250,000 comparisons.
In the old algorithm, LinkSolv first compared all values in frequency table A against all values in frequency table B for EXACT equality (this is not a CROSS JOIN so relatively quick). Second, all values in frequency table A were compared against a subset of frequency table B for additional agreements that were not EXACT. The subset was chosen so that the number of comparisons was less than 10,000,000 and included as many of the most common B values as possible — these are the values that make the greatest contribution to Sum Prob B.
For some distributions, the 10,000,000 comparisons limit resulted in small subsets. For frequency tables with, say, about 300,000 different names the subset of B would include only about 33 of the most common names (300,000 X 33 = 9,900,000 comparisons). This happens to work well for PREFIX comparisons of names to initials because many initials show up in the top 33 values, but other situations are not as lucky.
Simply increasing the maximum number of comparisons to 20,000,000 or 40,000,000 was not feasible — 10,000,000 comparisons already take a fairly long time on many computers and 20,000,000 would takes for than twice as long. Also, when Microsoft Windows or Access runs out of system resources and gives up it is usually after hours of chugging along.
In the new algorithm, LinkSolv runs several CROSS JOINs successively in a loop, each comparing a subset of values in frequency table B to all values in frequency table A. Subsets are chosen so that the number of comparisons in each CROSS JOIN is less than 4,000,000 and takes a few minutes to run. Also, subsets are chosen so that values with highest probabilities that have not yet been compared are selected first. LinkSolv limits total run time for all CROSS JOINs to about 20 minutes — about 10 CROSS JOINs. LinkSolv also keeps track of the corresponding fraction of all B records compared. For example, if the most common B value had probability 0.10, the second had 0.05, and the third had 0.01 then comparing just these three values would be equivalent to comparing 16% (10% + 5% + 1%) of all B records.
If not all B values can be compared in 20 minutes then LinkSolv adds an average contribution to Sum Prob B for all uncompared values. The average contribution is based on statistics collected for all CROSS JOINs that were run.