My specific problem is as follows:
- Given two lists of texts (on the order of 5 to 50 items each)
- Find best matching pairs with their corresponding matching score (weight)
- Where each item can only be matched once with another item (from the other list)
- While minimizing the number of weight calculations (Levenshtein edit distance in my case)
- Every score can be between 0.0 (no match) and 1.0 (perfect match)
- I may only need to match pairs that meet a certain threshold
Because the edit distance is quite expensive to calculate, it is the bottleneck. At the same time, I may be able to exclude some edges by calculating a cheap upper bound on their score (if that bound doesn't meet the threshold, the exact score can't either).
I guess this could be a special case of the assignment problem.
Assume the size of list 1 is n and the size of list 2 is m.
It seems that in order to apply most common algorithms, one needs to calculate the scores between all possible pairs,
i.e. in this case n * m scores.
One approach I have tried so far is:
- Copy `list 2` to `remaining list 2`
- For each `item 1` in `list 1`
  - Find best match meeting threshold of `item 1` in `list 1`:
    - Copy `remaining list 2` to `temp list 2`
    - Sort `temp list 2` by `approximate score` between `item 1` and each item in `temp list 2` (in descending order)
    - Remove items from `temp list 2` where approximate score is below threshold
    - For each `item 2` in `temp list 2`:
      - Calculate an `expensive score` between `item 1` and `item 2`
      - If `expensive score` is `1.0`: Return pair between `item 1` and `item 2`
      - If `expensive score` is meeting threshold: Save `item 2` as `best matching item 2`
    - Return pair between `item 1` and `best matching item 2`
  - Remove `item 2` of matched pair from `remaining list 2`
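For concreteness, the steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: `approximate_score`, `expensive_score` and `greedy_match` are hypothetical names, `difflib`'s `ratio()` stands in for my real edit-distance score, and `quick_ratio()` stands in for the cheap upper bound. Two details go slightly beyond the steps as listed: the best score seen so far is tracked explicitly, and the inner loop stops as soon as the next upper bound can no longer beat it.

```python
from difflib import SequenceMatcher


def approximate_score(a, b):
    # Cheap upper bound on the expensive score (stand-in: difflib's quick_ratio).
    return SequenceMatcher(None, a, b).quick_ratio()


def expensive_score(a, b):
    # Stand-in for the real edit-distance-based score in [0.0, 1.0].
    return SequenceMatcher(None, a, b).ratio()


def greedy_match(list1, list2, threshold):
    """Greedily pair each item of list1 with its best remaining item of list2."""
    remaining = list(list2)
    pairs = []
    for item1 in list1:
        # Rank the remaining candidates by their cheap upper bound, best first.
        candidates = sorted(
            ((approximate_score(item1, item2), item2) for item2 in remaining),
            key=lambda c: c[0],
            reverse=True,
        )
        best_score, best_item = 0.0, None
        for bound, item2 in candidates:
            # Bounds are sorted in descending order, so once one is below the
            # threshold or cannot beat the best score so far, none after it can.
            # This also covers the "perfect match" shortcut (best_score == 1.0).
            if bound < threshold or bound <= best_score:
                break
            score = expensive_score(item1, item2)
            if score >= threshold and score > best_score:
                best_score, best_item = score, item2
        if best_item is not None:
            pairs.append((item1, best_item, best_score))
            remaining.remove(best_item)
    return pairs
```

Calling `greedy_match(list1, list2, threshold)` returns `(item 1, item 2, score)` tuples. The result depends on the order of `list 1`, which is exactly the weakness I would like to get rid of.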
I am thinking this can probably be improved by initially accepting the first pair found that meets the threshold,
and then later checking whether an unmatched item 1 can be matched with an already paired item 2, whereas that item 2 could also be matched with another item 1.
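This re-matching idea resembles the augmenting-path technique used for bipartite matching. Below is a minimal sketch of that technique with lazily computed, cached scores, not a tested solution to my problem. The names `match_with_reassignment`, `meets_threshold` and `try_assign` are hypothetical, and `difflib`'s `ratio()` is only a default stand-in for the expensive score. Note the assumption: this maximizes the number of pairs meeting the threshold, not the sum of their scores.

```python
from difflib import SequenceMatcher


def match_with_reassignment(list1, list2, threshold, score=None):
    """Pair as many items as possible, re-assigning earlier pairs when needed."""
    if score is None:
        # Stand-in for the real edit-distance-based score.
        score = lambda a, b: SequenceMatcher(None, a, b).ratio()
    cache = {}  # (i, j) -> expensive score, computed at most once per pair

    def meets_threshold(i, j):
        if (i, j) not in cache:
            cache[(i, j)] = score(list1[i], list2[j])
        return cache[(i, j)] >= threshold

    owner = {}  # index in list2 -> index in list1 currently paired with it

    def try_assign(i, visited):
        # Take the first acceptable item 2; if it is already paired, try to
        # move its current partner to a different item 2 (augmenting path).
        for j in range(len(list2)):
            if j in visited or not meets_threshold(i, j):
                continue
            visited.add(j)
            if j not in owner or try_assign(owner[j], visited):
                owner[j] = i
                return True
        return False

    for i in range(len(list1)):
        try_assign(i, set())
    # Report pairs as (index in list1, index in list2, score).
    return sorted((i, j, cache[(i, j)]) for j, i in owner.items())
```

A cheap upper bound could be checked inside `meets_threshold` before the expensive call, so that pairs which cannot reach the threshold are never scored exactly.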
Do any existing algorithms address a similar problem?
Additional Note: Edit Distance Approximations
Currently I am using the following two approximations (both used by Python's difflib):
- Based on the length, we can calculate the maximum score we could get (e.g. if `item 1` is only half the length of `item 2` then the score can't be more than `0.5`)
- Intersection of character counts
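As a rough sketch, the two bounds could look like this. It assumes the score is normalized as `1 - distance / max(len(a), len(b))`, which is consistent with the `0.5` example above; the function names are hypothetical. `difflib`'s own `real_quick_ratio()` and `quick_ratio()` follow the same two ideas but normalize by the combined length of both strings, so their values differ.

```python
from collections import Counter


def length_upper_bound(a, b):
    # At least |len(a) - len(b)| edits are needed, so the score cannot
    # exceed (shorter length) / (longer length).
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0  # two empty strings are identical
    return min(len(a), len(b)) / longer


def char_count_upper_bound(a, b):
    # Count the characters the two strings share as multisets. Every
    # character of the longer string that has no counterpart costs at
    # least one edit, so the score cannot exceed common / (longer length).
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0
    common = sum((Counter(a) & Counter(b)).values())
    return common / longer
```

The character-count bound is never larger than the length bound, but it is more work to compute, so checking the length bound first and the character counts second keeps the cheapest test in front.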