Find approximate 'best' matching pairs by calculating the fewest possible weights

Question

My specific problem is as follows:

- Given two lists of texts (on the order of 5 to 50 items each)
- Find the best-matching pairs along with their matching scores (weights)
- Each item can be matched at most once with an item from the other list
- Minimize the number of weights that have to be calculated (Levenshtein edit distance in my case)
- Every score is between 0.0 (no match) and 1.0 (perfect match)
- I may only require pairs to be matched if they meet a certain threshold

Because the edit distance is quite expensive to calculate, it is the bottleneck. At the same time, I may be able to exclude some edges by first calculating a cheap upper bound on their score (and dropping them if that bound doesn't meet the threshold).

I guess this could be a special case of the assignment problem.

Assume the size of list 1 is n and the size of list 2 is m.

It seems that in order to apply the most common algorithms, one needs to calculate the scores for all possible pairs, i.e. n * m scores in this case.

One approach I have tried so far is:

Copy list 2 to remaining list 2
For each item 1 in list 1
- Find best match meeting threshold of item 1 in list 1:
  - Copy remaining list 2 to temp list 2
  - Sort temp list 2 by approximate score between item 1 and each item in temp list 2 (in descending order)
  - Remove items from temp list 2 where approximate score is below threshold
  - For each item 2 in temp list 2:
    - Calculate the expensive score between item 1 and item 2
    - If the expensive score is 1.0: Return pair between item 1 and item 2
    - If the expensive score meets the threshold and beats the best score so far: Save item 2 as best matching item 2
  - Return pair between item 1 and best matching item 2
- Remove item 2 of matched pair from remaining list 2
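
The steps above can be sketched roughly as follows in Python. This is only an illustration under some assumptions: the score is taken to be 1 - distance / max(len), the helper names (levenshtein, expensive_score, cheap_upper_bound, greedy_match) are made up for this example, and the extra early exit when the bound can no longer beat the best score found is an addition that is not in the list above.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (the expensive step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def expensive_score(a: str, b: str) -> float:
    """Assumed normalization: 1.0 is a perfect match, 0.0 is no match."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cheap_upper_bound(a: str, b: str) -> float:
    """Upper bound on expensive_score from shared character counts."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    common = sum((Counter(a) & Counter(b)).values())
    return common / longest


def greedy_match(list1, list2, threshold):
    """Greedy pairing: each item of list1 takes its best remaining partner."""
    remaining = list(list2)
    pairs = []
    for item1 in list1:
        # Sort candidates by approximate score, best first.
        candidates = sorted(
            ((cheap_upper_bound(item1, item2), item2) for item2 in remaining),
            key=lambda c: c[0],
            reverse=True,
        )
        best_item, best_score = None, 0.0
        for bound, item2 in candidates:
            # No later candidate can meet the threshold or beat the best so far.
            if bound < threshold or bound <= best_score:
                break
            score = expensive_score(item1, item2)
            if score >= threshold and score > best_score:
                best_item, best_score = item2, score
                if score == 1.0:
                    break
        if best_item is not None:
            pairs.append((item1, best_item, best_score))
            remaining.remove(best_item)
    return pairs
```

For example, greedy_match(["apple", "banana"], ["bananas", "aple", "zzz"], 0.6) pairs "apple" with "aple" and "banana" with "bananas" while only computing two edit distances. Note that the result still depends on the order of list 1, which is the weakness described next.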

I think this could probably be improved by initially accepting the first pair found that meets the threshold, and then later checking whether an unmatched item 1 can be matched with an already paired item 2, given that this item 2 could also have been matched with another item 1.

Do any existing algorithms address a similar problem?

Additional Note: Edit Distance Approximations

Currently I am using the following two approximations (both used by Python's difflib):

- Based on the lengths, we can calculate the maximum score we could get (e.g. if item 1 is only half the length of item 2, then the score can't be more than 0.5)
- Intersection of character counts
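
To make the two bounds concrete, here is a minimal sketch, assuming the score is defined as 1 - distance / max(len) for the Levenshtein distance. The function names length_bound and char_count_bound are hypothetical; in difflib the corresponding bounds are SequenceMatcher.real_quick_ratio() and SequenceMatcher.quick_ratio(), which bound ratio() and use a slightly different normalization.

```python
from collections import Counter


def length_bound(a: str, b: str) -> float:
    """The distance is at least the length difference, so score <= min / max."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else min(len(a), len(b)) / longest


def char_count_bound(a: str, b: str) -> float:
    """The distance is at least max(len) minus the number of shared
    characters (counted with multiplicity), so score <= shared / max(len)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / longest
```

Under this normalization the character-count bound is never larger than the length bound, since two strings cannot share more characters than the shorter one contains. The length bound is still cheaper to evaluate, so it can act as a first filter.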
