
My specific problem is as follows:

  • Given two lists of texts (on the order of 5 to 50 items each)
  • Find best matching pairs with their corresponding matching score (weight)
  • Where each item can only be matched once with another item (from the other list)
  • While minimizing the calculation of the weights (Levenshtein edit distance in my case)
  • Every score can be between 0.0 (no match) and 1.0 (perfect match)
  • I may only require pairs to be matched if they meet a certain threshold

Because the edit distance is quite expensive to calculate, it is the bottleneck. At the same time, I may be able to exclude some edges cheaply by calculating an upper bound on their score (and dropping them if the bound doesn't meet the threshold).

I guess this could be a special case of the assignment problem.

Assume the size of list 1 is n and the size of list 2 is m.

It seems that in order to apply most common algorithms, one needs to calculate the scores between all possible pairs, i.e. n * m scores in this case.

One approach I have tried so far is:

  • Copy list 2 to remaining list 2
  • For each item 1 in list 1
    • Find the best match for item 1 in remaining list 2 that meets the threshold:
      • Copy remaining list 2 to temp list 2
      • Sort temp list 2 by approximate score between item 1 and each item in temp list 2 (in descending order)
      • Remove items from temp list 2 where approximate score is below threshold
      • For each item 2 in temp list 2:
        • Calculate the expensive score between item 1 and item 2
        • If the expensive score is 1.0: Return pair between item 1 and item 2
        • If the expensive score meets the threshold and is higher than the best so far: Save item 2 as best matching item 2
      • Return pair between item 1 and best matching item 2
    • Remove item 2 of matched pair from remaining list 2
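The steps above can be sketched roughly as follows. This is only a minimal sketch under assumptions: difflib's `SequenceMatcher.ratio()` stands in for the expensive score (not the Levenshtein-based score from my actual code), `quick_ratio()` serves as the cheap approximate score, and the function name `greedy_match` and the default threshold are made up for illustration.

```python
from difflib import SequenceMatcher


def greedy_match(list1, list2, threshold=0.6):
    """Greedy pairing: each item of list1 takes its best remaining item of list2."""
    remaining = list(list2)  # copy list 2 to "remaining list 2"
    pairs = []
    for item1 in list1:
        # Cheap pass: quick_ratio() is an upper bound on ratio(), so anything
        # below the threshold here can be dropped without the expensive call.
        candidates = []
        for item2 in remaining:
            sm = SequenceMatcher(None, item1, item2, autojunk=False)
            approx = sm.quick_ratio()
            if approx >= threshold:
                candidates.append((approx, item2, sm))
        candidates.sort(key=lambda c: c[0], reverse=True)

        best_score, best_item = 0.0, None
        for approx, item2, sm in candidates:
            if approx <= best_score:
                break  # no later candidate can beat the current best
            score = sm.ratio()  # the expensive call (stand-in score)
            if score >= threshold and score > best_score:
                best_score, best_item = score, item2
            if score == 1.0:
                break  # perfect match, stop searching
        if best_item is not None:
            pairs.append((item1, best_item, best_score))
            remaining.remove(best_item)
    return pairs
```

Because the candidates are sorted by their upper bound, the inner loop can stop as soon as the next bound is no better than the best exact score found so far, which is where most of the expensive calls are saved.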

I think this can probably be improved by initially accepting the first pair found that meets the threshold, and then later checking whether an unmatched item 1 can be matched with an already paired item 2 whose current partner could in turn be matched with a different item.
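For illustration, that re-matching idea might look like the sketch below. As far as I can tell it resembles the augmenting-path technique for bipartite matching (often called Kuhn's algorithm), which is a swapped-in technique and not something from my current code. It maximizes the number of pairs meeting the threshold, not the total score. Again `SequenceMatcher.ratio()` is only a stand-in for the expensive score, and `match_with_rematching` is a hypothetical name.

```python
from difflib import SequenceMatcher
from functools import lru_cache


def match_with_rematching(list1, list2, threshold=0.6):
    """Pair as many items as possible, re-assigning earlier pairs when needed."""

    @lru_cache(maxsize=None)
    def score(i, j):
        # Scores are computed lazily and cached, so each pair costs at most
        # one expensive call, and pairs never looked at cost nothing.
        sm = SequenceMatcher(None, list1[i], list2[j], autojunk=False)
        if sm.quick_ratio() < threshold:  # cheap upper bound fails
            return 0.0
        return sm.ratio()  # expensive call (stand-in score)

    owner = {}  # index in list2 -> index in list1 currently paired with it

    def try_assign(i, visited):
        for j in range(len(list2)):
            if j in visited or score(i, j) < threshold:
                continue
            visited.add(j)
            # Take j if it is free, or if its current partner can move elsewhere.
            if j not in owner or try_assign(owner[j], visited):
                owner[j] = i
                return True
        return False

    for i in range(len(list1)):
        try_assign(i, set())
    return sorted((list1[i], list2[j], score(i, j)) for j, i in owner.items())
```

With `["hello world", "hello"]` and `["hello", "hello world!"]`, the first item initially grabs `"hello"`, and is then moved to `"hello world!"` so that the second item can be matched as well.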

Do any existing algorithms address a similar problem?

Additional Note: Edit Distance Approximations

Currently I am using the following two approximations (both used by Python's difflib):

  • Based on the length, we can calculate the maximum score we could get (e.g. if item 1 is only half the length of item 2 then the score can't be more than 0.5)
  • Intersection of character counts
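A minimal sketch of both bounds, assuming the score is defined as 1 - distance / (length of the longer text). That normalization is an assumption chosen to match the 0.5 example above, and all function names are illustrative. A plain Levenshtein implementation is included only so the bounds can be checked against the exact score.

```python
from collections import Counter


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (the expensive part)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def score(a, b):
    """Exact score in [0.0, 1.0], normalized by the longer length (assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def length_bound(a, b):
    """Upper bound from lengths alone: at least |len(a) - len(b)| edits are needed."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else min(len(a), len(b)) / longest


def char_count_bound(a, b):
    """Upper bound from the multiset intersection of character counts."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    common = sum((Counter(a) & Counter(b)).values())
    return common / longest
```

The character-count bound is never looser than the length bound, so it makes sense to check the length bound first (it is the cheapest) and only then count characters. difflib's `real_quick_ratio()` and `quick_ratio()` apply the same two ideas to its own ratio, 2 * matches / (total length of both texts), so the numeric bounds differ from the ones here.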
