Data Matching Using Machine Learning

Question

I have around 4000 customer records and 6000 user records. About 3000 of the customer records have been matched to a user record, leaving 1000 customers unmatched. I built a fuzzy matching algorithm using Levenshtein and Hamming distances, with weights on certain properties, but I want to match the remaining records without doing it manually. Ideally I would implement an algorithm that takes a customer and a user and outputs match/no match. However, wouldn't I need true negatives to train it? Is there an algorithm that can train with just one label? Thanks

D.W. · Accepted Answer · 2018-03-30T23:21:17.843

You can obtain a negative example by taking one of the 3000 matched customer records and pairing it with any user record that it is known not to match. If each matched customer corresponds to exactly one of the 6000 user records, then every other user record yields a negative, so you obtain $3000$ positives and $3000 \times 5999$ negatives. You could then train a binary classifier on this entire training set. This might work better than using one-class classification on just the positives.
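As a rough illustration of how the negatives could be generated, here is a minimal sketch. It assumes each matched customer corresponds to exactly one user record; the function name, the ID format, and the optional `negatives_per_customer` subsampling are hypothetical additions for illustration (passing `None` keeps every negative, as described above).

```python
import random


def build_training_pairs(matches, user_ids, negatives_per_customer=None, seed=0):
    """Build labeled (customer_id, user_id, label) triples.

    matches: dict mapping customer_id -> the one user_id it is known to match
             (assumption: exactly one match per customer).
    user_ids: list of all user ids.
    negatives_per_customer: None keeps every non-matching user as a negative;
             an integer keeps a random sample of that size (hypothetical
             option, useful if 3000 x 5999 pairs is too many to handle).
    """
    rng = random.Random(seed)
    pairs = []
    for customer_id, matched_user in matches.items():
        # The known match is the single positive example for this customer.
        pairs.append((customer_id, matched_user, 1))
        # Every other user record is a known non-match for this customer.
        candidates = [u for u in user_ids if u != matched_user]
        if negatives_per_customer is not None:
            k = min(negatives_per_customer, len(candidates))
            candidates = rng.sample(candidates, k)
        for user_id in candidates:
            pairs.append((customer_id, user_id, 0))
    return pairs


# Toy usage with made-up ids.
toy_matches = {"c1": "u1", "c2": "u2"}
toy_users = ["u1", "u2", "u3", "u4"]
toy_pairs = build_training_pairs(toy_matches, toy_users)
```

Each triple would then be turned into a feature vector (for example, the per-field string distances you already compute) before being fed to whatever binary classifier you choose. Note that the classes are heavily imbalanced, which is one reason subsampling the negatives may be worth trying.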

Even better might be to use techniques for learning to rank. If $c$ is a customer record that is known to match a user record $u$, and $u'$ is any other user record (which $c$ doesn't match), then you want your model to rank the pair $(c,u)$ higher than $(c,u')$. In this way you can obtain $3000 \times 5999$ such ranking pairs, train a model on them to learn to rank, and then use it to find the best match for each of the 1000 unmatched customer records.
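To make the ranking idea concrete, here is a small sketch of one possible pairwise approach: a linear scoring function trained with a logistic loss on feature differences, so that $(c,u)$ scores higher than $(c,u')$. This is only one of many learning-to-rank methods, not necessarily the one intended above. The field names, toy records, and helper names are hypothetical, and `difflib`'s similarity ratio is used as a stand-in for your Levenshtein/Hamming features.

```python
import math
from difflib import SequenceMatcher


def pair_features(customer, user, fields):
    """One similarity in [0, 1] per field (difflib ratio as a stand-in)."""
    return [SequenceMatcher(None, customer[f], user[f]).ratio() for f in fields]


def score(w, x):
    """Linear score w . x for one (customer, user) feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise_ranker(ranking_pairs, n_features, epochs=50, lr=0.5):
    """ranking_pairs: list of (x_true, x_wrong) feature vectors, where x_true
    comes from (c, u) and x_wrong from (c, u'). Learns w so that
    score(w, x_true) > score(w, x_wrong), by gradient descent on
    log(1 + exp(-w . (x_true - x_wrong)))."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x_true, x_wrong in ranking_pairs:
            diff = [t - f for t, f in zip(x_true, x_wrong)]
            margin = score(w, diff)
            # Negative gradient of the loss is sigmoid(-margin) * diff.
            g = 1.0 / (1.0 + math.exp(margin))
            w = [wi + lr * g * di for wi, di in zip(w, diff)]
    return w


def best_match(customer, users, w, fields):
    """Return the user record with the highest learned score for this customer."""
    return max(users, key=lambda u: score(w, pair_features(customer, u, fields)))


# Toy data (made up): index i in known_matches maps customer i -> user index.
fields = ["name", "email"]
customers = [
    {"name": "John Smith", "email": "jsmith@example.com"},
    {"name": "Mary Jones", "email": "mary.jones@example.com"},
]
users = [
    {"name": "Jon Smith", "email": "jsmith@example.com"},
    {"name": "Mary Jones", "email": "mjones@example.com"},
    {"name": "Bob Brown", "email": "bob@example.com"},
]
known_matches = {0: 0, 1: 1}

# One ranking pair per (matched customer, non-matching user).
ranking_pairs = []
for ci, ui in known_matches.items():
    x_true = pair_features(customers[ci], users[ui], fields)
    for uj, other in enumerate(users):
        if uj != ui:
            ranking_pairs.append((x_true, pair_features(customers[ci], other, fields)))

w = train_pairwise_ranker(ranking_pairs, len(fields))
unmatched = {"name": "Bob Browne", "email": "bob@example.com"}
guess = best_match(unmatched, users, w, fields)
```

A side benefit of this formulation is that the learned weights play the same role as the hand-tuned property weights in your current fuzzy matcher, except that they are fitted from the 3000 known matches. Since some of the 1000 remaining customers may have no matching user at all, you would probably still want a score threshold below which no match is reported.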
