
I would like to know how to match postal addresses when their formats differ or when one of them is misspelled.

So far I've found several solutions, but I think they are quite old and not very efficient. I'm sure better methods exist, so if you have references for me to read, I'm sure the subject will interest several people.

The solutions I found (examples are in R):

  • Levenshtein distance, which equals the number of characters you have to insert, delete or change to transform one word into another.

    agrep("acusait", c("accusait", "abusait"), max = 2, value = TRUE) ## [1] "accusait" "abusait"

  • Comparison of phonetic codes (e.g. Soundex)

    library(RecordLinkage)
    soundex(x <- c('accusait', 'acusait', 'abusait')) ## [1] "A223" "A223" "A123"

  • The use of a spelling corrector (possibly a Bayesian one like Peter Norvig's), but I guess it is not very efficient on addresses (a rough sketch of the dictionary-lookup idea is shown after this list).

  • I thought about using the suggestions of Google Suggest, but likewise, it is not very efficient on personal postal addresses.

  • You could imagine a supervised machine learning approach, but that requires having stored users' misspelled queries, which is not an option for me.
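
For the spelling-corrector idea, here is a rough sketch of the dictionary-lookup part in base R, assuming you have a reference list of known street names (the street names and the correct_street() helper below are invented for illustration):

    # Reference vocabulary of known street names (invented for illustration)
    streets <- c("boulevard voltaire", "rue de la republique", "avenue jean jaures")

    # Return the closest known street name by edit distance,
    # refusing matches that are too far away
    correct_street <- function(x, dict = streets, max_dist = 3) {
      d <- adist(tolower(x), dict)
      if (min(d) > max_dist) return(NA_character_)
      dict[which.min(d)]
    }

    correct_street("boulevard voltare")  ## [1] "boulevard voltaire"
    correct_street("qwertyuiop")         ## [1] NA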

Marcus D
Stéphanie C

4 Answers


As you are using R, you might want to look into the stringdist package, which supports the Jaro-Winkler distance metric in its calculations. The Jaro-Winkler distance was developed at the U.S. Census Bureau for record linkage.

See this journal article for more information on the Jaro and Jaro-Winkler distances.

For a comparison of different matching techniques, read this paper.
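
As a minimal sketch of how this might look (the address strings below are made up for illustration; method = "jw" with p > 0 gives the Winkler variant):

    library(stringdist)

    # Two renderings of the same (made-up) address
    a <- "12 rue de la republique, 69002 lyon"
    b <- "12 Rue de la Republique 69002 LYON"

    # Jaro-Winkler distance: 0 = identical, 1 = completely different
    stringdist(tolower(a), tolower(b), method = "jw", p = 0.1)

    # amatch() returns the index of the closest entry in a reference
    # table, or NA if nothing is within maxDist
    amatch(tolower(a), tolower(c(b, "34 avenue jean jaures, 75019 paris")),
           method = "jw", p = 0.1, maxDist = 0.2)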

phiver

There are lots of clever ways to extend the Levenshtein distance to give a fuller picture. A brief intro to a pretty useful Python module called 'FuzzyWuzzy', by the team at SeatGeek, is here.

A couple of things you can do are partial string similarity (if you have strings of different lengths, say m and n with m < n, you only match on m characters) and tokenisation: split each string into tokens (individual words), then compare the token sets, or sort the tokens alphabetically and compare the reordered strings.
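
FuzzyWuzzy itself is for Python, but since the question's examples are in R, here is a rough sketch of the token-sort idea (the token_sort() helper and the sample strings are just for illustration):

    library(stringdist)

    # Strip punctuation, split into words, sort them and rejoin,
    # so "rue de la Paix 3" and "3 rue de la Paix" compare as equal
    token_sort <- function(x) {
      x <- gsub("[[:punct:]]", " ", tolower(x))
      sapply(strsplit(x, "\\s+"), function(tok) paste(sort(tok), collapse = " "))
    }

    a <- "3 rue de la Paix, Paris"
    b <- "rue de la Paix 3 Paris"

    # Jaro-Winkler similarity on the sorted tokens (1 = identical)
    1 - stringdist(token_sort(a), token_sort(b), method = "jw")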

dmb

Another popular technique for detecting partial string matches (though typically at the document level) is shingling. In essence, it is a moving-window approach that extracts a set of n-grams from the target word/document and compares it to the n-gram sets of other words/documents via the Jaccard coefficient. Manning and colleagues (2008) discuss near-duplicates and shingling in the context of information retrieval.
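
A small sketch of the idea in R, using character 3-grams as shingles (the addresses are made up; the stringdist package offers a similar comparison as a one-liner):

    # Set of character n-gram "shingles" for a string
    shingles <- function(x, n = 3) {
      x <- tolower(x)
      unique(substring(x, 1:(nchar(x) - n + 1), n:nchar(x)))
    }

    # Jaccard coefficient between two shingle sets
    jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

    a <- "10 Downing Street, London"
    b <- "10, Downing St., London"

    jaccard(shingles(a), shingles(b))

    # A similar comparison with the stringdist package (1 - distance = similarity):
    # 1 - stringdist::stringdist(a, b, method = "jaccard", q = 3)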

Brandon Loudermilk

I've written a generic probabilistic fuzzy matcher in Python which will do a reasonable job of matching any type of data:

https://github.com/robinl/fuzzymatcher

It works in memory, so you probably don't want to use it to match datasets above about 100k rows.

I've also written a similar project specific to UK addresses, but it assumes you have access to Addressbase Premium. This one isn't in-memory, so it has been used against the 100 million or so UK addresses. See here:

https://github.com/RobinL/AddressMatcher

If you want to get this going quickly, I'd recommend using libpostal to normalise your addresses and then feeding them into my generic fuzzymatcher (pip install fuzzymatcher).

You can find usage examples here.

RobinL