nlp: phonetic edit distance between a word and the closest of a set of words

Question

Let's say someone is using Dragon Dictation, Google Speech, or some other free form dictation software (it will recognize anything they say to the best of its ability). I have some reasonably large set of words and I'm certain the speaker is trying to say one of these words. However, the voice recognition system is not perfect, so sometimes the voice recognition engine will spit out a word that is phonetically similar to what the user intended, but that is not in the set.

In other words, I have a word X, and a set of words Y. I want the member of Y which sounds the most like X.

I know that phonemes are the most basic sound unit of sound, so I've attempted to use CMU's Pronouncing dictionary to break X into phonemes, and use something similar to an edit distance algorithm to alter these phonemes until they match one of the members of Y. However, there are some huge issues:

There are at least 40 phonemes in CMU's system, which makes for a huge branching factor when you consider adding or replacing at any given position. I can't thoroughly search beyond three changes or so, which isn't sufficient.
Phonemes are perhaps too 'low-level' in the sense that there are many combinations of phonemes which are not words at all.

Building on complaint #2, I was thinking it would be nice to have a weighted graph of at least the most common words to other reasonably similar words, and I could run a search algorithm on this graph to find words in Y more quickly, and with a less explosive branching factor. Precomputing such a graph is on the table, as it could just be loaded at runtime.

Are there any sort of methods for doing this which exist currently? Or should I be thinking in a different direction?

score 5 · Accepted Answer · edited Oct 25 '17 at 18:52

Building on the comments, there's something called the Needleman–Wunsch algorithm, for which you can find the lowest 'cost' alignment of two sequences (phonemes). You can get a good 'confusion cost matrix' by taking a normal confusion matrix and taking the log-odds of each confusion pair.

As long as your set of words is less than 33k or so, you should be able to find the lowest cost match in about a second or so. Multiply that by N if you multithread across N cores...

Full disclosure: not my ideas here, I got them from Phonemic Similarity Metrics to Compare Pronunciation Methods (Hixon, Schneider and Epstein, in Interspeech 2011).

nlp: phonetic edit distance between a word and the closest of a set of words

1 Answers1