
Are there any pre-trained models for finding similar word n-grams, where n>1?

FastText, for instance, seems to work only on unigrams:

from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)

[('dogs', 0.8463464975357056), ('puppy', 0.7873005270957947), ('pup', 0.7692237496376038), ('canine', 0.7435278296470642), ...

but it fails on longer n-grams:

model.nearest_neighbors('Gone with the Wind', k=2000)

[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi', 0.71047443151474),

or

model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
 ('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo',
  0.5197194218635559),

1 Answer

First off, to my knowledge there are no models trained specifically to generate n-gram embeddings, although it would be fairly easy to modify the word2vec training procedure to accommodate n-grams, for example by merging frequent n-grams into single tokens before training.
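As a rough sketch of that idea (not something the pre-trained FastText vectors above do), you could merge frequent collocations into single tokens with gensim's Phrases and then train word2vec on the merged corpus. The corpus file and hyperparameters below are placeholders, and the argument names follow gensim 4.x:

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# corpus.txt is a placeholder: one sentence per line
sentences = [line.lower().split() for line in open('corpus.txt')]

# merge frequent bigrams such as "star wars" into single tokens like "star_wars"
bigram = Phraser(Phrases(sentences, min_count=5, threshold=10.0))
merged = [bigram[s] for s in sentences]

# train word2vec on the merged corpus
model = Word2Vec(merged, vector_size=300, window=5, min_count=5)

# only works if the bigram occurred often enough to be merged
print(model.wv.most_similar('star_wars', topn=10))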

Now, what can you do?

You could compute an n-gram embedding by summing the individual word embeddings. Optionally, you can weight each word, for instance by its tf-idf score, but that is not required. Once you have a single vector, simply find nearest neighbors using cosine distance.
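A minimal sketch of that approach, assuming you load the same cc.en.300.bin vectors with gensim and have your own list of candidate phrases to search over (the candidates below are placeholders):

import numpy as np
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors('cc.en.300.bin')  # pre-trained unigram vectors

def phrase_vector(phrase):
    # sum the word vectors; you could also average or weight by tf-idf
    return np.sum([wv[w] for w in phrase.lower().split()], axis=0)

def nearest_phrases(query, candidates, k=5):
    q = phrase_vector(query)
    q = q / np.linalg.norm(q)
    scored = []
    for cand in candidates:
        v = phrase_vector(cand)
        scored.append((cand, float(np.dot(q, v / np.linalg.norm(v)))))  # cosine similarity
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

candidates = ['Star Trek', 'Gone with the Wind', 'hot dog']  # your own n-gram list
print(nearest_phrases('Star Wars', candidates, k=2))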

Another approach, though more computationally expensive, would be to compute the Earth Mover's Distance (also called the Wasserstein distance) between n-grams and find nearest neighbors that way.
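gensim exposes this idea as Word Mover's Distance (wmdistance), which computes the EMD between the word embeddings of two token lists; note it requires an extra dependency (pyemd or POT, depending on the gensim version). A sketch reusing the wv vectors and placeholder candidates from above:

def wmd_nearest(query, candidates, k=5):
    q_tokens = query.lower().split()
    scored = [(cand, wv.wmdistance(q_tokens, cand.lower().split())) for cand in candidates]
    # smaller distance means more similar, so sort ascending
    return sorted(scored, key=lambda x: x[1])[:k]

print(wmd_nearest('Star Wars', ['Star Trek', 'Gone with the Wind', 'hot dog'], k=2))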

Valentin Calomme