
I need to train a word2vec embedding model on Wikipedia articles using Gensim.

Eventually, I will use the entire Wikipedia dump for that, but for the moment I'm doing some experimentation/optimization to improve the model quality, and I was wondering: how many articles would be enough to train a meaningful/good model? How many examples are needed for each unique word in the vocabulary?

Abdulrahman Bres

1 Answer


It is not the number of articles that matters but the total number of words.
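If it helps, here is a rough sketch of counting tokens rather than articles, assuming you have a downloaded dump file (the `enwiki-latest-pages-articles.xml.bz2` path is just a placeholder) and Gensim's `WikiCorpus`:

```python
from gensim.corpora import WikiCorpus

# Hypothetical path to a Wikipedia dump; substitute your own file.
dump_path = "enwiki-latest-pages-articles.xml.bz2"

# Passing dictionary={} skips building a vocabulary, so this only streams the text.
wiki = WikiCorpus(dump_path, dictionary={})

article_count = 0
token_count = 0
for tokens in wiki.get_texts():  # yields one list of tokens per article
    article_count += 1
    token_count += len(tokens)

print(f"{article_count} articles, {token_count} tokens in total")
```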

Whether a model is "meaningful/good" enough is an empirical question that depends on the dataset. One way to test a newly trained model is the Google analogy test set, which scores how often the model completes word analogies correctly so you can compare it against established embedding benchmarks.
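A minimal sketch of that evaluation, assuming you have a trained model saved at a (hypothetical) path; Gensim ships a copy of the analogy file in its test data and exposes `evaluate_word_analogies` on the keyed vectors:

```python
from gensim.models import Word2Vec
from gensim.test.utils import datapath

# Assumption: a model was trained and saved earlier at this hypothetical path.
model = Word2Vec.load("wiki_word2vec.model")

# questions-words.txt is the Google analogy test set bundled with Gensim's test data.
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Overall analogy accuracy: {score:.2%}")

# Per-category breakdown (capital-common-countries, family, gram3-comparative, ...).
for section in sections:
    correct = len(section["correct"])
    total = correct + len(section["incorrect"])
    if total:
        print(f"{section['section']}: {correct}/{total}")
```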

As for the minimum number of examples needed for each unique token in the vocabulary, the general consensus is that there should be at least 40 examples per token. If a token has fewer than 40 examples, its vector estimates can be unstable, and the token should be dropped from training; in Gensim this is what the `min_count` parameter does (see the sketch below).
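A minimal sketch of that cutoff with the Gensim 4.x API; in practice `sentences` would be a streamed Wikipedia corpus (e.g. the `WikiCorpus.get_texts()` output above), but a toy stand-in is used here so the snippet runs on its own:

```python
from gensim.models import Word2Vec

# Toy corpus: each token appears 50 times, so it survives the min_count=40 cutoff.
sentences = [["wikipedia", "article", "text", "example"]] * 50

model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    min_count=40,     # drop tokens that appear fewer than 40 times in the corpus
    workers=4,        # parallel worker threads
)

print(len(model.wv))  # vocabulary size after min_count pruning
```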

Brian Spiering