
I am using the Gensim library in Python to use and train a word2vec model. Recently, I was looking at initializing my model's weights with those of a pre-trained word2vec model, such as the GoogleNews dataset pretrained model. I have been struggling with this for a couple of weeks. Now, I just found out that gensim has a function that can help me initialize the weights of my model with pre-trained model weights.

That function is documented as follows:

reset_from(other_model)
Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.

I do not know whether this function can do the same thing or not. Please help!
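For reference, a pre-trained model such as the GoogleNews vectors is usually loaded like this (a minimal sketch, assuming the gensim 3.x API and the file name as distributed by Google):

from gensim.models import KeyedVectors

# Load the pre-trained GoogleNews vectors (binary word2vec format).
# The file name/path is an assumption; point it at your local copy.
pretrained = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)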

Nomiluks

4 Answers


Thanks Abhishek, I've figured it out! Here are my experiments.

1) We plot a simple example:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
            ['this', 'is', 'the', 'second', 'sentence'],
            ['yet', 'another', 'sentence'],
            ['one', 'more', 'sentence'],
            ['and', 'the', 'final', 'sentence']]
# train the model (gensim 3.x API: `size` was renamed `vector_size` in gensim 4.0)
model_1 = Word2Vec(sentences, size=300, min_count=1)

# fit a 2d PCA model to the vectors
X = model_1.wv[model_1.wv.vocab]  # one 300-d vector per vocabulary word
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model_1.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

[PCA projection of the word vectors trained only on the toy sentences]

From the above plot, we can see that with such a tiny corpus the distances between the word vectors do not reflect differences in meaning.

2) Load pre-trained word embeddings:

from gensim.models import KeyedVectors

# build a model with the same settings, but do not train it yet
model_2 = Word2Vec(size=300, min_count=1)
model_2.build_vocab(sentences)
total_examples = model_2.corpus_count

# load the pre-trained GloVe vectors (in word2vec text format; see the note below)
model = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", binary=False)

# add the pre-trained vocabulary, copy the overlapping vectors in, and train
model_2.build_vocab([list(model.vocab.keys())], update=True)
model_2.intersect_word2vec_format("glove.6B.300d.txt", binary=False, lockf=1.0)
model_2.train(sentences, total_examples=total_examples, epochs=model_2.iter)

# fit a 2d PCA model to the vectors
X = model_2.wv[model_1.wv.vocab]  # project only the toy-corpus words, for comparison with the plot above
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model_1.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

[PCA projection of the same words after initializing with the pre-trained GloVe embeddings]

From the above figure, we can see that the pre-trained word embeddings give a much more meaningful layout.
Hope this answer is helpful.
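
One caveat on the snippet above: load_word2vec_format expects the word2vec text format, which has a header line that the raw GloVe files lack. A minimal sketch of the conversion, assuming gensim 3.x and the raw glove.6B.300d.txt file:

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Convert the raw GloVe file to word2vec text format (this just adds the header line).
glove2word2vec("glove.6B.300d.txt", "glove.6B.300d.w2v.txt")

# The converted file can then be loaded and intersected as above.
glove_vectors = KeyedVectors.load_word2vec_format("glove.6B.300d.w2v.txt", binary=False)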

Shixiang Wan

Let us look at some sample code:

from gensim.models import word2vec

# let us train a sample model like yours
sentences = [['first', 'sentence'], ['second', 'sentence']]
model1 = word2vec.Word2Vec(sentences, min_count=1)

# let this be the model from which you want to reset
sentences = [['third', 'sentence'], ['fourth', 'sentence']]
model2 = word2vec.Word2Vec(sentences, min_count=1)

model1.reset_from(model2)
model1.similarity('third', 'sentence')
# -0.064622000988260417

Hence, we observe that model1 has been reset from model2, so the words 'third' and 'sentence' are in its vocabulary and their similarity can be queried. This is the basic use; you can also check reset_weights() to reset the weights to their untrained/initial state.

Hima Varsha

If you are looking for pre-trained word embeddings, I would suggest GloVe. The following blog post from Keras is very informative about how to implement this, and it also links to the pre-trained GloVe embeddings. There are pre-trained word vectors ranging from 50-dimensional to 300-dimensional, built on Wikipedia, Common Crawl, or Twitter data. You can download them here. Additionally, you should examine the Keras blog post on how to implement them.
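
To give a flavour of what that blog post describes, here is a minimal sketch of loading a GloVe file into an embedding matrix and using it to initialize a frozen Keras Embedding layer (the file name, the toy word_index, and the input length are assumptions for illustration):

import numpy as np
from keras.layers import Embedding

# 1) Parse the GloVe text file into a {word: vector} index.
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# 2) Build an embedding matrix for your own vocabulary. word_index would
#    normally come from a Keras Tokenizer; this toy dict is a placeholder.
word_index = {'first': 1, 'sentence': 2, 'word2vec': 3}
embedding_dim = 100
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# 3) Initialize a frozen Embedding layer with the pre-trained weights.
embedding_layer = Embedding(len(word_index) + 1, embedding_dim,
                            weights=[embedding_matrix],
                            input_length=10,  # assumed max sequence length
                            trainable=False)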

Samuel Sherman

I have done it here in my GitHub repository.

See if this is what you need.

Abhishek