
I want to train a fastText unsupervised model on my text dataset. However, there are many hyperparameters in the `train_unsupervised` method:

    lr                # learning rate [0.05]
    dim               # size of word vectors [100]
    ws                # size of the context window [5]
    epoch             # number of epochs [5]
    minCount          # minimal number of word occurrences [5]
    minn              # min length of char ngram [3]
    maxn              # max length of char ngram [6]
    neg               # number of negatives sampled [5]
    wordNgrams        # max length of word ngram [1]
    thread            # number of threads [number of cpus]
    lrUpdateRate      # change the rate of updates for the learning rate [100]

Some of them dramatically influence the quality of the embeddings (dim, lr, minn, and maxn especially). However, I haven't found any method for tuning these hyperparameters. How could I do that? And how might features of my dataset (mean sentence length, for example) influence the choice of some of these hyperparameters?

Ir8_mind

1 Answer


In order to tune hyperparameters, you'll need an evaluation metric. One evaluation metric for embeddings is performance on analogies (e.g., man is to king as woman is to _____). There is an analogy test set created by Google. You can adjust embedding hyperparameter values and see which ones perform better on that collection of analogies.
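Once you have such a metric, tuning reduces to searching over hyperparameter combinations and keeping the best-scoring one. Here is a minimal grid-search sketch; the `evaluate` callable is a placeholder that you would replace with, e.g., accuracy on Google's analogy set (the dummy scorer below is only there so the sketch runs):

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination in `grid` and return the best (params, score).

    `grid` maps hyperparameter names to lists of candidate values;
    `evaluate` takes a params dict and returns a score (higher is better),
    e.g. by training a fastText model with those params and measuring
    analogy accuracy.
    """
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Candidate values are illustrative, not recommendations.
grid = {"dim": [100, 300], "lr": [0.025, 0.05], "minn": [2, 3]}

# Dummy scorer standing in for analogy accuracy (assumption):
best, score = grid_search(grid, lambda p: -abs(p["dim"] - 300) - p["lr"])
```

Grid search is exhaustive and gets expensive quickly since each evaluation requires training a model; with more than a few hyperparameters, random search over the same ranges is a common cheaper alternative.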

Brian Spiering