
I am wondering how to label (tag) sentences / paragraphs / documents with doc2vec in gensim - from a practical standpoint.

Do you need to give each sentence / paragraph / document its own unique label (e.g. "Sent_123")? This seems useful if you want to ask "what words or sentences are most similar to the specific sentence labeled 'Sent_123'?"

Can labels be repeated based on content? For example, if each sentence / paragraph / document is about a certain product item (and there are multiple sentences / paragraphs / documents for a given product item), can you label the sentences by item and then compute the similarity between a word or a sentence and this label (which I guess would be like an average of all the sentences about that product item)?

B_Miner

2 Answers


Both are possible. You can give every document a unique ID (such as a sequential serial number) as a doctag, or a shared string doctag representing something else about it, or both at the same time.

The TaggedDocument constructor takes a list of tags. (If you limit yourself to plain ints ascending from 0, the Doc2Vec model will use them as direct indexes into its backing array, and you'll save a lot of memory that would otherwise be devoted to a string -> index lookup, which could be important for large datasets. But you can use string doctags or even a mixture of int and string doctags.)
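As a rough illustration of passing several tags at once, here is a minimal sketch that gives each text a unique int tag plus a shared product tag. The review data, the `ITEM_...` tag names and the `tag_reviews` helper are made up for the example, and the namedtuple fallback is only there so the snippet runs without gensim installed. Note that the memory saving mentioned above is described for the int-only case.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, so the sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Hypothetical data: (product id, review sentence) pairs.
reviews = [
    ("ITEM_phone", "the battery lasts all day"),
    ("ITEM_phone", "screen is bright and sharp"),
    ("ITEM_kettle", "boils water very quickly"),
]


def tag_reviews(pairs):
    """Give each text a unique int tag (0, 1, 2, ...) plus a shared product tag."""
    return [
        TaggedDocument(words=text.split(), tags=[i, product])
        for i, (product, text) in enumerate(pairs)
    ]


corpus = tag_reviews(reviews)
for doc in corpus:
    print(doc.tags, doc.words)
```

Dropping either element of the `tags` list gives the unique-ID-only or the shared-label-only variant.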

You'll have to experiment with what works best for your needs.

For some classification tasks, an approach that's sometimes worked better than I would have expected is skipping per-text IDs entirely and training the Doc2Vec model only on known-class examples, with the desired classes as the doctags. You then get 'doc vectors' only for the class doctags, not for every document, which makes for a potentially much smaller model. Vectors inferred later for new texts then land meaningfully close to the doc vectors of related classes.
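The class-as-doctag idea could be sketched roughly as follows. This assumes gensim 4.x (where doc vectors live in `model.dv`). The toy data, the helper names and the hyperparameters are illustrative, not tuned recommendations.

```python
# Hypothetical labeled examples: (class label, text).
labeled = [
    ("positive", "great phone and great battery"),
    ("positive", "love the bright screen"),
    ("negative", "battery died after a week"),
    ("negative", "screen cracked on day one"),
]


def class_tagged(pairs):
    """Return (words, tags) pairs where the only tag is the class label."""
    return [(text.split(), [label]) for label, text in pairs]


def train_class_model(pairs, epochs=40):
    """Train Doc2Vec so it holds one doc vector per class (assumes gensim 4.x)."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(words, tags) for words, tags in class_tagged(pairs)]
    return Doc2Vec(documents=corpus, vector_size=50, min_count=1, epochs=epochs)


def nearest_classes(model, text, topn=2):
    """Infer a vector for unseen text and rank the class doctags by similarity."""
    vector = model.infer_vector(text.split())
    return model.dv.most_similar([vector], topn=topn)


# Usage (requires gensim):
# model = train_class_model(labeled)
# print(nearest_classes(model, "battery drains fast"))
```

With a corpus this small the rankings will be noisy. The point is only the shape of the workflow: class labels as the sole tags, then `infer_vector` on new text.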

gojomo

The doc2vec model derives its algorithm from word2vec.

In word2vec there is no need to label the words, because every word in the vocabulary has its own semantic meaning. In doc2vec, however, you need to specify how many words or sentences together convey a semantic meaning, so that the algorithm can treat them as a single entity. That is why we assign labels or tags to sentences or paragraphs, depending on the level at which semantic meaning is conveyed.

If we assign a single label to multiple sentences in a paragraph, it means that all of those sentences are needed to convey the meaning. On the other hand, if we assign a different label to each sentence in a paragraph, it means that each sentence conveys its own semantic meaning, and the sentences may or may not be similar to one another.

In simple terms, a label stands for the semantic meaning of something.

chmodsss