
I have a basic doubt regarding the conversion of text to numbers for feeding into an LSTM. I am aware of the different methods such as one-hot encoding, CountVectorizer, TF-IDF, Word2vec, etc. My first doubt is: if we use a CountVectorizer or TF-IDF, then in the LSTM we have to pass a vector spanning the entire vocabulary for each sentence, since that's how TF-IDF and CountVectorizer encode sentences. Am I right?
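For concreteness, here is a minimal sketch of what I mean (assuming a recent scikit-learn; the sentences are just toy examples I made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["hi how are you", "hi hi fine thanks"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# Each row is one sentence; the number of columns equals the vocabulary size.
print(X.shape)                              # (2, 6) -> 2 sentences, 6-word vocabulary
print(vectorizer.get_feature_names_out())   # one column per vocabulary word
```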

My second doubt is: if we use TF-IDF or CountVectorizer, each word will have a different value based on its occurrence and frequency. This is in contrast to Word2vec, where an embedding is learned and reused. If the LSTM model sees a different value for a particular word each time, how can it learn? For example, if the word "Hi" appears 6 times in a sentence, it is encoded with the number 6 at its appropriate index, and in another sentence where it appears 4 times, we encode it with the value 4. How does this work? It doesn't make sense to me.
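Again with scikit-learn's CountVectorizer and made-up sentences, this sketch shows what I mean by the same word getting different values:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "hi hi hi hi hi hi there",   # "hi" appears 6 times
    "hi hi hi hi friend",        # "hi" appears 4 times
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences).toarray()

hi_index = vectorizer.vocabulary_["hi"]          # fixed column index for "hi"
print(hi_index, X[0, hi_index], X[1, hi_index])  # same index, values 6 and 4
```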

mewbie

1 Answer

The difference between the traditional bag-of-words representation and the word-embedding representation is this:

  • bag of words: every index of the vector represents a specific word. Since there must be an index for every possible word, the dimension of every vector must indeed be the full vocabulary size.
  • embedding: words and sentences are represented in a (usually pre-trained) embedding space. The dimension of this space is predefined and arbitrary, and there is no direct way to know what each index represents. Indirectly, it can be shown that indexes can represent quite precise semantic concepts (see the sketch after this list).
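As a rough illustration of the two representations, here is a minimal sketch with a toy 4-word vocabulary and random numbers (a real embedding matrix would come from pre-training, e.g. Word2vec or GloVe):

```python
import numpy as np

vocab = {"hello": 0, "hi": 1, "world": 2, "dear": 3}   # toy vocabulary

# Bag of words: one dimension per vocabulary word, so the vector
# length is always the full vocabulary size.
bow = np.zeros(len(vocab))
bow[vocab["hi"]] = 1
print(bow.shape)          # (4,) -> grows with the vocabulary

# Embedding: each word index selects a row of a (vocab_size x dim) matrix;
# dim is chosen in advance (here 3) and independent of the vocabulary size.
embedding_dim = 3
embedding_matrix = np.random.rand(len(vocab), embedding_dim)  # normally pre-trained
hi_vector = embedding_matrix[vocab["hi"]]
print(hi_vector.shape)    # (3,) -> fixed, arbitrary dimension
```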

Anyway, in both cases the feature values (which can vary) don't carry the meaning; it is always the fixed indexes which represent a particular semantic concept.

In your example, say the word "Hi" has index 1234: the fact that this specific index contains 6 or 4 allows the model to recognize a similarity between these two sentences. Note that in an embedding representation it is also the indexes which carry the concept. For example, "Hi" might have a high value in the dimension related to "salutations", and this would allow the model to find a similarity with words like "hello", "dear X", etc.
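To illustrate, here is a sketch with made-up vectors (not actual trained embeddings), where dimension 0 is imagined to capture "salutation-ness"; greeting words then end up close together:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-d embeddings; dimension 0 stands in for "salutations".
hi    = np.array([0.9, 0.1, 0.2])
hello = np.array([0.8, 0.2, 0.1])
table = np.array([0.0, 0.7, 0.6])

print(cosine(hi, hello))  # high (~0.99) -> the model can treat them as similar
print(cosine(hi, table))  # low  (~0.22) -> unrelated words stay far apart
```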

Erwan