If I have a matrix of co-occurring words in conversations of different lengths, is it appropriate to standardize / normalize the data prior to training?

My matrix is set up as follows: one row per two-person conversation, with one column per word that co-occurs between the speakers. I cannot help but think that, since a longer conversation will likely contain more shared words than a shorter one, I should account for this somehow.

cookie1986

1 Answer

Thanks for clarifying in the comments. Tree-based models do not care about the absolute values a feature takes; they only care about the order of those values. Normalization is therefore used mainly with linear models, k-NN, and neural networks, because those are affected by the absolute values of the features.

You don't need to normalize/standardize.
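
To illustrate the point about ordering, here is a minimal sketch in plain Python. It is not how any particular library implements trees: `gini` and `best_split` are hypothetical helpers written for this example, and the word counts and labels are made up. It finds the best single split on one feature using Gini impurity, once on raw counts and once on standardized counts, and shows that both send the same rows to the left branch.

```python
from statistics import mean, pstdev


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (rows sent left, weighted impurity) for the best single split.

    Only the sorted order of `values` is used to form candidate splits,
    which is why a monotonic rescaling cannot change the result.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    for k in range(1, len(order)):
        # A threshold can only fall between two distinct feature values.
        if values[order[k - 1]] == values[order[k]]:
            continue
        left, right = order[:k], order[k:]
        score = (
            len(left) * gini([labels[i] for i in left])
            + len(right) * gini([labels[i] for i in right])
        ) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left, best_score


# Made-up example: counts of one shared word in six conversations.
counts = [3, 12, 5, 40, 8, 27]
labels = [0, 1, 0, 1, 0, 1]

# Standardize the column: subtract the mean, divide by the std. deviation.
mu, sigma = mean(counts), pstdev(counts)
scaled = [(c - mu) / sigma for c in counts]

raw_left, raw_score = best_split(counts, labels)
scaled_left, scaled_score = best_split(scaled, labels)

print(raw_left == scaled_left)  # the same rows go left in both cases
```

The threshold value itself differs between the two runs, but the partition of the rows, and hence the tree's predictions, does not. Note that this covers per-column rescaling such as standardization; dividing each row by its conversation length is a different operation and can change the ordering within a column.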

Check this post.

Blenz