
I have a dataset produced by NLP processing of technical documents.

The dataset has 60,000 records and 30,000 features. Each feature is a word, and each value is the number of times that word appears in the record.

Here is a sample of the dataset:

RowID              Microsoft  Internet  PCI  Laptop  Google  AWS  iPhone  Chrome
1                          8         2    0       0       5    1       0       0
2                          0         1    0       1       1    4       1       0
3                          0         0    0       7       1    0       5       0
4                          1         0    0       1       6    7       5       0
5                          5         1    0       0       5    0       3       1
6                          1         5    0       8       0    1       0       0

Total appearances      9,470       821    5     107   4,605  719      25       8

Some words appear fewer than 10 times in the whole dataset.

The technique is to keep only the words/features whose total count across the dataset exceeds a certain threshold (say 100).

What is this technique called, i.e. keeping only the features whose total number of appearances is above a threshold?

asmgx

1 Answer


I might not be aware of it, but I don't think there is a term for this technique. I would call this "[filtering words based on a] minimum frequency [threshold]" or similar.

It's extremely common; in fact, I tend to think that not doing it is a mistake unless there is a good reason. The rationale is that rare words are likely to cause overfitting, since their association with a particular label is usually due to chance.
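
For illustration, here is a minimal sketch of such a minimum frequency filter in plain Python, using the six sample rows from the question. The function name `filter_by_min_frequency` and the threshold of 10 are assumptions chosen for this small example, not something from an existing library; on the full dataset the threshold would be closer to the 100 mentioned in the question.

```python
# Column names and counts are the six sample rows shown in the question.
words = ["Microsoft", "Internet", "PCI", "Laptop", "Google", "AWS", "iPhone", "Chrome"]
rows = [
    [8, 2, 0, 0, 5, 1, 0, 0],
    [0, 1, 0, 1, 1, 4, 1, 0],
    [0, 0, 0, 7, 1, 0, 5, 0],
    [1, 0, 0, 1, 6, 7, 5, 0],
    [5, 1, 0, 0, 5, 0, 3, 1],
    [1, 5, 0, 8, 0, 1, 0, 0],
]


def filter_by_min_frequency(rows, words, min_count):
    """Keep only the columns whose total count is at least min_count.

    Hypothetical helper for illustration. Use a strict '>' instead of '>='
    if "more than the threshold" is what you want.
    """
    # Total number of appearances of each word over all records.
    totals = [sum(column) for column in zip(*rows)]
    # Indices of the columns that reach the threshold.
    keep = [i for i, total in enumerate(totals) if total >= min_count]
    kept_words = [words[i] for i in keep]
    filtered_rows = [[row[i] for i in keep] for row in rows]
    return filtered_rows, kept_words


filtered_rows, kept_words = filter_by_min_frequency(rows, words, min_count=10)
print(kept_words)
```

Note that this filters on the total count, as described in the question. Some tools filter on document frequency instead (the number of records containing the word), for example the `min_df` parameter of scikit-learn's `CountVectorizer`; the two criteria are related but not identical.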

Note: I often mention this point, for example here, here, there...

Erwan