
I have a system where I get an array of feature strings as input:

["kol","bol","sol","nol"]

The length of this array is dynamic: I can get 2, 4, 6, etc. features, with fewer than 20 in total.

I need to make a decision according to this array, the decision is another string:

x = ["feature1","feature5","feature3","feature8"] #in
y = "john" #decide

What I end up doing is creating a table with a 1 if the feature exists and a 0 otherwise, one row per training example (a pandas DataFrame):

feature1   feature2   feature3   feature4   feature5 ...   decision
1          0          1          0          1              1

(decision strings are mapped to integers: "john" → 1, "Ly" → 2, etc.)

I feed this into sklearn's DecisionTreeClassifier and train it with 100+ input feature arrays and their desired outcomes.
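
For reference, here is a minimal sketch of that setup (the feature lists and labels below are invented for illustration). sklearn's MultiLabelBinarizer builds exactly this kind of 1/0 table from lists of feature strings, and DecisionTreeClassifier accepts string labels directly, so the manual "john" → 1 mapping isn't actually needed:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Each training instance is a list of feature strings plus a decision label.
X_raw = [
    ["feature1", "feature5", "feature3", "feature8"],
    ["feature2", "feature5"],
    ["feature1", "feature3"],
]
y = ["john", "Ly", "john"]

# Fitting fixes the set of known features and their column order,
# exactly like the table above.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(X_raw)          # rows of 1/0 indicators

clf = DecisionTreeClassifier()
clf.fit(X, y)

# At prediction time the same binarizer must be reused; features it
# never saw during training are dropped (with a warning).
x_new = mlb.transform([["feature1", "feature3", "feature9"]])
print(clf.predict(x_new))             # e.g. ['john']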

It works, but I have a feeling that it won't really provide value when the input differs from the training data, because there is no real meaning/weight to these binary values.

These feature strings come from a bag of words: whenever one of them appears in a text, I extract it, producing a well-defined set of features for training/prediction.

  1. Can I, or should I, change the values from 1/0 to more weighted ones? How would I get those weights?
  2. Is this the right approach, given that I have a bag of words that I look for in a text, and I produce as features the words that appear both in the text and in the bag?

1 Answer

This looks very similar to text classification. The main principle in any supervised classification is that the model must receive the same features (in the same order) at prediction time as at training time.

This is why the bag-of-words representation is traditionally used: every word in the vocabulary is assigned an index $i$ and represented as feature $i$. The value of the feature can be boolean (1 if the word is present in the instance, 0 otherwise) or numerical (the frequency of the word in the instance, or a more complex value such as TF-IDF). The meaning of these features is simple: each one tells the model whether a particular word is present. The model then learns how often a particular label is associated with a particular word; in a decision tree this takes the form of conditions such as: "if the instance contains word A and does not contain word B and contains word C, then the label is Y".
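
As a sketch of those three value schemes (the toy corpus below is invented for illustration), sklearn's CountVectorizer and TfidfVectorizer produce exactly these representations:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the cat sat on the mat", "dogs bark"]

# Boolean: 1 if the word is present in the instance, 0 otherwise.
X_bool = CountVectorizer(binary=True).fit_transform(corpus)

# Frequency: raw count of the word in the instance.
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(corpus)

# TF-IDF: frequency, weighted down for words common across the corpus.
X_tfidf = TfidfVectorizer().fit_transform(corpus)

print(count_vec.get_feature_names_out())  # the fixed vocabulary: word i <-> feature i
print(X_count.toarray())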

Crucially, the vocabulary is fixed at the training stage. This implies that any new word found in the test instances cannot be used at all: this is the problem of out-of-vocabulary (OOV) words. It's also usually recommended to remove the least frequent words, because they likely appear by chance and carry a high risk of overfitting. Overfitting is when the model assumes a strong association between a particular word and a label even though this association is based on only one or two examples that happened by chance.
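
Both points can be seen in a short sketch, again assuming sklearn's CountVectorizer (toy texts invented here): min_df removes the rarest words from the vocabulary, and transform() silently ignores any word that was not in the training vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["cat sat mat", "cat ran", "dog ran mat"]

# min_df=2: keep only words that appear in at least 2 training documents,
# discarding the rarest ones ("sat", "dog").
vec = CountVectorizer(min_df=2)
vec.fit(train_texts)
print(vec.get_feature_names_out())       # ['cat' 'mat' 'ran']

# "elephant" is out-of-vocabulary: it never appeared during training,
# so it is simply dropped at transform time.
print(vec.transform(["cat elephant"]).toarray())   # [[1 0 0]]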
