
I had a face-to-face interview for a data scientist job a few days ago. One of the questions I was asked was: for a classifier predicting the brand of a TV from some features (price, size, specs, ...) out of 4 possible brands, how do you encode the brand variable? My answer was one-hot encoding. It was accepted, but then they asked me to write it out explicitly, and I sketched something like:

brand A -> [1,0,0,0]
brand B -> [0,1,0,0]
brand C -> [0,0,1,0]
brand D -> [0,0,0,1]

I was then corrected on the grounds that these columns are not independent, and that the solution should have been three binary columns instead.

Later it hit me that I do not know why independence is required, and also that three binary variables are not independent. Two would be.

Can someone provide some explanation to help with my confusion?

Stephen Rauch
Learning is a mess

1 Answer


For certain models, it can be important to make sure the inputs are linearly independent. I'm pretty sure this is never the case for the outputs. If we have more than two classes, it's extremely difficult, if not impossible, to output class probabilities unless each class has its own dimension in the output vector. In other words, your suggestion has much more explanatory power than your interviewer's, which solves a problem that exists on the inputs for some models but not on the outputs.
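To make the input-side concern concrete, here is a minimal sketch, assuming the brand were used as an input feature of a linear model with an intercept column (that model choice is my assumption, not something from the interview). The `matrix_rank` helper and the sample list are hypothetical and only there for illustration. With all four indicator columns, the intercept equals their sum, so the design matrix is rank deficient; dropping one indicator restores full column rank.

```python
from fractions import Fraction

BRANDS = ["A", "B", "C", "D"]

def one_hot(brand, drop_first=False):
    """Indicator row for a brand; optionally drop the first category."""
    row = [1 if brand == b else 0 for b in BRANDS]
    return row[1:] if drop_first else row

def matrix_rank(rows):
    """Rank via Gauss-Jordan elimination with exact fractions (illustrative helper)."""
    m = [[Fraction(x) for x in r] for r in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Made-up sample of brands used as an INPUT feature.
samples = ["A", "B", "C", "D", "B", "A"]

# Intercept + 4 indicators: 5 columns, but intercept = A + B + C + D.
full = [[1] + one_hot(b) for b in samples]
# Intercept + 3 indicators: 4 columns, brand A becomes the all-zeros baseline.
reduced = [[1] + one_hot(b, drop_first=True) for b in samples]

rank_full = matrix_rank(full)
rank_reduced = matrix_rank(reduced)
print(len(full[0]), rank_full)
print(len(reduced[0]), rank_reduced)
```

None of this applies to the target of a classifier, where the four columns are simply the labels the model is trained to predict.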

In multi-class problems, it is indeed the norm that each class has its own dimension in the output vector, as you suggested. Either the interviewer was trying to look smart and confused input and output encoding (a concern that would still only apply to certain families of models), or you aren't remembering the events of the interview properly. If you're confident that the interview went as you described, you should consider complaining to your point of contact that you were criticized for giving a correct answer.

Feel free to cite the following as popular/canonical examples of multi-class classifiers:

Returns: array of shape = [n_samples, n_classes]

when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample)
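As a rough illustration of why one output dimension per class is convenient, here is a small sketch in plain Python rather than any particular library. The function names and the scores are made up for the example. It builds a 4-dimensional one-hot target, turns raw scores into class probabilities with a softmax, and computes the categorical cross-entropy between the two.

```python
import math

CLASSES = ["A", "B", "C", "D"]

def one_hot_target(label):
    """One dimension per class: 1.0 at the true class, 0.0 elsewhere."""
    return [1.0 if label == c else 0.0 for c in CLASSES]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(target, probs):
    """With a one-hot target this is -log of the probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Made-up raw scores for one sample, one score per brand.
scores = [2.0, 0.5, -1.0, 0.1]
probs = softmax(scores)
target = one_hot_target("A")
loss = categorical_cross_entropy(target, probs)

print(probs)
print(loss)
```

Each entry of `probs` lines up with one entry of `target`, which is what lets the loss read off the predicted probability of the true class directly.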

David Marx