11

Many times I have analysed a dataset on which I could not really build any useful classifier. To see whether a classifier is feasible at all, I usually go through the following steps (a rough code sketch follows the list):

  1. Generate box plots of the numerical features grouped by label.
  2. Reduce the dimensionality to 2 or 3 to see whether the classes are separable; I have also tried LDA sometimes.
  3. Fit SVMs and Random Forests anyway and look at the feature importances to see whether the features make any sense.
  4. Change the balance of the classes with techniques like under-sampling and over-sampling to check whether class imbalance might be the issue.
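
A rough sketch of these steps in code (assuming a pandas DataFrame loaded from a hypothetical `data.csv` with a `label` column; PCA here is just one choice for the dimensionality reduction):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data.csv")                    # hypothetical file name
X, y = df.drop(columns="label"), df["label"]    # "label" column is assumed

# 1. Box plots of every numeric feature grouped by the label
df.boxplot(by="label")

# 2. Project to two dimensions; scatter-plot X_2d coloured by y to eyeball separability
X_2d = PCA(n_components=2).fit_transform(X)

# 3. Fit a Random Forest anyway: cross-validated score plus feature importances
rf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(rf, X, y, cv=5, scoring="f1_macro").mean())
rf.fit(X, y)
print(sorted(zip(rf.feature_importances_, X.columns), reverse=True))

# 4. For the imbalance check, repeat the above after under-/over-sampling,
#    e.g. with imbalanced-learn's RandomUnderSampler / RandomOverSampler.
```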

There are many other approaches I can think of but have not tried. Sometimes I know that the features are poor and not at all related to the label we are trying to predict; I then use that business intuition to end the exercise, concluding that we need better features or entirely different labels.

My question is: how does a data scientist report that classification cannot be done with these features? Is there a statistical way to report this, or is fitting the data with different algorithms and looking at validation metrics the best option?


2 Answers

5

Take a sample element from one class and a sample element from the other class. Is it possible for these two elements to have exactly the same feature vector? If that can ever happen, then the two classes are not completely separable using your current feature vectors (since the classification decision is based entirely on the feature vector of a given element).

On the other hand, if *every* element in one class has a corresponding element in the other class such that the two elements have the same feature vector, then the two classes are indistinguishable using your current feature vectors.

Furthermore, if that condition holds for some of your elements but not others, then you are somewhere in between, and you can use the proportion of such elements as a basis to measure how well you can hope a classifier will perform using your current feature set.
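
A rough way to quantify that "in between" case (a sketch assuming a pandas DataFrame `df` with a `label` column; the names are placeholders):

```python
import pandas as pd

df = pd.read_csv("data.csv")   # hypothetical file
features = [c for c in df.columns if c != "label"]

# For each distinct feature vector, count how many different labels it carries.
labels_per_vector = df.groupby(features)["label"].nunique().rename("n_labels")

# Rows whose feature vector also occurs with a different label are ambiguous:
# no classifier that sees only these features can classify all of them correctly.
ambiguous = df.merge(labels_per_vector.reset_index(), on=features)["n_labels"].gt(1)

print(f"{ambiguous.mean():.1%} of rows share a feature vector with another class")
```

With continuous features exact ties are rare, so in practice you would round or bin the features before running this kind of check.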

All of these evaluations can be used to argue to varying degrees that you need to extract more features.

4

It depends on your data. There is something called human-level error. For a task like reading printed books, humans do not struggle and hardly ever make a mistake unless the print quality is bad. For a task like reading handwritten manuscripts, it often happens that not every word can be understood if the writer's handwriting is unfamiliar to the reader. In the first situation the human-level error is very low, and learning algorithms can reach the same performance; the second example illustrates that in some situations the human-level error is very high, and (if you use the same features that humans use) your learning algorithm will usually have a similarly high error rate.

In statistical learning there is something called the Bayes error: wherever the distributions of the classes overlap, the error rate is large. Without changing the features, the Bayes error of the current distributions is the best achievable performance and cannot be reduced at all.
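
As a toy illustration of this (a sketch with two overlapping 1-D Gaussian classes and equal priors), the Bayes error is the area under the pointwise minimum of the two prior-weighted densities, and no classifier that only sees x can do better:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 10_000)
p0 = norm.pdf(x, loc=-1.0)   # class 0 density
p1 = norm.pdf(x, loc=+1.0)   # class 1 density

# Equal priors: Bayes error = integral of min(0.5*p0, 0.5*p1) over x
bayes_error = np.minimum(0.5 * p0, 0.5 * p1).sum() * (x[1] - x[0])
print(f"Bayes error ≈ {bayes_error:.3f}")   # ≈ 0.159 for these two classes
```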

I also suggest reading here. Problems with a large Bayes error for the chosen features are considered not classifiable within the space of those features. As another example, suppose you want to classify whether cars have their lights on. If you try to do that with pictures taken in the morning, you yourself may make lots of errors, and if you use the same images to train a learning algorithm, it may too.

I also recommend that you do not change the distribution of your classes. If you do, the classifier's output near the decision boundary becomes essentially random. The distribution of the data used to train your machine-learning algorithm should not be changed; it should stay as it is in the real-world condition.

Green Falcon