
I have just completed the Machine Learning with R course on cognitiveclass.ai and have begun experimenting with random forests.

I built a model using the "randomForest" library in R. The model classifies into two classes: good and bad.

I know that when a model is overfit, it performs well on data from its own training set but badly on out-of-sample data.

To train and test my model, I shuffled the complete dataset and split it into 70% for training and 30% for testing.
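A minimal base-R sketch of such a shuffle-and-split (the data frame and column names here are illustrative, not from the question's dataset):

```r
set.seed(42)  # for reproducibility

# Illustrative data frame: one feature column plus a two-class target
df <- data.frame(dtw_cost = runif(100),
                 class    = factor(rep(c("good", "bad"), 50)))

# Draw a random 70% of row indices for training; the rest is the test set
n         <- nrow(df)
train_idx <- sample(n, size = floor(0.7 * n))
train_set <- df[train_idx, ]
test_set  <- df[-train_idx, ]
```

Because `sample` draws indices without replacement, no row can land in both sets.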

My question: I am getting 100% accuracy on the predictions for the testing set. Is this bad? It seems too good to be true.

The objective is waveform recognition on four mutually dependent waveforms. The features of the dataset are the cost results of Dynamic Time Warping analyses of waveforms against their target waveform.

Stephen Rauch
Milan van Dijck

2 Answers


A high validation score such as this generally means that you are not overfitting; however, it should still prompt caution, because it may indicate that something went wrong. It could also simply mean that the problem is not very difficult and that your model truly performs well. Two things that could have gone wrong:

  • You didn't split the data properly, and some of the validation data also occurred in your training data. In that case the score does indicate overfitting, because you are no longer measuring generalization.
  • You used feature engineering to create additional features and may have introduced target leakage, where a row's features use information from that row's own target, not just from other rows in your training set.
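A quick base-R sanity check for the first failure mode, assuming `train_set` and `test_set` data frames like those described in the question (the toy data here is illustrative):

```r
# Toy train/test frames; row with x = 3 appears in both
train_set <- data.frame(x = 1:7)
test_set  <- data.frame(x = c(8, 9, 3))

# merge() keeps only rows whose values match across both frames,
# so any surviving row is a train/test overlap
overlap <- nrow(merge(train_set, test_set))
cat("overlapping rows:", overlap, "\n")
```

Any non-zero overlap means the test accuracy partly measures memorization rather than generalization.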
Jan van der Vegt

Investigate what your most predictive features are. Sometimes you may have accidentally included your target (or something equivalent to your target) among your features.
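One way to do that with the randomForest package is to inspect the fitted forest's variable importance; the data and object names below are illustrative, with a `leaked` column deliberately constructed from the target to show what leakage looks like:

```r
library(randomForest)

set.seed(1)
# Illustrative data: one ordinary feature plus an accidental copy of the target
df <- data.frame(dtw_cost = runif(100),
                 class    = factor(rep(c("good", "bad"), 50)))
df$leaked <- as.numeric(df$class)  # equivalent to the target -> leakage

rf_model <- randomForest(class ~ ., data = df)

# Mean decrease in Gini impurity per feature; one feature dwarfing all
# the others is a classic leakage red flag
print(importance(rf_model))
varImpPlot(rf_model)
```

Here `leaked` should dominate the importance ranking, which is exactly the pattern to look for when a model's test accuracy seems too good to be true.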

tom