
This is a very general question, so let's take a very general example: imagine a CNN model that distinguishes between images of dog and cat faces. We have two kinds of training data set: one with full features and one that is defective. By defective I mean the images contain noise and lack important details, BUT they are still dog and cat faces; so "defective" here means a noisy subset of the full-feature images (imagine a noisy dog's face without a nose or ears!). Now:

IF WE KNOW that the testing data set will consist of defective images, is it better to train the CNN on the defective data set or on the full-feature data set?

For the main answer, let's assume that denoising the datasets is not possible.


2 Answers


When you know that the testing dataset will contain defective images (e.g., noisy, missing features, or other distortions), it is generally better to train your CNN model with the defective dataset rather than with clean, full-feature images. Here's why:

1. The Model Can Adapt to Noise:

A CNN trained on clean, full-feature images may not perform well on noisy or defective images because it learns to extract features assuming the presence of complete, high-quality data. The model may rely on specific image details that are missing or degraded in the testing data, leading to poor generalization and performance when faced with defects or noise during testing.

On the other hand, when the CNN is trained on defective images (which have noise and missing details), it learns to identify the most robust features that can still distinguish between the categories (e.g., dog vs. cat) despite the noise or missing parts. This enables the model to adapt to the characteristics of the testing data, which is also defective.
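To make this concrete, one common way to obtain "defective" training images when only clean ones are available is to corrupt them synthetically. Here is a minimal NumPy sketch; the function name `make_defective` and the noise and masking parameters are illustrative assumptions, not something from the question:

```python
import numpy as np

def make_defective(image, noise_std=0.1, mask_size=8, rng=None):
    """Return a 'defective' copy of a grayscale image in [0, 1]:
    additive Gaussian noise plus a blanked-out square patch
    (mimicking a missing feature such as a nose or an ear)."""
    rng = np.random.default_rng(rng)
    noisy = image + rng.normal(0.0, noise_std, size=image.shape)
    # Blank out a random square region to simulate a missing detail.
    h, w = image.shape
    top = rng.integers(0, h - mask_size)
    left = rng.integers(0, w - mask_size)
    noisy[top:top + mask_size, left:left + mask_size] = 0.0
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((32, 32), 0.5)
defective = make_defective(clean, rng=0)
```

Training on images produced this way exposes the network to the same kinds of degradation it will see at test time, which is exactly the adaptation effect described above.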

2. Distribution Match Between Training and Testing Data:

By definition (according to Wikipedia):

Test Data Set: A test data set is a data set that is independent of the training data set, but that follows the same probability distribution as the training data set.

If your training data has full features and the testing data is defective (i.e. does not follow the same probability distribution as the training set), the model might not be able to recognize patterns or make accurate predictions in the real-world scenario, where the testing images are noisy. This "domain shift" between clean training data and noisy testing data can severely affect performance. Training on defective data essentially "prepares" the model for the type of data it will encounter during testing.

However, if you're forced to train on clean data, the model might rely too heavily on specific, clean features that do not appear in the defective data. As a result, the model's ability to generalize will be reduced when exposed to noisy or incomplete test samples.
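A quick sanity check for such a domain shift is to compare simple statistics of the two sets: a large gap between the clean training images and the defective test images suggests the model will face inputs unlike anything it trained on. A hedged sketch (the synthetic data and the idea of comparing pixel-intensity standard deviations are illustrative, not a rigorous shift measure):

```python
import numpy as np

def pixel_stats(images):
    """Mean and standard deviation of pixel intensities across a set."""
    flat = np.stack(images).ravel()
    return flat.mean(), flat.std()

rng = np.random.default_rng(42)
clean_set = [np.full((16, 16), 0.5) for _ in range(10)]
noisy_set = [img + rng.normal(0.0, 0.2, img.shape) for img in clean_set]

clean_mean, clean_std = pixel_stats(clean_set)
noisy_mean, noisy_std = pixel_stats(noisy_set)
shift = abs(noisy_std - clean_std)  # large gap -> the sets differ markedly
```

In practice you would use stronger tools (e.g., comparing feature-space distributions), but even this crude check makes the "same probability distribution" requirement from the definition above measurable.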

3. Robust Feature Learning:

Defective images might still contain enough information to learn important features, even if they are noisy or incomplete. By training on these defective images, the CNN learns to extract features that are more robust to the types of distortions that will be present in the test set.

This is similar to the concept of Data Augmentation Technique in which the model is exposed to varied forms of the data (e.g., rotations, translations, etc.), helping it become more resilient to input variability.
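The parallel with data augmentation can be sketched as an on-the-fly corruption step applied during training, so each epoch sees a differently degraded version of the same image. The specific transforms below (flip, 90° rotation, Gaussian noise) are illustrative choices:

```python
import numpy as np

def augment(image, rng):
    """Randomly flip, rotate by a multiple of 90 degrees, and add noise --
    the kind of input variability data augmentation exposes the model to."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)
    out = np.rot90(out, k=rng.integers(0, 4))
    out = out + rng.normal(0.0, 0.05, size=out.shape)
    return out

rng = np.random.default_rng(0)
image = np.arange(16.0).reshape(4, 4)
batch = [augment(image, rng) for _ in range(8)]  # 8 distinct variants
```

Because the corruption is random per call, the network never sees exactly the same input twice, which discourages it from memorizing clean-image details.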

So, based on the hypothetical scenario you described with images of dogs and cats, I would recommend training the CNN on a mix of both defective and non-defective images, so that it generalizes better when classifying a dog or cat image (even a defective one). With respect to the testing set, you must also make sure it contains both defective and non-defective images (it has to follow a probability distribution similar to the training set's).
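This mixed-training-set recommendation can be sketched as follows; the 50/50 split and the `mix_datasets` helper are illustrative, and in practice you would tune the defective fraction (and keep the test split's ratio similar):

```python
import numpy as np

def mix_datasets(clean, defective, defective_fraction=0.5, rng=None):
    """Build a shuffled training set drawing the given fraction
    from the defective pool and the rest from the clean pool."""
    rng = np.random.default_rng(rng)
    n = min(len(clean), len(defective))
    n_def = int(n * defective_fraction)
    pool = defective[:n_def] + clean[:n - n_def]
    rng.shuffle(pool)  # Generator.shuffle accepts mutable sequences
    return pool

# Toy stand-ins for labeled images: (source, index) pairs.
clean = [("clean", i) for i in range(10)]
defective = [("defective", i) for i in range(10)]
train = mix_datasets(clean, defective, defective_fraction=0.5, rng=1)
```

Shuffling matters here: without it, mini-batches would contain only one kind of image at a time, which destabilizes training.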


A good rule of thumb is that the training dataset should be similar to the testing dataset. Hence, if you test with bad images, you should train with bad images. If you train with only good images, the network is likely to learn patterns present only in good images, knowledge it has no use for at test time.

That said, training with both good and bad images may be even better. You can use various data augmentation techniques to increase the amount of training data and to decrease the likelihood of overfitting.