
I asked this question on Stack Overflow and was told that this is a better place for it.

I am confused by the terms validation and testing. Is validating the model the same as testing it? Is it possible to use the testing data for validation?

What confuses me even more is when to use validation. Is it a necessary step for the model? Also, is it possible to do validation instead of testing?

Also, can the training data be the same as the validation data?

Also, can you tell me if this code does testing? It is really confusing me:

model.fit_generator(
    training_gen(1000,25),
    steps_per_epoch=50,
    epochs=10000,
    validation_data=validation_gen(1000, 25),
    validation_steps=1,
    callbacks=[checkpoint],
    verbose=2)

model.load_weights('./temp_trained_25.h5')

BER = []
for SNR in range(5, 30, 5):
    y = model.evaluate(validation_gen(10000, SNR), steps=1)
    BER.append(y[1])
    print(y)
print(BER)

noting that training_gen and validation_gen are:

def training_gen(bs, SNRdb = 20):
    while True:
        index = np.random.choice(np.arange(train_size), size=bs)
        H_total = channel_train[index]
        input_samples = []
        input_labels = []
        for H in H_total:
            bits = np.random.binomial(n=1, p=0.5, size=(payloadBits_per_OFDM,))
            signal_output, para = ofdm_simulate(bits, H, SNRdb)
            input_labels.append(bits[0:16])
            input_samples.append(signal_output)
        yield (np.asarray(input_samples), np.asarray(input_labels))

def validation_gen(bs, SNRdb = 20):
    while True:
        index = np.random.choice(np.arange(train_size), size=bs)
        H_total = channel_train[index]
        input_samples = []
        input_labels = []
        for H in H_total:
            bits = np.random.binomial(n=1, p=0.5, size=(payloadBits_per_OFDM,))
            signal_output, para = ofdm_simulate(bits, H, SNRdb)
            input_labels.append(bits[0:16])
            input_samples.append(signal_output)
        yield (np.asarray(input_samples), np.asarray(input_labels))

I'm quite new to deep learning and it seems like everything confuses me. Sorry if my questions seem dumb or unreasonable, but if you can help me figure out this confusion I would be thankful.

Thanks in advance!

besa

3 Answers


Usually you first split your dataset into a train set and a test set, and then, if your model training process requires a validation set, you further split the train set into the final train set and the validation set. A simple rule is that the test set never shows up in your model development process, including when you develop your data preprocessing steps (such as your data normalizer).
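
For example, a minimal sketch of this two-stage split using scikit-learn's train_test_split (the toy data and split ratios are illustrative, not from the question):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)  # toy data

# First split: hold out the test set; it never appears in model development.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Second split: carve the validation set out of the remaining development data.
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)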

You need a validation set in the following cases:

  1. Training a gradient boosted decision tree (lgbm, xgboost, etc.) with early stopping enabled, because the model needs to be evaluated on a validation set after each boosting round to check whether the early-stopping criterion is satisfied.

  2. Training a neural network. This is optional but recommended, because in addition to the training score curve you get a validation score curve, which lets you monitor whether your model begins to overfit. It becomes required when you use early stopping (see the Keras sketch after this list).

  3. You are doing cross validation. The idea is to fit the model with the same hyperparameter set N times, each time with a different train set and validation set. This way you learn how the same set of hyperparameters performs across different data splits.
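
Regarding point 2, here is a minimal sketch with tf.keras and toy data (not the asker's model), showing how validation_data feeds both the validation-loss curve and the early-stopping decision:

import numpy as np
import tensorflow as tf

X_train, y_train = np.random.rand(800, 20), np.random.randint(0, 2, 800)  # toy data
X_val, y_val = np.random.rand(200, 20), np.random.randint(0, 2, 200)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The validation set drives both the val_loss curve and the early-stopping decision.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100,
          callbacks=[early_stop],
          verbose=0)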

One validation set should serve only one purpose, so if you do both 1 and 3, you first split your data into a train set and a test set. Then, at each cross-validation round (out of N rounds), you split your train set into another train set and the $1^{st}$ validation set. Then, when training your GBDT model, you again split that train set into the final train set and your $2^{nd}$ validation set. Your $1^{st}$ validation set is for cross validation; your $2^{nd}$ validation set is for GBDT early stopping.
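
A sketch of that nested splitting (toy data; the actual GBDT fit is left as a comment because the early-stopping call depends on which library you use):

import numpy as np
from sklearn.model_selection import KFold, train_test_split

X, y = np.random.rand(500, 10), np.random.randint(0, 2, 500)  # toy train set (test set already held out)

# Outer loop: cross validation (case 3).
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_tr, y_tr = X[tr_idx], y[tr_idx]          # this round's train set
    X_val1, y_val1 = X[val_idx], y[val_idx]    # 1st validation set: scores this CV round

    # Inner split: 2nd validation set, used only for GBDT early stopping (case 1).
    X_fit, X_val2, y_fit, y_val2 = train_test_split(X_tr, y_tr, test_size=0.2, random_state=0)

    # model.fit(X_fit, y_fit, eval_set=[(X_val2, y_val2)], ...)   # early stopping watches X_val2
    # cv_score = model.score(X_val1, y_val1)                      # CV score comes from X_val1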

For your code, I see two potential problems:

  1. If your source of data is channel_train, then both training_gen and validation_gen draw from the same source and differ only in the indices picked by the random generator. This is a problem because you do not guarantee that your training data and your validation data are disjoint.

  2. You call validation_gen two times. The first call serves the purpose of point 2 that I stated above. The second call should serve the purpose of testing -- which is what you asked about. But again, it is not certain that the test data and the training data do not overlap.

Therefore, you have code that does everything, but you may not have made sure that the train/valid/test data are mutually exclusive.
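
If the overlap is unintended, one possible fix (only a sketch; it reuses channel_train, train_size, payloadBits_per_OFDM and ofdm_simulate from your question, and the split ratios are arbitrary) is to partition the channel indices once, up front, and give each generator its own index pool:

# Partition the channel realisations once, so the generators can never draw the same channel.
all_idx = np.random.permutation(train_size)
n_val, n_test = int(0.1 * train_size), int(0.1 * train_size)
val_idx   = all_idx[:n_val]
test_idx  = all_idx[n_val:n_val + n_test]
train_idx = all_idx[n_val + n_test:]

def make_gen(idx_pool, bs, SNRdb=20):
    # Same body as training_gen, but sampling only from its own index pool.
    while True:
        index = np.random.choice(idx_pool, size=bs)
        H_total = channel_train[index]
        input_samples, input_labels = [], []
        for H in H_total:
            bits = np.random.binomial(n=1, p=0.5, size=(payloadBits_per_OFDM,))
            signal_output, para = ofdm_simulate(bits, H, SNRdb)
            input_labels.append(bits[0:16])
            input_samples.append(signal_output)
        yield np.asarray(input_samples), np.asarray(input_labels)

# training_gen -> make_gen(train_idx, ...), validation_gen -> make_gen(val_idx, ...),
# and the final BER evaluation loop should use make_gen(test_idx, ...).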

Raymond Kwok

You can use the testing data to perform hyperparameter optimization, to see which hyperparameters of your model pipeline work best. The validation data is then used only once, to see how the whole model pipeline performs on out-of-sample data. For this step the test dataset cannot be used again, as that data was already used to select the best hyperparameters. The processes of testing and validating are the same (i.e. measuring the performance of your model on out-of-sample data), but the stage of model development at which you apply these concepts is different.
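
A minimal sketch of that workflow with scikit-learn (toy data; the estimator and parameter grid are illustrative, and the split names follow this answer's terminology: one split for tuning, one hold-out split for the single final check):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = np.random.rand(400, 8), np.random.randint(0, 2, 400)  # toy data

# Hold out one split for the single, final out-of-sample check.
X_tune, X_holdout, y_tune, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)

# Hyperparameter optimisation only ever sees the tuning split (internal CV).
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
                      cv=5)
search.fit(X_tune, y_tune)

# The hold-out split is touched exactly once, after tuning has finished.
print(search.best_params_, search.best_estimator_.score(X_holdout, y_holdout))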

Oxbowerce

This question is evidence that the scientific method has sort of gotten lost in the way the ML world communicates about models, and it causes students to get confused when they enter industry jobs.

Let's illustrate the difference through the metaphor of a soccer team training for the upcoming season.

  • Most of the time will be spent doing drills for core skills like passing, kicking, defense, etc. and working on formations. This is training data.

  • After each practice session the team will play a scrimmage game against itself to see how well the players can put their skills into practice. This is validation data.

  • Finally, the team might do a preseason exhibition match against a neighboring rival to see how ready they are for the season to begin. This is test data.

When you're in school, you often start with a big pile of labelled data and randomly partition it into training, validation, and test datasets. This isn't necessarily bad, but it obscures the different purposes that the datasets serve.

Training data should be cheap, abundant, and diverse - it's what your model is going to use to actually learn. Validation data is there to make sure your model really is getting better during the training process - you don't want a soccer team that's great at drills but terrible at actually playing the game. And test data is a final check that your model can perform in conditions as close as possible to what it will see when it's live. Often testing will look very different from training and validation - for instance, you might train and validate a classifier on labelled input data, but then test the model on some downstream task that uses the classifier as a source of input features.

Paul Siegel