Refers to general procedures that attempt to determine the generalizability of a statistical result. Cross-validation arises frequently in the context of assessing how well a particular model fit predicts future observations. Methods for cross-validation usually involve withholding a random subset of the data during model fitting, quantifying how accurately the withheld data are predicted, and repeating this process to obtain a measure of prediction accuracy.
Questions tagged [cross-validation]
646 questions
206 votes, 18 answers
Train/Test/Validation Set Splitting in Sklearn
How could I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, y_val with scikit-learn?
As far as I know, sklearn.model_selection.train_test_split is only capable of splitting into two not…
Hendrik
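Not from the thread itself, but a minimal sketch of one common pattern: call train_test_split twice, first to carve off the test set and then to split the remainder into train and validation. The toy X, y and the 60:20:20 sizes here are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the question's data matrix and label vector.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# First carve off the test set, then split the remainder into train/val.
# The sizes below give a 60:20:20 split and are purely illustrative.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%
```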
59 votes, 4 answers
What is the difference between bootstrapping and cross-validation?
I used to apply K-fold cross-validation for robust evaluation of my machine learning models, but I'm aware that bootstrapping can serve this purpose as well. However, I cannot see the main difference between them in terms of…
Fredrik
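A hedged sketch of how the two resampling schemes differ in code, using a toy dataset and classifier that are assumptions rather than anything from the question: K-fold holds each observation out exactly once, while the bootstrap samples rows with replacement and scores on the out-of-bag rows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K-fold CV: every observation is held out exactly once across the k folds.
cv_scores = cross_val_score(model, X, y, cv=5)

# Bootstrap: draw n rows with replacement, score on the out-of-bag rows
# that were never drawn, and repeat to build a distribution of scores.
rng = np.random.default_rng(0)
boot_scores = []
for _ in range(100):
    idx = rng.integers(0, len(X), len(X))        # bootstrap sample indices
    oob = np.setdiff1d(np.arange(len(X)), idx)   # out-of-bag indices
    model.fit(X[idx], y[idx])
    boot_scores.append(model.score(X[oob], y[oob]))

print(cv_scores.mean(), np.mean(boot_scores))
```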
48 votes, 2 answers
How does the validation_split parameter of Keras' fit function work?
validation_split in the Keras Sequential model's fit function is documented as follows at https://keras.io/models/sequential/:
validation_split: Float between 0 and 1. Fraction of the training data
to be used as validation data. The model will set…
rnso
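A minimal sketch of the behaviour the documentation describes, with an illustrative model and random data (none of it from the question): Keras holds out the last fraction of the arrays, taken before any shuffling, and reports validation metrics on it each epoch.

```python
import numpy as np
from tensorflow import keras

# Illustrative data and model; validation_split=0.2 holds out the *last*
# 20% of the arrays (taken before shuffling) and never trains on it.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
# history.history now contains val_loss / val_accuracy alongside the
# training metrics, one entry per epoch.
```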
40 votes, 3 answers
Why use both validation set and test set?
Consider a neural network:
For a given set of data, we divide it into training, validation, and test sets. Suppose we do it in the classic 60:20:20 ratio; then we prevent overfitting by validating the network, checking it on the validation set. Then…
user1825567
37 votes, 2 answers
How to use the output of GridSearch?
I'm currently working with Python and scikit-learn for classification purposes, and after doing some reading on GridSearch I thought this was a great way to optimise my estimator parameters and get the best results.
My methodology is this:
Split my…
Dee Carter
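A sketch of one way to consume a GridSearchCV result, with an illustrative SVC grid that is not from the question: with the default refit=True, the winning configuration is already refit on the full training set and exposed as best_estimator_.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # winning parameter combination
print(search.best_score_)    # its mean cross-validated score
# With refit=True (the default), best_estimator_ has already been refit on
# all of X_train, so it (or `search` itself) can be used on new data.
print(search.best_estimator_.score(X_test, y_test))
```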
35 votes, 3 answers
Does modeling with Random Forests require cross-validation?
As far as I've seen, opinions tend to differ about this. Best practice would certainly dictate using cross-validation (especially if comparing RFs with other algorithms on the same dataset). On the other hand, the original source states that the…
neuron
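Part of why opinions differ is the forest's built-in out-of-bag estimate. A small sketch (dataset and settings are illustrative, not from the question) of how oob_score=True yields a generalization estimate without an explicit cross-validation loop:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree sees a bootstrap sample, so roughly a third of the rows are
# "out of bag" for that tree; oob_score_ aggregates predictions from only
# those trees, giving a built-in estimate of generalization performance.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)
```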
35 votes, 6 answers
Merging multiple data frames row-wise in PySpark
I have 10 data frames of type pyspark.sql.dataframe.DataFrame, obtained from randomSplit as (td1, td2, td3, td4, td5, td6, td7, td8, td9, td10) = td.randomSplit([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1], seed=100). Now I want to join 9 td's into a single…
krishna Prasad
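A sketch of one common row-wise merge, assuming Spark 2.x or later where DataFrame.union is available (older versions use unionAll); the toy td below just stands in for the question's data.

```python
from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for td, split into ten random pieces as in the question.
td = spark.range(1000).withColumnRenamed("id", "value")
splits = td.randomSplit([0.1] * 10, seed=100)

# Row-wise merge: union requires matching schemas, which holds here since
# every piece came from the same randomSplit call.
merged = reduce(DataFrame.union, splits[:9])   # first nine pieces
held_out = splits[9]
print(merged.count(), held_out.count())
```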
33 votes, 2 answers
How to calculate the fold number (k-fold) in cross validation?
I am confused about how to choose the number of folds (in k-fold CV) when I apply cross-validation to check the model. Does it depend on the data size or on other parameters?
Taimur Islam
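A small sketch, with an illustrative model and dataset, of how the choice of k can be probed empirically by comparing the mean and spread of cross-validated scores for several values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# k = 5 or 10 are the usual defaults: larger k gives each model more
# training data per fold but costs more fits and tends to produce
# noisier fold-to-fold estimates.
for k in (3, 5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"k={k}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```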
28 votes, 4 answers
Cross validation Vs. Train Validate Test
I have a question regarding the cross-validation approach and the train-validation-test approach.
I was told that I can split a dataset into 3 parts:
Train: we train the model.
Validation: we validate and adjust model parameters.
Test: never seen before…
NaveganTeX
19 votes, 3 answers
What is the proper way to use early stopping with cross-validation?
I am not sure what the proper way is to use early stopping with cross-validation for a gradient boosting algorithm. For a simple train/validation split, we can use the validation set as the evaluation set for early stopping, and when refitting we…
amine456
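One possible pattern, as a sketch under assumptions rather than a definitive answer (it uses scikit-learn's HistGradientBoostingClassifier instead of whatever library the question had in mind): let early stopping use an internal slice of each training fold, keep the outer fold purely for scoring, and reuse the average stopping iteration when refitting on all the data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores, stop_iters = [], []
for train_idx, valid_idx in kf.split(X):
    # Early stopping monitors a slice of the *training* fold
    # (validation_fraction), so the outer valid fold stays untouched.
    gbm = HistGradientBoostingClassifier(
        max_iter=1000, early_stopping=True,
        validation_fraction=0.1, n_iter_no_change=20, random_state=0)
    gbm.fit(X[train_idx], y[train_idx])
    scores.append(gbm.score(X[valid_idx], y[valid_idx]))
    stop_iters.append(gbm.n_iter_)

# One common heuristic when refitting on all data: reuse the average
# stopping iteration found during cross-validation.
final = HistGradientBoostingClassifier(
    max_iter=int(np.mean(stop_iters)), early_stopping=False, random_state=0)
final.fit(X, y)
print(np.mean(scores), int(np.mean(stop_iters)))
```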
15 votes, 1 answer
Stratify on regression
I have worked on classification problems, and stratified cross-validation is one of the most useful and simple techniques I've found. In that case, what it means is to build a training and validation set that have the same proportions of classes of…
David Masip
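A minimal sketch of the usual workaround, with an illustrative bundled dataset: discretize the continuous target into quantile bins and pass the bin labels to StratifiedKFold in place of the raw target.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import StratifiedKFold

X, y = load_diabetes(return_X_y=True)

# StratifiedKFold needs discrete labels, so bin the continuous target
# (here into quintiles) and stratify on the bin index instead of y itself.
cut_points = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_binned = np.digitize(y, cut_points)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(X, y_binned):
    # Each fold now has a roughly similar distribution of target values.
    print(round(y[valid_idx].mean(), 1))
```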
15 votes, 2 answers
Can overfitting occur even with validation loss still dropping?
I have a convolutional + LSTM model in Keras, similar to this (ref 1), that I am using for a Kaggle contest. Architecture is shown below. I have trained it on my labeled set of 11000 samples (two classes, initial prevalence is ~9:1, so I upsampled…
DeusXMachina
15 votes, 3 answers
How to choose a classifier after cross-validation?
When we do k-fold cross validation, should we just use the classifier that has the highest test accuracy? What is generally the best approach in getting a classifier from cross validation?
Armon Safai
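A sketch of the usual recommendation, with illustrative candidate models that are not from the question: cross-validation scores a configuration rather than any single fold's fitted classifier, so the best-scoring configuration is refit on all of the training data afterwards.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Cross-validation scores each *configuration*, not one fold's fitted model.
candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "svm": SVC(gamma="scale"),
}
cv_means = {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in candidates.items()}

# Pick the configuration with the best mean CV score, then refit it on the
# full training data rather than reusing any single fold's classifier.
best_name = max(cv_means, key=cv_means.get)
final_model = candidates[best_name].fit(X, y)
print(best_name, cv_means)
```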
14 votes, 2 answers
Validation vs. test vs. training accuracy. Which one should I compare for claiming overfit?
I have read in several answers here and on the Internet that cross-validation helps to indicate whether the model will generalize well or not, and about overfitting.
But I am confused about which two accuracies/errors among…
A.B
13 votes, 2 answers
Cross-validation: K-fold vs Repeated random sub-sampling
I wonder which type of model cross-validation to choose for a classification problem: K-fold or random sub-sampling (bootstrap sampling)?
My best guess is to use 2/3 of the data set (which is ~1000 items) for training and 1/3 for validation.
In this…
IgorS
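A small sketch of the two schemes side by side, using an illustrative dataset and model rather than the question's data: KFold partitions the data into disjoint folds so every sample is validated exactly once, while ShuffleSplit draws a fresh 2/3 train / 1/3 validation split on every repetition, so samples may be reused or skipped.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# K-fold: every sample is validated exactly once across k disjoint folds.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)

# Repeated random sub-sampling: a fresh 2/3 train / 1/3 validation split on
# every repetition; samples may appear in several validation sets or none.
shuffle = ShuffleSplit(n_splits=10, test_size=1 / 3, random_state=0)

print(cross_val_score(model, X, y, cv=kfold).mean())
print(cross_val_score(model, X, y, cv=shuffle).mean())
```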