206

How could I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test and y_val with scikit-learn?

As far as I know, sklearn.model_selection.train_test_split is only capable of splitting into two, not into three...

Arun
Hendrik

18 Answers

246

You could just use sklearn.model_selection.train_test_split twice: first to split into train and test, and then to split train again into train and validation. Something like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)  # 0.25 x 0.8 = 0.2
hh32
82

There is a great answer to this question over on SO that uses numpy and pandas.

The command (see the answer for the discussion):

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

produces a 60%, 20%, 20% split for training, validation and test sets.

0_0
50

Adding to @hh32's answer, while respecting any predefined proportions such as (75, 15, 10):

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)

# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))

print(x_train, x_val, x_test)

Andrei Florea
11

You can use train_test_split twice. I think this is the most straightforward way.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

In this way, the train, val and test sets will be 60%, 20% and 20% of the dataset respectively.

Stephen Rauch
David Jung
7

Most often you will not split the data just once: in a first step you split your data into a training and a test set, and subsequently you perform a parameter search that incorporates more complex splitting schemes, such as k-fold cross-validation or leave-one-out (LOO), as in the sketch below.
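A minimal sketch of that workflow, assuming X and y as in the question; the SVC estimator and the parameter grid are only placeholders:

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.svm import SVC

# Hold out a test set once.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Search hyperparameters with 5-fold cross-validation on the training portion,
# instead of a single fixed validation split.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=KFold(n_splits=5))
search.fit(X_train, y_train)

print(search.best_params_, search.score(X_test, y_test))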

JLT
7

Extension of @hh32's answer with preserved ratios.

# Defines ratios, w.r.t. whole dataset.
ratio_train = 0.8
ratio_val = 0.1
ratio_test = 0.1

# Produces test split.
x_remaining, x_test, y_remaining, y_test = train_test_split(
    x, y, test_size=ratio_test)

# Adjusts val ratio, w.r.t. remaining dataset.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining

# Produces train and val splits.
x_train, x_val, y_train, y_val = train_test_split(
    x_remaining, y_remaining, test_size=ratio_val_adjusted)

Since the remaining dataset is reduced after the first split, new ratios for the reduced dataset must be calculated:

$ R_{new} = \frac{R_{old}}{R_{remaining}}$

Jorge Barrios
4

The best answer above does not mention that if you separate twice using train_test_split without changing the partition sizes, you won't get the initially intended partition:

x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

The proportions of the validation and test sets within x_remain then change and can be computed as

new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0 
new_val_size = 1.0 - new_test_size

x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

This way, all the initially intended partition sizes are preserved.

Stephen Rauch
A.Ametov
2

I would like to summarize all the good and elegant answers.

sklearn.model_selection.train_test_split is the de facto option for a train/validation split. However, if you want a train, val and test split, then the following code can be used.

(Extending answer from 0_0)

  1. Let's say you want to do a split of 75, 15 and 10 percent. If you have data and labels in a pandas DataFrame, then use the following:
# shuffle and split
train_df, val_df, test_df = np.split(df.sample(frac=1), [int(.75*len(df)), int(.9*len(df))])
  2. Let's say you have data and labels in two different NumPy arrays:
data = np.arange(1000)
data = np.reshape(data, (100, 10))  # 100 examples with 10 features
labels = np.arange(100)             # assuming 100 different categories

print(data[3])
print(labels[3])

idx = np.random.permutation(len(data))  # get shuffled indices
x, y = data[idx], labels[idx]           # uniform shuffle of data and labels

x_train, x_val, x_test = np.split(x, [int(len(x)*0.75), int(len(x)*0.9)])  # split of 75:15:10
y_train, y_val, y_test = np.split(y, [int(len(y)*0.75), int(len(y)*0.9)])

print(len(x_train), len(x_val), len(x_test))
print(x_train[:3])
print(y_train[:3])

2

The most pythonic way of doing this would be to use ShuffleSplit (and to run it twice, as a nested loop, to get the third split):

>>> import numpy as np
>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
>>> y = np.array([1, 2, 1, 2, 1, 2])
>>> rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
>>> rs.get_n_splits(X)
5
>>> print(rs)
ShuffleSplit(n_splits=5, random_state=0, test_size=0.25, train_size=None)
>>> for train_index, test_index in rs.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [1 3 0 4] TEST: [5 2]
TRAIN: [4 0 2 5] TEST: [1 3]
TRAIN: [1 2 4 0] TEST: [3 5]
TRAIN: [3 4 1 0] TEST: [5 2]
TRAIN: [3 5 1 0] TEST: [2 4]
>>> rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=.25,
...                   random_state=0)
>>> for train_index, test_index in rs.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [1 3 0] TEST: [5 2]
TRAIN: [4 0 2] TEST: [1 3]
TRAIN: [1 2 4] TEST: [3 5]
TRAIN: [3 4 1] TEST: [5 2]
TRAIN: [3 5 1] TEST: [2 4]
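To actually obtain three sets this way, here is a hedged sketch of applying ShuffleSplit twice; the variable names are only illustrative:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# First split: hold out 20% of the samples as the test set.
outer = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_val_idx, test_idx = next(outer.split(X))

# Second split: carve a validation set out of the remaining indices.
inner = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)  # 0.25 x 0.8 = 0.2
pos_train, pos_val = next(inner.split(train_val_idx))
train_idx, val_idx = train_val_idx[pos_train], train_val_idx[pos_val]

X_train, X_val, X_test = X[train_idx], X[val_idx], X[test_idx]
y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]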

Scikit-learn now provides a much more detailed way of doing cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators

There is also the option of KFold that might be what you are looking for:

>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_state = 12883823
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
>>> for train, test in rkf.split(X):
...     print("%s %s" % (train, test))
...
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]

They also now provide graphics that let you visualize the type of train/test split you are looking for (there are more types of train/test split than just random):

[Figure: scikit-learn's visualization of its cross-validation iterators]

Carlos Mougan
2

Here's another approach (assumes equal three-way split):

# randomly shuffle the dataframe
df = df.reindex(np.random.permutation(df.index))

# how many records make up one-third of the entire dataframe
third = int(len(df) / 3)

# Training set (the top third of the dataframe)
train = df[:third]

# Testing set (the middle third, i.e. the top half of the remaining two-thirds)
test = df[third:][:third]

# Validation set (the bottom third)
valid = df[-third:]

This can be made more concise but I kept it verbose for explanation purposes.

Vishal
2

Given train_frac=0.8, this function creates an 80% / 10% / 10% split:

import sklearn.model_selection

def data_split(examples, labels, train_frac, random_state=None):
    ''' https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    param examples:   Data to be split
    param labels:     Labels to be split
    param train_frac: Ratio of train set to whole dataset

    Randomly split dataset, based on these ratios:
        'train': train_frac
        'valid': (1-train_frac) / 2
        'test':  (1-train_frac) / 2

    Eg: passing train_frac=0.8 gives an 80% / 10% / 10% split
    '''

    assert train_frac >= 0 and train_frac <= 1, "Invalid training set fraction"

    X_train, X_tmp, Y_train, Y_tmp = sklearn.model_selection.train_test_split(
                                        examples, labels, train_size=train_frac, random_state=random_state)

    X_val, X_test, Y_val, Y_test   = sklearn.model_selection.train_test_split(
                                        X_tmp, Y_tmp, train_size=0.5, random_state=random_state)

    return X_train, X_val, X_test,  Y_train, Y_val, Y_test
Tom Hale
2

How about using NumPy's random choice?

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

def ttv_split(X, y=None, train_size=.6, test_size=.2, validation_size=.2, random_state=42):
    """ Basic approach using np.random.choice """
    np.random.seed(random_state)
    X = pd.DataFrame(X, columns=["col_" + str(i) for i in range(X.shape[1])])
    size = sum((train_size, test_size, validation_size))
    n_samples = X.shape[0]
    if size != 1:
        return f"Size of the dataset must sum up to 100% instead: {size} correct and try again"
    else:
        split_series = np.random.choice(a=["train", "test", "validation"],
                                        p=[train_size, test_size, validation_size],
                                        size=n_samples)
        split_series = pd.Series(split_series)

        X_train, X_test, X_validation = (X.iloc[split_series[split_series == "train"].index, :],
                                         X.iloc[split_series[split_series == "test"].index, :],
                                         X.iloc[split_series[split_series == "validation"].index, :])

        if y is not None:
            y = pd.DataFrame(y, columns=["target"])
            y_train, y_test, y_validation = (y.iloc[split_series[split_series == "train"].index, :],
                                             y.iloc[split_series[split_series == "test"].index, :],
                                             y.iloc[split_series[split_series == "validation"].index, :])
            return X_train, X_test, X_validation, y_train, y_test, y_validation
        else:
            return X_train, X_test, X_validation


X, y = load_iris(return_X_y=True)

X_train, X_test, X_validation, y_train, y_test, y_validation = ttv_split(X, y)

Multivac
0

All the answers I see work only if you split two arrays (X and y), which is usually the case, but I found myself needing to split more than two arrays. Therefore I wrote the following function, which can handle an arbitrary number of arrays:

from sklearn.model_selection import train_test_split

def train_test_valid_split(*arrays, test_size: float, valid_size: float, **kwargs):
    first_split = train_test_split(*arrays, test_size=test_size, **kwargs)
    testing_data = first_split[1::2]
    if valid_size == 0:
        training_data = first_split[::2]
        validation_data = []
    else:
        training_validation_data = train_test_split(*first_split[::2], test_size=(valid_size / (1 - test_size)),
                                                    **kwargs)
        training_data = training_validation_data[::2]
        validation_data = training_validation_data[1::2]
    return training_data + testing_data + validation_data

0

The easiest way I could think of is to map split fractions to array indices as follows:

train_set = data[:int((len(data)+1)*train_fraction)]
test_set = data[int((len(data)+1)*train_fraction):int((len(data)+1)*(train_fraction+test_fraction))]
val_set = data[int((len(data)+1)*(train_fraction+test_fraction)):]

where data has been shuffled beforehand, e.g. with random.shuffle(data) (note that shuffling happens in place, so there is no need to reassign).

Coddy
0

If the objective is to have the same sizes for the validation and test sets, a possible solution is to employ the fact that the train_size (and test_size) parameter of train_test_split may take fractional as well as integer values:

train_size : float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

This could be implemented as:

q = 0.2
X1, X_test, y1, y_test = train_test_split(X, y, test_size=q)
X_train, X_val, y_train, y_val = train_test_split(X1, y1, test_size=y_test.size)
Roger V.
0

Run it twice. Here is the math for the 2nd test_size.

Let's say I want {train:0.67, validation:0.13, test:0.20}

The first test_size is 20% which leaves 80% of the original data to be split into validation and training data.

(1.0/(1.0-test_size))*validation_size = second_test_size

(1.0/(1.0-0.20))*0.13 = 0.1625

Also, look into the stratify parameter as that is the real reason to use train_test_split as opposed to selecting random row indices.
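A hedged sketch of that two-step split with stratify, using the ratios above and assuming X and y as in the question:

from sklearn.model_selection import train_test_split

# First hold out 20% as the test set, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Then carve the validation set out of the remaining 80%:
# 0.13 / (1 - 0.20) = 0.1625 of what is left.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1625, stratify=y_train, random_state=0)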

LayneSadler
-1
from sklearn.model_selection import train_test_split

def train_val_test_split(data, train_ratio=0.75, validation_ratio=0.15, test_ratio=0.10, pred_col='pred', random_state=0):
    dataX = data.drop(columns=pred_col)
    dataY = data[pred_col]
    if (train_ratio + validation_ratio + test_ratio) != 1.0:
        raise Exception("ratios don't add up do 1") 

    x_train, x_test, y_train, y_test = train_test_split(
        dataX, dataY, test_size=1 - train_ratio, random_state=random_state
        )

    x_val, x_test, y_val, y_test = train_test_split(
        x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio), random_state=random_state
        ) 
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)

# df and seed are assumed to be defined elsewhere
train, val, test = (x_train, y_train), (x_val, y_val), (x_test, y_test) = train_val_test_split(df, pred_col='Life expectancy', random_state=seed)
-1
import numpy as np
import pandas as pd

#length of data 
N = 10
scale=2


#generated random data
X, y = np.arange(N*scale).reshape((N, scale)), np.arange(N)

#Works for pandas dataframe too
#You can download titanic.csv from here 
#https://github.com/fuwiak/faster_ds/blob/master/sample_data/titanic.csv

#df = pd.read_csv("titanic.csv", sep="\t")
#X=df[df.columns.difference(["Survived"])]
#y=df["Survived"]



def train_test_val(X, y, train_ratio, test_ratio, val_ratio):
    assert sum([train_ratio, test_ratio, val_ratio])==1.0, "wrong given ratio, all ratios have to sum to 1.0"
    assert X.shape[0]==len(y), "X and y shape mismatch"

    ind_train = int(round(X.shape[0]*train_ratio))
    ind_test = int(round(X.shape[0]*(train_ratio+test_ratio)))

    X_train = X[:ind_train]
    X_test = X[ind_train:ind_test]
    X_val = X[ind_test:]

    y_train = y[:ind_train]
    y_test = y[ind_train:ind_test]
    y_val = y[ind_test:]

    return X_train, X_test, X_val, y_train, y_test, y_val
# put ratio as you wish
X_train, X_test, X_val, y_train, y_test, y_val=train_test_val(X, y, 0.8, 0.1, 0.1) 
fuwiak