206

How could I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test and y_val with scikit-learn?

As far as I know, sklearn.model_selection.train_test_split is only capable of splitting into two, not into three...

Arun
Hendrik

18 Answers

246

You could just use sklearn.model_selection.train_test_split twice: first to split into train and test, and then to split train again into train and validation. Something like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)  # 0.25 x 0.8 = 0.2
hh32
82

There is a great answer to this question over on SO that uses numpy and pandas.

The command (see the answer for the discussion):

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

produces a 60%, 20%, 20% split for training, validation and test sets.

0_0
50

Adding to @hh32's answer, while respecting any predefined proportions such as (75, 15, 10):

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)

# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))

print(x_train, x_val, x_test)

Andrei Florea
11

You can use train_test_split twice. I think this is the most straightforward way.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

In this way, the train, val and test sets will be 60%, 20% and 20% of the dataset respectively.

Stephen Rauch
David Jung
7

Most often you will not split the data just once: in a first step you split your data into a training and a test set, and subsequently you perform a parameter search that incorporates more complex splitting schemes, such as k-fold cross-validation or leave-one-out (LOO), as in the sketch below.
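A minimal sketch of that workflow, assuming X and y as in the question; the SVC estimator and the parameter grid are only placeholders:

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.svm import SVC

# Hold out a test set once.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Search hyperparameters with 5-fold cross-validation on the training portion,
# instead of a single fixed validation split.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=KFold(n_splits=5))
search.fit(X_train, y_train)

print(search.best_params_, search.score(X_test, y_test))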

JLT
7

Extension of @hh32's answer with preserved ratios.

# Defines ratios, w.r.t. whole dataset.
ratio_train = 0.8
ratio_val = 0.1
ratio_test = 0.1

# Produces test split.
x_remaining, x_test, y_remaining, y_test = train_test_split(
    x, y, test_size=ratio_test)

# Adjusts val ratio, w.r.t. remaining dataset.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining

# Produces train and val splits.
x_train, x_val, y_train, y_val = train_test_split(
    x_remaining, y_remaining, test_size=ratio_val_adjusted)

Since the remaining dataset is reduced after the first split, new ratios for the reduced dataset must be calculated:

$ R_{new} = \frac{R_{old}}{R_{remaining}}$

Jorge Barrios
4

The best answer above does not mention that if you separate twice using train_test_split without changing the partition sizes, you won't get the initially intended partition:

x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

The proportions of the validation and test sets within x_remain then change and can be computed as

new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0 
new_val_size = 1.0 - new_test_size

x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

This way, all the initially intended partition sizes are preserved.

Stephen Rauch
A.Ametov
2

I would like to summarize all the good and elegant answers.

sklearn.model_selection.train_test_split is the de facto option for a train/validation split. However, if you want a train, val and test split, then the following code can be used.

(Extending answer from 0_0)

  1. Let's say you want to do a split of 75, 15 and 10 percent. If you have data and labels in a pandas DataFrame, then use the following:
# shuffle and split
train_df, val_df, test_df = np.split(df.sample(frac=1), [int(.75*len(df)), int(.9*len(df))])
  2. Let's say you have data and labels in two different NumPy arrays:
data = np.arange(1000)
data = np.reshape(data, (100, 10))  # 100 examples with 10 features
labels = np.arange(100)             # assuming 100 different categories

print(data[3])
print(labels[3])

idx = np.random.permutation(len(data))  # get shuffled indices
x, y = data[idx], labels[idx]           # uniform shuffle of data and labels

x_train, x_val, x_test = np.split(x, [int(len(x)*0.75), int(len(x)*0.9)])  # split of 75:15:10
y_train, y_val, y_test = np.split(y, [int(len(y)*0.75), int(len(y)*0.9)])

print(len(x_train), len(x_val), len(x_test))
print(x_train[:3])
print(y_train[:3])

2

The most pythonic way of doing this would be to use ShuffleSplit (and to run it twice, as a nested loop, to get the third split):

>>> import numpy as np
>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
>>> y = np.array([1, 2, 1, 2, 1, 2])
>>> rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
>>> rs.get_n_splits(X)
5
>>> print(rs)
ShuffleSplit(n_splits=5, random_state=0, test_size=0.25, train_size=None)
>>> for train_index, test_index in rs.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [1 3 0 4] TEST: [5 2]
TRAIN: [4 0 2 5] TEST: [1 3]
TRAIN: [1 2 4 0] TEST: [3 5]
TRAIN: [3 4 1 0] TEST: [5 2]
TRAIN: [3 5 1 0] TEST: [2 4]
>>> rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=.25,
...                   random_state=0)
>>> for train_index, test_index in rs.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [1 3 0] TEST: [5 2]
TRAIN: [4 0 2] TEST: [1 3]
TRAIN: [1 2 4] TEST: [3 5]
TRAIN: [3 4 1] TEST: [5 2]
TRAIN: [3 5 1] TEST: [2 4]
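To actually obtain three sets this way, here is a hedged sketch of applying ShuffleSplit twice; the variable names are only illustrative:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# First split: hold out 20% of the samples as the test set.
outer = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_val_idx, test_idx = next(outer.split(X))

# Second split: carve a validation set out of the remaining indices.
inner = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)  # 0.25 x 0.8 = 0.2
pos_train, pos_val = next(inner.split(train_val_idx))
train_idx, val_idx = train_val_idx[pos_train], train_val_idx[pos_val]

X_train, X_val, X_test = X[train_idx], X[val_idx], X[test_idx]
y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]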

Scikit-learn now provides a much more detailed way of doing cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators

There is also the option of KFold that might be what you are looking for:

>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_state = 12883823
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
>>> for train, test in rkf.split(X):
...     print("%s %s" % (train, test))
...
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]

They also now provide graphics that let you visualize the type of train/test split you are looking for (there are more types of train/test split than just random):

[Figure: scikit-learn's visualization of its cross-validation iterators]

Carlos Mougan
2

Here's another approach (assumes equal three-way split):

# randomly shuffle the dataframe
df = df.reindex(np.random.permutation(df.index))

# how many records make up one-third of the entire dataframe
third = int(len(df) / 3)

# Training set (the top third of the dataframe)
train = df[:third]

# Testing set (the middle third, i.e. the top half of the remaining two-thirds)
test = df[third:][:third]

# Validation set (the bottom third)
valid = df[-third:]

This can be made more concise but I kept it verbose for explanation purposes.

Vishal
2

Given train_frac=0.8, this function creates an 80% / 10% / 10% split:

import sklearn.model_selection

def data_split(examples, labels, train_frac, random_state=None):
    ''' https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    param examples:   Data to be split
    param labels:     Labels to be split
    param train_frac: Ratio of train set to whole dataset

    Randomly split dataset, based on these ratios:
        'train': train_frac
        'valid': (1-train_frac) / 2
        'test':  (1-train_frac) / 2

    Eg: passing train_frac=0.8 gives an 80% / 10% / 10% split
    '''

    assert train_frac >= 0 and train_frac <= 1, "Invalid training set fraction"

    X_train, X_tmp, Y_train, Y_tmp = sklearn.model_selection.train_test_split(
                                        examples, labels, train_size=train_frac, random_state=random_state)

    X_val, X_test, Y_val, Y_test   = sklearn.model_selection.train_test_split(
                                        X_tmp, Y_tmp, train_size=0.5, random_state=random_state)

    return X_train, X_val, X_test,  Y_train, Y_val, Y_test
Tom Hale
2

How about using NumPy's random choice?

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

def ttv_split(X, y=None, train_size=.6, test_size=.2, validation_size=.2, random_state=42):
    """ Basic approach using np.random.choice """
    np.random.seed(random_state)
    X = pd.DataFrame(X, columns=["col_" + str(i) for i in range(X.shape[1])])
    size = sum((train_size, test_size, validation_size))
    n_samples = X.shape[0]
    if size != 1:
        return f"Size of the dataset must sum up to 100% instead: {size} correct and try again"
    else:
        split_series = np.random.choice(a=["train", "test", "validation"],
                                        p=[train_size, test_size, validation_size],
                                        size=n_samples)
        split_series = pd.Series(split_series)

        X_train, X_test, X_validation = (X.iloc[split_series[split_series == "train"].index, :],
                                         X.iloc[split_series[split_series == "test"].index, :],
                                         X.iloc[split_series[split_series == "validation"].index, :])

        if y is not None:
            y = pd.DataFrame(y, columns=["target"])
            y_train, y_test, y_validation = (y.iloc[split_series[split_series == "train"].index, :],
                                             y.iloc[split_series[split_series == "test"].index, :],
                                             y.iloc[split_series[split_series == "validation"].index, :])
            return X_train, X_test, X_validation, y_train, y_test, y_validation
        else:
            return X_train, X_test, X_validation


X, y = load_iris(return_X_y=True)

X_train, X_test, X_validation, y_train, y_test, y_validation = ttv_split(X, y)

Multivac
0

All the answers I see work only if you split two arrays (X and y), which is usually the case, but I found myself needing to split more than two arrays. Therefore I wrote the following function, which can handle an arbitrary number of arrays:

from sklearn.model_selection import train_test_split

def train_test_valid_split(*arrays, test_size: float, valid_size: float, **kwargs):
    first_split = train_test_split(*arrays, test_size=test_size, **kwargs)
    testing_data = first_split[1::2]
    if valid_size == 0:
        training_data = first_split[::2]
        validation_data = []
    else:
        training_validation_data = train_test_split(*first_split[::2], test_size=(valid_size / (1 - test_size)),
                                                    **kwargs)
        training_data = training_validation_data[::2]
        validation_data = training_validation_data[1::2]
    return training_data + testing_data + validation_data

0

The easiest way I could think of is to map split fractions to array indices as follows:

train_set = data[:int((len(data)+1)*train_fraction)]
test_set = data[int((len(data)+1)*train_fraction):int((len(data)+1)*(train_fraction+test_fraction))]
val_set = data[int((len(data)+1)*(train_fraction+test_fraction)):]

where data has been shuffled beforehand, e.g. with random.shuffle(data) (note that shuffling happens in place, so there is no need to reassign).

Coddy
0

If the objective is to have the same sizes for the validation and test sets, a possible solution is to employ the fact that the train_size (and test_size) parameter of train_test_split may take fractional as well as integer values:

train_size : float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

This could be implemented as:

q = 0.2
X1, X_test, y1, y_test = train_test_split(X, y, test_size=q)
X_train, X_val, y_train, y_val = train_test_split(X1, y1, test_size=y_test.size)
Roger V.
0

Run it twice. Here is the math for the 2nd test_size.

Let's say I want {train:0.67, validation:0.13, test:0.20}

The first test_size is 20% which leaves 80% of the original data to be split into validation and training data.

(1.0/(1.0-test_size))*validation_size = second_test_size

(1.0/(1.0-0.20))*0.13 = 0.1625

Also, look into the stratify parameter as that is the real reason to use train_test_split as opposed to selecting random row indices.
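A hedged sketch of that two-step split with stratify, using the ratios above and assuming X and y as in the question:

from sklearn.model_selection import train_test_split

# First hold out 20% as the test set, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Then carve the validation set out of the remaining 80%:
# 0.13 / (1 - 0.20) = 0.1625 of what is left.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1625, stratify=y_train, random_state=0)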

LayneSadler
-1
from sklearn.model_selection import train_test_split

def train_val_test_split(data, train_ratio=0.75, validation_ratio=0.15, test_ratio=0.10, pred_col='pred', random_state=0):
    dataX = data.drop(columns=pred_col)
    dataY = data[pred_col]
    if (train_ratio + validation_ratio + test_ratio) != 1.0:
        raise Exception("ratios don't add up do 1") 

    x_train, x_test, y_train, y_test = train_test_split(
        dataX, dataY, test_size=1 - train_ratio, random_state=random_state
        )

    x_val, x_test, y_val, y_test = train_test_split(
        x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio), random_state=random_state
        ) 
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)

# df and seed are assumed to be defined elsewhere
train, val, test = (x_train, y_train), (x_val, y_val), (x_test, y_test) = train_val_test_split(df, pred_col='Life expectancy', random_state=seed)
-1
import numpy as np
import pandas as pd

#length of data 
N = 10
scale=2


#generated random data
X, y = np.arange(N*scale).reshape((N, scale)), np.arange(N)

#Works for pandas dataframe too
#You can download titanic.csv from here 
#https://github.com/fuwiak/faster_ds/blob/master/sample_data/titanic.csv

#df = pd.read_csv("titanic.csv", sep="\t")
#X=df[df.columns.difference(["Survived"])]
#y=df["Survived"]



def train_test_val(X, y, train_ratio, test_ratio, val_ratio):
    assert sum([train_ratio, test_ratio, val_ratio])==1.0, "wrong given ratio, all ratios have to sum to 1.0"
    assert X.shape[0]==len(y), "X and y shape mismatch"

    ind_train = int(round(X.shape[0]*train_ratio))
    ind_test = int(round(X.shape[0]*(train_ratio+test_ratio)))

    X_train = X[:ind_train]
    X_test = X[ind_train:ind_test]
    X_val = X[ind_test:]

    y_train = y[:ind_train]
    y_test = y[ind_train:ind_test]
    y_val = y[ind_test:]

    return X_train, X_test, X_val, y_train, y_test, y_val
# put ratio as you wish
X_train, X_test, X_val, y_train, y_test, y_val=train_test_val(X, y, 0.8, 0.1, 0.1) 
fuwiak