
I have x_data and labels separately. How can I combine them and load them into the model using torch.utils.data.DataLoader?

I have a dataset that I created: the training data has 20k samples, and the labels are stored separately. Let's say I want to feed this dataset to the model, shuffling it each time and using whatever batch size I prefer. DataLoader does that. How can I combine the samples and labels and pass them to DataLoader so that I can train the model in PyTorch?


2 Answers


Assuming both x_data and labels are lists or NumPy arrays, you can pair each sample with its label and pass the resulting list straight to DataLoader:

import torch

# Pair each sample with its label
train_data = []
for i in range(len(x_data)):
    train_data.append([x_data[i], labels[i]])

# Shuffle and batch the (sample, label) pairs
trainloader = torch.utils.data.DataLoader(train_data, shuffle=True, batch_size=100)

# Grab one batch to check the shapes
i1, l1 = next(iter(trainloader))
print(i1.shape)
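
Since the end goal is training, here is a minimal sketch of how you might consume these batches in a training loop. The model, loss, optimizer, and the input/output sizes (10 features, 2 classes) are placeholders I am assuming for illustration, not something given in the question:

import torch
import torch.nn as nn

# Placeholder model, loss, and optimizer -- the sizes (10 input features,
# 2 classes) are assumptions about your data, not given above.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    for inputs, targets in trainloader:
        optimizer.zero_grad()               # clear gradients from the last step
        outputs = model(inputs.float())     # forward pass on one batch
        loss = criterion(outputs, targets.long())
        loss.backward()                     # backpropagate
        optimizer.step()                    # update the parameters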

I think the standard way is to create a Dataset object from the arrays and pass it to the DataLoader.

One solution is to inherit from the Dataset class and define a custom class that implements __len__() and __getitem__(), with X and y passed to __init__(self, X, y).
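
A minimal sketch of such a class might look like this (the class name, the tensor conversions, and the dtypes are my own choices, not prescribed by the question):

import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    """Wraps two array-likes (X, y) so DataLoader can index and batch them."""

    def __init__(self, X, y):
        # Convert once up front; float features and as-is labels are an assumption
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y)

    def __len__(self):
        # Number of samples = length of the first dimension
        return len(self.X)

    def __getitem__(self, i):
        # Return the i-th (sample, label) pair
        return self.X[i], self.y[i]

# Example usage (assuming x_data and labels from the question):
# dataset = ArrayDataset(x_data, labels)
# loader = DataLoader(dataset, batch_size=100, shuffle=True)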

For your simple case with two arrays, where __getitem__() does nothing more than return the values in row i, you can also transform the arrays into Tensor objects and pass them to TensorDataset.

Run the following code for a self-contained example.

# Create a dataset like the one you describe
from sklearn.datasets import make_classification
X, y = make_classification()

# Load the necessary PyTorch packages
from torch.utils.data import DataLoader, TensorDataset
from torch import Tensor

# Create a dataset from several tensors with matching first dimension
# Samples will be drawn from the first dimension (rows)
dataset = TensorDataset(Tensor(X), Tensor(y))

# Create a data loader from the dataset
# Type of sampling and batch size are specified at this step
loader = DataLoader(dataset, batch_size=3)

# Quick test
next(iter(loader))