PyTorch: Batch size is missing in data after torch.utils.data.random_split() is used on dataloader.dataset

I used random_split() to divide my data into train and test sets, and I observed that if the split is done after the DataLoader is created, the batch dimension is missing when I fetch a batch of data from the loader.

import torch
from torchvision import transforms, datasets
from torch.utils.data import random_split

# Normalize the data
transform_image = transforms.Compose([
  transforms.Resize((240, 320)),
  transforms.ToTensor(),
  transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

data = '/data/imgs/train'

def load_dataset():
  data_path = data
  main_dataset = datasets.ImageFolder(
    root=data_path,
    transform=transform_image
  )

  loader = torch.utils.data.DataLoader(
    dataset=main_dataset,
    batch_size=64,
    num_workers=0,
    shuffle=True
  )

  # Dataset has 22424 data points
  trainloader, testloader = random_split(loader.dataset, [21000, 1424])

  return trainloader, testloader

trainloader, testloader = load_dataset()

Now to get a single batch of images from the train and test loaders:

images, labels = next(iter(trainloader))
images.shape
# %%
len(trainloader)

# %%
images_test, labels_test = next(iter(testloader))
images_test.shape

# %%
len(testloader)

The output that I get does not have the batch size for the train or test batches. The output dims should be [batch x channel x H x W], but I get [channel x H x W].

Output:

(screenshot of the output)

But if I create the split from the dataset and then make two data loaders using the splits, I get the batch size in the output.

def load_dataset():
    data_path = data
    main_dataset = datasets.ImageFolder(
      root=data_path,
      transform=transform_image
    )
    # Dataset has 22424 data points
    train_data, test_data = random_split(main_dataset, [21000, 1424])

    trainloader = torch.utils.data.DataLoader(
      dataset=train_data,
      batch_size=64,
      num_workers=0,
      shuffle=True
    )

    testloader = torch.utils.data.DataLoader(
      dataset=test_data,
      batch_size=64,
      num_workers=0,
      shuffle=True
    )

    return trainloader, testloader

trainloader, testloader = load_dataset()

On running the same 4 commands to get a single train and test batch:

(screenshot of the output, now including the batch dimension)

Is the first approach wrong? The lengths show that the data has been split, so why do I not see the batch size?

The first approach is wrong.

Only DataLoader instances return batches of items; Dataset-like instances don't.

When you call random_split you pass it loader.dataset, which is just a reference to main_dataset (not a DataLoader). The result is that trainloader and testloader are Datasets, not DataLoaders. In fact, you discard loader, your only DataLoader, when you return from load_dataset.
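One quick way to see this, assuming the variables from the first snippet are still in scope:

# random_split returns Subset objects (Datasets), not DataLoaders
train_subset, test_subset = random_split(loader.dataset, [21000, 1424])
print(type(train_subset))  # <class 'torch.utils.data.dataset.Subset'>
print(type(loader))        # <class 'torch.utils.data.dataloader.DataLoader'>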

The second version is what you should do to get two separate DataLoaders.

You are splitting a dataset into two. This gives you two Datasets which, when iterated over, return single image tensors of shape (channel, height, width), i.e. (3, h, w); it does not, by default, give you a DataLoader around these datasets.
What you did next is in fact the right next step: create a DataLoader around each dataset. You define the batch size in the DataLoader, and iterating over a DataLoader then returns tensors of shape (batch_size, channel, height, width), as sketched below.
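Here is a minimal, self-contained sketch of the difference, using a dummy TensorDataset in place of the actual ImageFolder data (the sizes are illustrative):

import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Dummy stand-in for the ImageFolder dataset: 100 "images" of shape (3, 240, 320)
dummy = TensorDataset(torch.randn(100, 3, 240, 320), torch.randint(0, 2, (100,)))
train_set, test_set = random_split(dummy, [80, 20])

# Indexing/iterating a Dataset yields single samples: (3, 240, 320)
x, y = train_set[0]
print(x.shape)   # torch.Size([3, 240, 320])

# A DataLoader adds the batch dimension: (16, 3, 240, 320)
xb, yb = next(iter(DataLoader(train_set, batch_size=16, shuffle=True)))
print(xb.shape)  # torch.Size([16, 3, 240, 320])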

Even if you intend to feed the model batches of size one, the tensor still needs a batch dimension. For this, you can either use a DataLoader with batch_size=1 or add a dummy dimension at the front with torch.unsqueeze(X, 0) (or X.unsqueeze(0)) for an image X, making the tensor of shape (1, 3, h, w).
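A tiny illustration, using the (3, 240, 320) image size from the Resize transform above:

import torch

x = torch.randn(3, 240, 320)  # a single image tensor (C, H, W)
x = x.unsqueeze(0)            # add a batch dimension at position 0
print(x.shape)                # torch.Size([1, 3, 240, 320])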
