
Is there a more efficient way of retrieving batches from an HDF5 dataset?

I have a Dataset class for PyTorch data loading. It retrieves items from an HDF5 archive (150k samples) before I feed them into a DataLoader and train a small one-hidden-layer autoencoder. However, when I try to train my network, nothing happens: there is no GPU utilization. I am using 4 CPUs and 2 GPUs to start with.

My batch size is 128 and I use 8 workers when I start training.

I have also followed PyTorch's DataParallel tutorial. Below is my code for the HDF5 Dataset class.

import torch.multiprocessing as mp
# With the 'fork' start method, a file handle opened in the parent process
# would be inherited by every worker, which is unsafe for HDF5; hence the
# file is opened lazily per worker in _get_archive() below.
mp.set_start_method('fork')

from torch.utils import data
import h5py
import time

class Features_Dataset(data.Dataset):
    def __init__(self, file_path, phase):
        self.file_path = file_path
        self.archive = None
        self.phase = phase 
        # Open the file just long enough to read the dataset length
        with h5py.File(file_path, 'r', libver='latest', swmr=True) as f:
            self.length = len(f[self.phase + '_labels'])


    def _get_archive(self):
        # Open the HDF5 file on first access, i.e. once in each worker process
        if self.archive is None:
            self.archive = h5py.File(self.file_path, 'r', libver='latest', swmr=True)
            assert self.archive.swmr_mode
        return self.archive


    def __getitem__(self, index):
        archive = self._get_archive()
        label = archive[self.phase + '_labels']
        datum = archive[self.phase + '_all_arrays']
        path = archive[self.phase + '_img_paths']

        return datum[index], label[index], path[index]

    def __len__(self):
        return self.length

    def close(self):
        self.archive.close()

if __name__ == '__main__':
    train_dataset = Features_Dataset(file_path="featuresdata/train.hdf5", phase='train')
    trainloader = data.DataLoader(train_dataset, num_workers=8, batch_size=1)
    print(len(trainloader))
    myStart = time.time()
    # NB: don't call the loop variable `data`; that would shadow the
    # torch.utils.data module imported above.
    for i, (datum, label, path) in enumerate(trainloader):
        print(path)
    print('Elapsed: {:.2f}s'.format(time.time() - myStart))

This is my class for the autoencoder:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_embedded):
        super(AutoEncoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(6144, n_embedded))  # 6144-dim features -> bottleneck
        self.decoder = nn.Sequential(nn.Linear(n_embedded, 6144))  # and back

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)

        return encoded, decoded

This is how I initialize the model:

device = torch.device("cuda")
# Initialize / load checkpoint
model = AutoEncoder(2048)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
model = nn.DataParallel(model)
model.to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)

I make sure that my inputs are moved to the device too.
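For reference, the loop is along these lines (a simplified sketch, not the exact code; num_epochs and the float cast are placeholders):

num_epochs = 10  # placeholder value, not from the original post

for epoch in range(num_epochs):
    for datum, label, path in trainloader:
        datum = datum.float().to(device)   # move the batch onto the GPU(s)
        optimizer.zero_grad()
        encoded, decoded = model(datum)
        loss = criterion(decoded, datum)   # reconstruction loss against the input
        loss.backward()
        optimizer.step()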

Could the speed of retrieving the batches be the problem? Regarding the HDF5 Features_Dataset class, I am attempting to open the HDF5 file lazily rather than in __init__; however, I think calculating the length of the dataset there may be the issue...

The issue may be a bottleneck caused by the lazy loading. You can try loading all the data at init of the dataset (if you have enough resources). Then, at __getitem__, just return datum[index], label[index], path[index] from the already prepared arrays. Hope it helps. Good luck!
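A minimal sketch of that suggestion, reusing the dataset names from the question (the class name InMemoryFeaturesDataset is illustrative, not from the original post):

import h5py
from torch.utils import data

class InMemoryFeaturesDataset(data.Dataset):
    def __init__(self, file_path, phase):
        # One bulk read per HDF5 dataset: indexing with [...] returns a
        # plain NumPy array held entirely in memory.
        with h5py.File(file_path, 'r') as f:
            self.data = f[phase + '_all_arrays'][...]
            self.labels = f[phase + '_labels'][...]
            self.paths = f[phase + '_img_paths'][...]

    def __getitem__(self, index):
        # No HDF5 access here, just in-memory array indexing
        return self.data[index], self.labels[index], self.paths[index]

    def __len__(self):
        return len(self.labels)

With everything preloaded, each __getitem__ is a plain array lookup, so the workers no longer contend on HDF5 reads. Assuming float32 features, 150k samples of 6144 dims is roughly 3.7 GB, so check that it fits in RAM first.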
