
PyTorch Dataset / Dataloader batching

I'm a little confused about the best practice for implementing a PyTorch data pipeline on time-series data.

I have an HDF5 file which I read using a custom Dataset class. It seems that I should return each data sample as a (features, targets) tuple, with the shape of each being (L, C) where L is seq_len and C is the number of channels - i.e. don't perform batching in the Dataset, just return the tuple for a single sample.
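For context, a minimal sketch of what such a Dataset could look like, assuming an h5py file containing 'features' and 'targets' arrays (those key names are illustrative, not from the original post):

import h5py
import torch
from torch.utils.data import Dataset

class HD5Dataset(Dataset):
    """Returns one (L, C) sample per index; the DataLoader adds the batch dim."""
    def __init__(self, path):
        self.file = h5py.File(path, 'r')
        self.features = self.file['features']   # assumed shape (num_samples, L, C)
        self.targets = self.file['targets']     # assumed shape (num_samples, ...)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x = torch.as_tensor(self.features[idx], dtype=torch.float32)
        y = torch.as_tensor(self.targets[idx], dtype=torch.float32)
        return x, y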

PyTorch modules seem to require a batch dimension, e.g. Conv1d expects (N, C, L).
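As a quick sanity check of those expected shapes (a standalone sketch, not from the question's code):

import torch
import torch.nn as nn

# Conv1d wants input of shape (N, C, L)
conv = nn.Conv1d(in_channels=3, out_channels=8, kernel_size=5)
x = torch.randn(4, 3, 100)   # N=4 samples, C=3 channels, L=100 time steps
print(conv(x).shape)         # torch.Size([4, 8, 96])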

I was under the impression that the DataLoader class would prepend the batch dimension, but it isn't; I'm getting data shaped (N, L).

from torch.utils.data import DataLoader

dataset = HD5Dataset(args.dataset)

train_dataloader = DataLoader(dataset,
                        batch_size=N,
                        shuffle=True,
                        pin_memory=is_cuda,
                        num_workers=num_workers)

for i, (x, y) in enumerate(train_dataloader):
    ...

In the code above the shape of x is (N, C), not (1, N, C), which causes the code below (from a public git repo) to fail on its first line.

def forward(self, x):
    """expected input shape is (N, L, C)"""
    x = x.transpose(1, 2).contiguous() # input should have dimension (N, C, L)

The documentation states that when automatic batching is enabled "it always prepends a new dimension as the batch dimension", which leads me to believe that automatic batching is disabled, but I don't understand why.
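For what it's worth, automatic batching is disabled only when batch_size (and batch_sampler) are None; with the defaults a new batch dimension is prepended, as this standalone sketch shows (using a toy in-memory Dataset as a stand-in for the HD5 one):

import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in for HD5Dataset: returns one (L, C) sample per index."""
    def __init__(self, n=20, L=10, C=3):
        self.data = torch.randn(n, L, C)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

loader = DataLoader(ToyDataset(), batch_size=4)      # automatic batching on (default)
print(next(iter(loader)).shape)                      # torch.Size([4, 10, 3])

loader = DataLoader(ToyDataset(), batch_size=None)   # automatic batching off
print(next(iter(loader)).shape)                      # torch.Size([10, 3])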

I've found a few things which seem to work. One option is to use the DataLoader's collate_fn, but a simpler option is to use a BatchSampler, i.e.

from torch.utils.data import DataLoader, BatchSampler, SequentialSampler
from sklearn.model_selection import train_test_split

dataset = HD5Dataset(args.dataset)
train, test = train_test_split(list(range(len(dataset))), test_size=.1)

train_dataloader = DataLoader(dataset,
                        pin_memory=is_cuda,
                        num_workers=num_workers,
                        sampler=BatchSampler(SequentialSampler(train), batch_size=len(train), drop_last=True)
                        )

test_dataloader = DataLoader(dataset,
                        pin_memory=is_cuda,
                        num_workers=num_workers,
                        sampler=BatchSampler(SequentialSampler(test), batch_size=len(test), drop_last=True)
                        )

for i, (x, y) in enumerate(train_dataloader):
    print(x, y)

This converts the dataset's (L, C) samples into a single batch of shape (1, L, C) (not particularly efficiently).
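For completeness, the collate_fn route mentioned above could look roughly like the sketch below. It assumes each sample is a (features, targets) pair of (L, C) tensors, and stack_collate is just an illustrative name; note that this essentially reproduces what the default collate already does.

import torch
from torch.utils.data import DataLoader

def stack_collate(batch):
    # batch is a list of (features, targets) tuples; stack along a new dim 0
    xs = torch.stack([b[0] for b in batch], dim=0)   # (N, L, C)
    ys = torch.stack([b[1] for b in batch], dim=0)
    return xs, ys

# train_dataloader = DataLoader(dataset, batch_size=N, shuffle=True,
#                               collate_fn=stack_collate)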

If you have a dataset of pairs of tensors (x, y), where each x is of shape (C, L), then:

import torch
import torch.utils.data as data_utils

N, C, L = 5, 3, 10
dataset = [(torch.randn(C, L), torch.ones(1)) for i in range(50)]
dataloader = data_utils.DataLoader(dataset, batch_size=N)

for i, (x,y) in enumerate(dataloader):
    print(x.shape)

This will produce 50/N = 10 batches of shape (N, C, L) for x:

torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
