
How does Pytorch Dataloader handle variable size data?

I have a dataset that looks like the one below. The first item on each line is the user ID, followed by the set of items clicked by that user.

0   24104   27359   6684
0   24104   27359
1   16742   31529   31485
1   16742   31529
2   6579    19316   13091   7181    6579    19316   13091
2   6579    19316   13091   7181    6579    19316
2   6579    19316   13091   7181    6579    19316   13091   6579
2   6579    19316   13091   7181    6579
4   19577   21608
4   19577   21608
4   19577   21608   18373
5   3541    9529
5   3541    9529
6   6832    19218   14144
6   6832    19218
7   9751    23424   25067   12606   26245   23083   12606

I define a custom dataset to handle my click log data.

import torch.utils.data as data
class ClickLogDataset(data.Dataset):
    def __init__(self, data_path):
        self.data_path = data_path
        self.uids = []
        self.streams = []

        # each line is tab-separated: user id, then the clicked item ids
        with open(self.data_path, 'r') as fdata:
            for row in fdata:
                row = row.strip('\n').split('\t')
                self.uids.append(int(row[0]))
                self.streams.append(list(map(int, row[1:])))

    def __len__(self):
        return len(self.uids)

    def __getitem__(self, idx):
        uid, stream = self.uids[idx], self.streams[idx]
        return uid, stream
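Indexing the dataset directly returns a plain (int, list) pair, so nothing is padded at this point; a quick check, assuming data_path points at the tab-separated file shown above:

ds = ClickLogDataset(data_path)
print(len(ds))   # 16 rows in the sample above
print(ds[0])     # (0, [24104, 27359, 6684]) -- a (uid, stream) tuple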

Then I use a DataLoader to retrieve mini-batches from the data for training.

from torch.utils.data.dataloader import DataLoader
clicklog_dataset = ClickLogDataset(data_path)
clicklog_data_loader = DataLoader(dataset=clicklog_dataset, batch_size=16)

for uid_batch, stream_batch in clicklog_data_loader:
    print(uid_batch)
    print(stream_batch)

The code above returns something different from what I expected: I want stream_batch to be a 2D integer tensor with 16 rows (one per sample). However, what I get is a list whose only element is a 1D tensor of length 16, like below. Why is that?

#stream_batch
[tensor([24104, 24104, 16742, 16742,  6579,  6579,  6579,  6579, 19577, 19577,
        19577,  3541,  3541,  6832,  6832,  9751])]

As @Jatentaki suggested, I wrote my custom collate function and it worked fine.

import collections.abc

import torch

def get_max_length(x):
    return len(max(x, key=len))

def pad_sequence(seq):
    def _pad(_it, _max_len):
        # left-pad with zeros up to the longest sequence in the batch
        return [0] * (_max_len - len(_it)) + _it
    max_len = get_max_length(seq)
    return [_pad(it, max_len) for it in seq]

def custom_collate(batch):
    transposed = zip(*batch)  # [(uid, stream), ...] -> (uids, streams)
    lst = []
    for samples in transposed:
        if isinstance(samples[0], int):
            lst.append(torch.LongTensor(samples))
        elif isinstance(samples[0], float):
            lst.append(torch.DoubleTensor(samples))
        elif isinstance(samples[0], collections.abc.Sequence):
            lst.append(torch.LongTensor(pad_sequence(samples)))
    return lst

clicklog_dataset = ClickLogDataset(data_path)
clicklog_data_loader = torch.utils.data.DataLoader(dataset=clicklog_dataset,
                                                   batch_size=batch_size,
                                                   collate_fn=custom_collate,
                                                   shuffle=False)
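With custom_collate in place, each stream_batch comes back as a single 2D LongTensor, left-padded with zeros to the longest stream in the batch; for example:

for uid_batch, stream_batch in clicklog_data_loader:
    # uid_batch: 1D LongTensor of shape (batch_size,)
    # stream_batch: 2D LongTensor of shape (batch_size, max_stream_len_in_batch),
    #               left-padded with zeros by pad_sequence above
    print(uid_batch.shape, stream_batch.shape)
    break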

So how do you handle the fact that your samples are of different lengths? torch.utils.data.DataLoader has a collate_fn parameter which is used to transform a list of samples into a batch. The default collate function is what produced the output you saw above; it cannot stack variable-length sequences into a single tensor. You can write your own collate_fn which, for instance, 0-pads the input, truncates it to some predefined length, or applies any other operation of your choice.
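One way to write such a collate_fn is to let torch.nn.utils.rnn.pad_sequence do the padding; here is a minimal sketch for the (uid, stream) pairs above (pad_collate is just an illustrative name, and batch_first=True gives the (batch, max_len) layout the question asks for):

import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # batch is a list of (uid, stream) tuples from ClickLogDataset.__getitem__
    uids = torch.LongTensor([uid for uid, _ in batch])
    streams = [torch.LongTensor(stream) for _, stream in batch]
    # right-pad with 0 up to the longest stream in the batch -> (batch_size, max_len)
    streams = pad_sequence(streams, batch_first=True, padding_value=0)
    return uids, streams

loader = DataLoader(clicklog_dataset, batch_size=16, collate_fn=pad_collate)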

This is the way I do it:

def collate_fn_padd(batch):
    '''
    Pads a batch of variable-length sequences.

    note: it converts things to tensors manually here since the ToTensor transform
    assumes it takes in images rather than arbitrary tensors.
    '''
    # `device` is assumed to be defined elsewhere, e.g. device = torch.device('cuda')
    ## get sequence lengths
    lengths = torch.tensor([t.shape[0] for t in batch]).to(device)
    ## pad; pad_sequence defaults to batch_first=False, i.e. shape (max_len, batch_size)
    batch = [torch.Tensor(t).to(device) for t in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch)
    ## compute mask: True wherever the padded batch is non-zero
    mask = (batch != 0).to(device)
    return batch, lengths, mask

then I pass that to the dataloader class as a collate_fn.
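Wiring that up might look like the following sketch, with dataset and batch_size as placeholders:

loader = torch.utils.data.DataLoader(dataset,
                                     batch_size=batch_size,
                                     shuffle=True,
                                     collate_fn=collate_fn_padd)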


There seems to be a giant list of different posts on the PyTorch forum. Let me link to all of them. They all have their own answers and discussions. It doesn't seem to me that there is one "standard way to do it", but if there is an authoritative reference, please share.

It would be nice if the ideal answer mentioned

  • efficiency, e.g. whether to do the processing on the GPU with torch inside the collate function vs. in numpy

things of that sort.

List:

bucketing: https://discuss.pytorch.org/t/tensorflow-esque-bucket-by-sequence-length/41284 (see the sketch below)
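Since the linked thread only discusses the idea, here is a rough sketch of length bucketing (sort samples by stream length, batch neighbours together, then shuffle whole batches so each batch needs little padding). bucket_batches is an illustrative helper of my own, and pad_collate is the collate sketch from earlier:

import random
from torch.utils.data import DataLoader

def bucket_batches(dataset, batch_size):
    # sort indices by stream length so each batch contains similar lengths
    idx = sorted(range(len(dataset)), key=lambda i: len(dataset[i][1]))
    batches = [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]
    random.shuffle(batches)  # shuffle batches, not samples, to keep buckets intact
    return batches

loader = DataLoader(clicklog_dataset,
                    batch_sampler=bucket_batches(clicklog_dataset, batch_size=16),
                    collate_fn=pad_collate)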
