Pytorch dataloader for sentences

I have collected a small dataset for binary text classification and my goal is to train a model with the method proposed by Convolutional Neural Networks for Sentence Classification.

I started my implementation by using torch.utils.data.Dataset. Essentially, every sample in my dataset my_data looks like this (as an example):

{"words":[0,1,2,3,4],"label":1},
{"words":[4,9,20,30,4,2,3,4,1],"label":0}

Next I took a look at Writing custom dataloaders with pytorch, using:

dataloader = DataLoader(my_data, batch_size=2,
                    shuffle=False, num_workers=4)

I would suspect that enumerating over a batch would yield something like the following:

{"words":[[0,1,2,3,4],[4,9,20,30,4,2,3,4,1]],"labels":[1,0]}

However, it is more like this:

{"words":[[0,4],[1,9],[2,20],[3,30],[4,4]],"label":[1,0]}

I guess it has something to do with the fact that the samples are not of equal size. Do they need to be the same size, and if so, how can I achieve that? For people who know this paper, what does your training data look like?

Edit:

import json
from string import punctuation

import torch
from nltk.corpus import stopwords as nltk_stopwords
from nltk.tokenize import word_tokenize
from torch.utils.data import Dataset

# The snippet references `stopwords` and `punctuation`, which were defined
# elsewhere; NLTK's English stopword list and string.punctuation are assumed here.
stopwords = set(nltk_stopwords.words("english"))


class CustomDataset(Dataset):
    def __init__(self, path_to_file, max_size=10, transform=None):
        with open(path_to_file) as f:
            self.data = json.load(f)
        self.transform = transform
        self.vocab = self.build_vocab(self.data)
        self.word2idx, self.idx2word = self.word2index(self.vocab)

    def get_vocab(self):
        return self.vocab

    def get_word2idx(self):
        return self.word2idx, self.idx2word

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        inputs_ = word_tokenize(self.data[idx][0])
        inputs_ = [w for w in inputs_ if w not in stopwords]
        inputs_ = [w for w in inputs_ if w not in punctuation]
        inputs_ = [self.word2idx[w] for w in inputs_]  # convert words to indices

        label = {"positive": 1, "negative": 0}
        label_ = label[self.data[idx][1]]  # convert label to 0|1

        sample = {"words": inputs_, "label": label_}

        if self.transform:
            sample = self.transform(sample)

        return sample

    def build_vocab(self, corpus):
        # count how often each token occurs in the corpus
        word_count = {}
        for sentence in corpus:
            tokens = word_tokenize(sentence[0])
            for token in tokens:
                if token not in word_count:
                    word_count[token] = 1
                else:
                    word_count[token] += 1
        return word_count

    def word2index(self, word_count):
        word_index = {w: i for i, w in enumerate(word_count)}
        idx_word = {i: w for i, w in enumerate(word_count)}
        return word_index, idx_word

As you correctly suspected, this is mostly a problem of different tensor shapes. Luckily, PyTorch offers several solutions of varying simplicity to achieve what you want (batch sizes >= 1 for text samples):

  • The highest-level solution is probably torchtext, which provides several solutions out of the box to load (custom) datasets for NLP tasks. If you can make your training data fit any one of the described loaders, this is probably the recommended option, as there is decent documentation and there are several examples.
  • If you prefer to build a solution yourself, there are padding utilities like torch.nn.utils.rnn.pad_sequence, in combination with torch.nn.utils.rnn.pack_padded_sequence, or the combination of both (torch.nn.utils.rnn.pack_sequence). This generally allows you a lot more flexibility, which may or may not be something that you require; a minimal sketch of this approach follows this list.
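As an illustration of the second option, here is a minimal sketch of a custom collate_fn built on pad_sequence, reusing the sample layout and the my_data / DataLoader call from the question; the function name pad_collate, the padding value 0, and the extra "lengths" entry are assumptions of this sketch, not something prescribed by the paper.

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of dataset samples, e.g. {"words": [0, 1, 2, 3, 4], "label": 1}
    words = [torch.tensor(sample["words"], dtype=torch.long) for sample in batch]
    labels = torch.tensor([sample["label"] for sample in batch], dtype=torch.long)
    lengths = torch.tensor([len(w) for w in words], dtype=torch.long)
    # pad every sequence to the length of the longest one in this batch
    padded = pad_sequence(words, batch_first=True, padding_value=0)
    return {"words": padded, "lengths": lengths, "labels": labels}

dataloader = DataLoader(my_data, batch_size=2, shuffle=False,
                        num_workers=4, collate_fn=pad_collate)

For the two example samples above, this yields a 2 x 9 tensor of word indices (the shorter sentence padded with 0) and a tensor of two labels, which is essentially the layout you expected. Note that in practice you would reserve a dedicated padding index rather than reusing 0 for a real word, and the returned lengths can later be passed to pack_padded_sequence if you need it.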

Personally, I have had good experiences using just pad_sequence, sacrificing a bit of speed for a much clearer debugging state, and seemingly others have similar recommendations.
