
Pytorch working with different tensor lengths

I am working with word embeddings, and each phrase has a different length. My dataset contains a list of vectors, one vector of size 300 for each word in the phrase. One phrase may have 20 words, another may have only 2, etc.

For instance:

X.iloc[0:10, ]
0    [[-0.51389, -0.55286, -0.28648, -0.18608, -0.0...
1    [[0.33837, -0.13933, -0.096114, 0.40103, 0.041...
2    [[-0.078564, 0.18702, -0.35168, 0.067557, 0.11...
3    [[0.047356, -0.10216, -0.15738, -0.04521, 0.26...
4    [[0.16781, -0.31793, -0.21877, 0.28025, 0.3364...
5    [[-0.4509, 0.077681, -0.058347, 0.2859, -0.369...
6    [[0.018223, -0.012323, 0.035569, 0.24232, -0.1...
7    [[-0.19265, 0.45863, -0.33841, -0.16293, -0.26...
8    [[0.10751, 0.15958, 0.13332, 0.16642, -0.03273...
9    [[0.35259, 0.60833, 0.051335, -0.079285, -0.35...
Name: embedding, dtype: object

len(X.iloc[0])
313
len(X.iloc[1])
2

The targets are just integers, from 0 to 5.

How can I pad these sequences using PyTorch to feed them into a neural network of fixed input size? I saw something about collate_fn, but I think it only works on batches, not on the whole dataset.

It seems correct that collate_fn can help in this case. Indeed, it is applied at the batch level, but iterating through all batches will process the entire dataset. For padding you can use one of the following functions:

torch.nn.functional.pad

torch.nn.utils.rnn.pad_sequence

Doing it manually in a custom way is also an option.

Here is a toy example with torch.nn.functional.pad and a random dataset:

import torch
import random
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset


MAX_PHRASE_LENGTH = 20


class PhraseDataset(Dataset):

    def __len__(self):
        return 5

    def __getitem__(self, item):
        phrase = torch.rand((random.randint(2, MAX_PHRASE_LENGTH), 300), dtype=torch.float)
        target = torch.tensor(random.randint(0, 5), dtype=torch.float)

        return phrase, target


def collate_fn(batch):
    targets = []
    phrases = []
    for (phrase, target) in batch:
        targets.append(target)
        # number of rows missing to reach the fixed phrase length
        padding = MAX_PHRASE_LENGTH - phrase.size()[0]
        if padding > 0:
            # pad (0, 0) on the embedding dimension and (padding, 0) rows at the front
            phrases.append(F.pad(phrase, (0, 0, padding, 0), "constant", 0))
        else:
            phrases.append(phrase)

    return torch.stack(phrases), torch.tensor(targets, dtype=torch.float)


dataset = PhraseDataset()
data_loader = DataLoader(dataset, batch_size=5, collate_fn=collate_fn)

for phrases, targets in data_loader:
    print('Phrases size:', phrases.size())

Output:

Phrases size: torch.Size([5, 20, 300])
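
If a single fixed length across the whole dataset is not required, torch.nn.utils.rnn.pad_sequence is a more compact alternative: it pads every phrase in a batch with zeros up to the length of the longest phrase in that batch. A minimal sketch of such a collate_fn, reusing the PhraseDataset above (the function name collate_fn_pad_sequence is just illustrative):

from torch.nn.utils.rnn import pad_sequence


def collate_fn_pad_sequence(batch):
    phrases, targets = zip(*batch)
    # pads with zeros up to the longest phrase in this batch,
    # giving a tensor of shape (batch_size, longest_len_in_batch, 300)
    padded = pad_sequence(list(phrases), batch_first=True, padding_value=0.0)
    return padded, torch.stack(targets)

With batch_size=5 the first dimension stays 5, but the second dimension now varies from batch to batch instead of always being MAX_PHRASE_LENGTH.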

Alternatively, padding could be applied inside the dataset itself, before collate_fn is ever called, so a custom collate_fn wouldn't be needed in this case at all.
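
For example, here is a minimal sketch of that approach (the class name PaddedPhraseDataset is hypothetical; it reuses MAX_PHRASE_LENGTH and the imports from the example above), where __getitem__ already returns fixed-size phrases and the default collate function just stacks them:

class PaddedPhraseDataset(Dataset):

    def __len__(self):
        return 5

    def __getitem__(self, item):
        phrase = torch.rand((random.randint(2, MAX_PHRASE_LENGTH), 300), dtype=torch.float)
        target = torch.tensor(random.randint(0, 5), dtype=torch.float)
        # pad rows at the front so every phrase has exactly MAX_PHRASE_LENGTH rows
        padding = MAX_PHRASE_LENGTH - phrase.size(0)
        if padding > 0:
            phrase = F.pad(phrase, (0, 0, padding, 0), "constant", 0)
        return phrase, target


# no custom collate_fn needed: the default one can stack the fixed-size phrases
data_loader = DataLoader(PaddedPhraseDataset(), batch_size=5)

Each batch from this loader again has shape (5, 20, 300), without any custom collate_fn.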
