
Pytorch working with different tensor lengths

I am working with word embeddings, and each phrase has a different length. My dataset contains a list of vectors, one vector of size 300 for each word in the phrase. One phrase may have 20 words, another may have only 2, etc.

For instance:

X.iloc[0:10, ]
0    [[-0.51389, -0.55286, -0.28648, -0.18608, -0.0...
1    [[0.33837, -0.13933, -0.096114, 0.40103, 0.041...
2    [[-0.078564, 0.18702, -0.35168, 0.067557, 0.11...
3    [[0.047356, -0.10216, -0.15738, -0.04521, 0.26...
4    [[0.16781, -0.31793, -0.21877, 0.28025, 0.3364...
5    [[-0.4509, 0.077681, -0.058347, 0.2859, -0.369...
6    [[0.018223, -0.012323, 0.035569, 0.24232, -0.1...
7    [[-0.19265, 0.45863, -0.33841, -0.16293, -0.26...
8    [[0.10751, 0.15958, 0.13332, 0.16642, -0.03273...
9    [[0.35259, 0.60833, 0.051335, -0.079285, -0.35...
Name: embedding, dtype: object

len(X.iloc[0])
313
len(X.iloc[1])
2

The targets are just integers, from 0 to 5.

How can I pad these sequences using PyTorch to feed them into a neural network of fixed input size? I saw something about collate_fn, but I think it only works on batches, not on the whole dataset.

It seems correct that collate_fn can help in this case. Indeed, it is applied at the batch level, but iterating through all batches will process the entire dataset. For padding you can use one of the following functions:

torch.nn.functional.pad

torch.nn.utils.rnn.pad_sequence

Doing it manually in a custom way is also an option.

Here is a toy example with torch.nn.functional.pad and a random dataset:

import torch
import random
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset


MAX_PHRASE_LENGTH = 20


class PhraseDataset(Dataset):

    def __len__(self):
        return 5

    def __getitem__(self, item):
        phrase = torch.rand((random.randint(2, MAX_PHRASE_LENGTH), 300), dtype=torch.float)
        target = torch.tensor(random.randint(0, 5), dtype=torch.float)

        return phrase, target


def collate_fn(batch):
    targets = []
    phrases = []
    for (phrase, target) in batch:
        targets.append(target)
        # number of rows missing to reach the fixed phrase length
        padding = MAX_PHRASE_LENGTH - phrase.size()[0]
        if padding > 0:
            # pad (0, 0) on the embedding dimension and (padding, 0) rows at the front
            phrases.append(F.pad(phrase, (0, 0, padding, 0), "constant", 0))
        else:
            phrases.append(phrase)

    return torch.stack(phrases), torch.tensor(targets, dtype=torch.float)


dataset = PhraseDataset()
data_loader = DataLoader(dataset, batch_size=5, collate_fn=collate_fn)

for phrases, targets in data_loader:
    print('Phrases size:', phrases.size())

Output:

Phrases size: torch.Size([5, 20, 300])
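
If a single fixed length across the whole dataset is not required, torch.nn.utils.rnn.pad_sequence is a more compact alternative: it pads every phrase in a batch with zeros up to the length of the longest phrase in that batch. A minimal sketch of such a collate_fn, reusing the PhraseDataset above (the function name collate_fn_pad_sequence is just illustrative):

from torch.nn.utils.rnn import pad_sequence


def collate_fn_pad_sequence(batch):
    phrases, targets = zip(*batch)
    # pads with zeros up to the longest phrase in this batch,
    # giving a tensor of shape (batch_size, longest_len_in_batch, 300)
    padded = pad_sequence(list(phrases), batch_first=True, padding_value=0.0)
    return padded, torch.stack(targets)

With batch_size=5 the first dimension stays 5, but the second dimension now varies from batch to batch instead of always being MAX_PHRASE_LENGTH.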

Alternatively, padding could be applied inside the dataset itself, before collate_fn is ever called, so a custom collate_fn wouldn't be needed in this case at all.
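
For example, here is a minimal sketch of that approach (the class name PaddedPhraseDataset is hypothetical; it reuses MAX_PHRASE_LENGTH and the imports from the example above), where __getitem__ already returns fixed-size phrases and the default collate function just stacks them:

class PaddedPhraseDataset(Dataset):

    def __len__(self):
        return 5

    def __getitem__(self, item):
        phrase = torch.rand((random.randint(2, MAX_PHRASE_LENGTH), 300), dtype=torch.float)
        target = torch.tensor(random.randint(0, 5), dtype=torch.float)
        # pad rows at the front so every phrase has exactly MAX_PHRASE_LENGTH rows
        padding = MAX_PHRASE_LENGTH - phrase.size(0)
        if padding > 0:
            phrase = F.pad(phrase, (0, 0, padding, 0), "constant", 0)
        return phrase, target


# no custom collate_fn needed: the default one can stack the fixed-size phrases
data_loader = DataLoader(PaddedPhraseDataset(), batch_size=5)

Each batch from this loader again has shape (5, 20, 300), without any custom collate_fn.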
