
How to use the PyTorch Transformer with multi-dimensional sequence-to-sequence?

I'm trying to do seq2seq with a Transformer model. My input and output have the same shape, torch.Size([499, 128]), where 499 is the sequence length and 128 is the number of features.

My input looks like: [image]

My output looks like: [image]

My training loop is:

    for batch in tqdm(dataset):
        optimizer.zero_grad()
        x, y = batch

        x = x.to(DEVICE)
        y = y.to(DEVICE)

        # feed an all-zeros tensor of the same shape as x as the decoder input (tgt)
        pred = model(x, torch.zeros(x.size()).to(DEVICE))

        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()

My model is:

import math
from typing import final
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    def __init__(self, input_dim, output_dim, dim_embedding, num_layers=4, nhead=8, dim_feedforward=2048, dropout=0.5):
        super(Reconstructor, self).__init__()

        self.model_type = 'Transformer'
        self.src_mask = None
        self.pos_encoder = PositionalEncoding(d_model=dim_embedding, dropout=dropout)
        self.transformer = nn.Transformer(d_model=dim_embedding, nhead=nhead, dim_feedforward=dim_feedforward, num_encoder_layers=num_layers, num_decoder_layers=num_layers)
        self.decoder = nn.Linear(dim_embedding, output_dim)
        self.decoder_act_fn = nn.PReLU()

        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        nn.init.zeros_(self.decoder.weight)
        nn.init.uniform_(self.decoder.weight, -initrange, initrange)

    def forward(self, src, tgt):

        pe_src = self.pos_encoder(src.permute(1, 0, 2))  # (seq, batch, features)
        transformer_output = self.transformer_encoder(pe_src)
        decoder_output = self.decoder(transformer_output.permute(1, 0, 2)).squeeze(2)
        decoder_output = self.decoder_act_fn(decoder_output)
        return decoder_output

My output has a shape of torch.Size([32, 499, 128]), where 32 is the batch size, 499 is my sequence length and 128 is the number of features. But the output has the same values at every position:

tensor([[[0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         ...,
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017],
         [0.0014, 0.0016, 0.0017,  ..., 0.0018, 0.0021, 0.0017]]],
       grad_fn=<PreluBackward>)

What am I doing wrong? Thank you so much for any help.

There are several points to check. Since you get the same output for different inputs, I suspect that some layer zeros out all of its inputs. So check the outputs of the PositionalEncoding and also the Encoder block of the Transformer, to make sure they are not constant. But before that, make sure your inputs differ (try to inject noise, for example).
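Forward hooks make this kind of check straightforward. A minimal sketch, reusing the names from the question (model, x, DEVICE); the hooked modules are illustrative and assume the nn.Transformer built in the constructor is the one actually called in forward:

    import torch

    def summarize(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            # a (near-)zero std means the layer collapses its input to a constant
            print(f"{name}: mean={out.mean().item():.4f} std={out.std().item():.4f}")
        return hook

    handles = [
        model.pos_encoder.register_forward_hook(summarize("pos_encoder")),
        model.transformer.encoder.register_forward_hook(summarize("encoder")),
    ]

    with torch.no_grad():
        model(x, torch.zeros(x.size()).to(DEVICE))
        # inject noise: if the printed stats do not change at all, something
        # upstream is discarding the input
        model(x + 0.1 * torch.randn_like(x), torch.zeros(x.size()).to(DEVICE))

    for h in handles:
        h.remove()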

Additionally, from what I see in the pictures, your input and output are speech signals, sampled at 22.05 kHz (I guess), so they should have ~10k features, but you claim that you have only 128. This is another place to check. Now, the number 499 represents some time slicing. Make sure your slices are in a reasonable range (20-50 ms, usually 30 ms). If that is the case, then 30 ms times 500 is 15 seconds, which is much more than you have in your example. And finally, you are masking off a third of a second of speech in your input, which is too much, I believe.
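To put numbers on that framing arithmetic, here is a quick sanity check (the 22.05 kHz rate and the 30 ms window are assumptions, as above):

    sr = 22050                  # assumed sampling rate (Hz)
    frame_ms = 30               # a typical analysis window (20-50 ms)
    n_frames = 499              # sequence length from the question

    covered_sec = n_frames * frame_ms / 1000
    print(f"{n_frames} frames x {frame_ms} ms = {covered_sec:.1f} s of audio")
    # -> ~15 s, so either the clips really are that long, the frames overlap
    #    heavily, or 499 is not a frame count at all

    # 128 features per frame looks more like e.g. a 128-bin mel spectrogram
    # than raw audio: one 30 ms frame of raw samples would hold far more values
    print(f"raw samples per frame: {int(sr * frame_ms / 1000)}")  # ~661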

I think it would be useful to examine the Wav2vec and Wav2vec 2.0 papers, which tackle the problem of self-supervised training in the speech recognition domain using a Transformer Encoder, with great success.
