PyTorch - How to add self-attention to another architecture

Question

I'm a beginner with the PyTorch framework and I'm trying to add multi-headed self-attention on top of another architecture (BERT). This is probably a simple question, but I'm not familiar with PyTorch:

UPDATE 1

import math

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.d_model = d_model

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
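        pe = pe.unsqueeze(0)  # shape (1, max_len, d_model), broadcasts over the batch
        self.register_buffer('pe', pe)

    def forward(self, x, seq_len=None, mask=None):
        # x: (batch, seq_len, d_model); mask: (batch, seq_len), 1 for real tokens
        if seq_len is None:
            seq_len = x.size(1)
        pos_emb = self.pe[:, :seq_len]
        if mask is not None:
            x = x * mask[:, :, None].float()  # zero out padded positions
        x = x + pos_emb
        return self.dropout(x)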

The problem with how to add the Transformer is in the following class:

class CamemBERTQA(nn.Module):
   def __init__(self,bert_type, hidden_size, num_labels, num_inter_layers=1, heads = 12, do_lower_case = True):
       super(CamemBERTQA, self).__init__()
       self.do_lower_case = do_lower_case
       self.bert_type = bert_type
       self.hidden_size = hidden_size
       self.num_labels = num_labels       
       self.num_inter_layers = num_inter_layers
       self.camembert = CamembertModel.from_pretrained(self.bert_type)

       # ---------------- Transformer ------------------------------------------
       self.d_model = self.hidden_size # 768
       dropout = 0.1
       self.pos_emb = PositionalEncoding(d_model = self.d_model, dropout = dropout)
       self.transformer_inter = nn.ModuleList(
           [nn.TransformerEncoderLayer(d_model = self.d_model, nhead = heads, dim_feedforward = 2048, dropout = dropout)
            for _ in range(num_inter_layers)])
       # ---------------- Transformer ------------------------------------------

       self.qa_outputs = nn.Linear(self.hidden_size, self.num_labels)



   def forward(self, input_ids, mask=None):
       bert_output = self.camembert(input_ids = input_ids) # input_ids is a tensor

       # ---------------- Transformer ------------------------------------------
       seq_len = self.hidden_size
       x = self.pos_emb(x = bert_output, seq_len = seq_len, mask = None)

       for i in range(self.num_inter_layers):
           x = self.transformer_inter[i](i, x, x, 1 - mask)  # all_tokens * max_tokens * dim
       output = self.layer_norm(x)
       # ---------------- Transformer ------------------------------------------

       sequence_output = output[0]
       logits = self.qa_outputs(sequence_output)
       start_logits, end_logits = logits.split(1, dim=-1)
       start_logits = start_logits.squeeze(-1)
       end_logits = end_logits.squeeze(-1)
       outputs = (start_logits, end_logits,)
       return x

Thank you so much.

Answer 1

It looks like you are trying to add a Transformer network on top of the BERT component. Note that self-attention is only one part of a Transformer: each Transformer layer also contains other components, such as a position-wise feed-forward network, residual connections, and layer normalization. I would recommend using a Transformer encoder (which includes the self-attention component) that receives the BERT vectors and transforms them into another representation (in another space).
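To make the distinction concrete, here is a minimal sketch using PyTorch's built-in modules. Using these particular modules is an assumption (the answer does not name an implementation), and the random tensor only stands in for real BERT output. nn.MultiheadAttention is the self-attention sub-layer alone, while nn.TransformerEncoderLayer wraps self-attention together with a feed-forward network, residual connections, and layer normalization. The batch_first argument needs a reasonably recent PyTorch release.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, heads = 768, 12          # BERT-base sizes, as in the question
x = torch.randn(2, 16, hidden_size)   # stand-in for BERT output: (batch, seq_len, hidden)

# Self-attention only: query, key and value are all the same sequence.
self_attn = nn.MultiheadAttention(embed_dim=hidden_size, num_heads=heads, batch_first=True)
attn_out, attn_weights = self_attn(x, x, x)

# Full encoder layer: self-attention + feed-forward + residuals + layer norm.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size, nhead=heads, dim_feedforward=2048, dropout=0.1, batch_first=True
)
enc_out = encoder_layer(x)

# The encoder layer has more parameters because it contains more than attention.
n_attn = sum(p.numel() for p in self_attn.parameters())
n_layer = sum(p.numel() for p in encoder_layer.parameters())

print(attn_out.shape, attn_weights.shape, enc_out.shape)
print(n_attn, n_layer)
```

Both modules keep the (batch, seq_len, hidden) shape, so either can sit on top of BERT; the encoder layer simply does more work per layer.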

Try this instead of self.attention = MultiHeadAttention():

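# TransformerEncoderLayer is a placeholder class (see the note at the end)
self.transformer_inter = nn.ModuleList(
    [TransformerEncoderLayer(d_model, heads, d_ff, dropout)
     for _ in range(num_inter_layers)])
self.layer_norm = nn.LayerNorm(d_model)  # applied after the last layer in forward()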

Then, in forward(), call the layers in self.transformer_inter in a loop, which gives you the representations produced by the Transformer architecture. Like this:

def forward(self, bert_output, mask):
    # bert_output: (batch, seq_len, d_model)
    # mask: (batch, seq_len), 1 for real tokens and 0 for padding
    seq_len = bert_output.size(1)

    # Transformer encoder
    pos_emb = self.pos_emb.pe[:, :seq_len]
    x = bert_output * mask[:, :, None].float()  # zero out padded positions
    x = x + pos_emb

    for i in range(self.num_inter_layers):
        # 1 - mask marks the padded positions for the placeholder layer
        x = self.transformer_inter[i](i, x, x, 1 - mask)  # all_tokens * max_tokens * dim
    x = self.layer_norm(x)  # the Transformer also normalizes the final output

    # x now holds the vectors encoded by the Transformer encoder
    return x
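The forward pass above depends on a placeholder layer. As a runnable sketch of the same idea, the version below swaps in PyTorch's built-in nn.TransformerEncoderLayer, which is an assumption and not necessarily the layer the answer had in mind. The class name TransformerOnTop, the max_len default, and the small sizes in the usage lines are hypothetical. The built-in layer takes the padding mask through src_key_padding_mask, where True means "ignore this position".

```python
import math
import torch
import torch.nn as nn


class TransformerOnTop(nn.Module):  # hypothetical name, not from the original post
    def __init__(self, d_model=768, heads=12, d_ff=2048, dropout=0.1,
                 num_inter_layers=1, max_len=512):
        super().__init__()
        # Sinusoidal position table with shape (1, max_len, d_model).
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))

        self.transformer_inter = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, heads, d_ff, dropout, batch_first=True)
            for _ in range(num_inter_layers)
        ])
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, bert_output, mask):
        # bert_output: (batch, seq_len, d_model)
        # mask: (batch, seq_len), 1 = real token, 0 = padding
        seq_len = bert_output.size(1)
        x = bert_output * mask[:, :, None].float()   # zero out padded positions
        x = x + self.pe[:, :seq_len]
        padding_mask = mask == 0                     # True where attention is blocked
        for layer in self.transformer_inter:
            x = layer(x, src_key_padding_mask=padding_mask)
        return self.layer_norm(x)


# Usage with small, made-up sizes; a random tensor stands in for BERT output.
model = TransformerOnTop(d_model=32, heads=4, d_ff=64, num_inter_layers=2, max_len=20)
model.eval()
bert_output = torch.randn(2, 10, 32)
mask = torch.ones(2, 10, dtype=torch.long)
mask[1, 6:] = 0                                      # second sequence has 4 padding tokens
out = model(bert_output, mask)
print(out.shape)
```

With a Hugging Face model such as CamembertModel, the tensor to pass in is the first element of the model output (the last hidden state), not the whole output object.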

Then use an nn.Linear layer to map hidden_size to the number of labels for your task, which gives you the logits for each label. All of this should be done inside the model class that you posted.
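As a small sketch of that last step for the question's QA setting: with num_labels = 2, the linear layer produces one start logit and one end logit per token, which are then split exactly as in the question's code. The random tensor, the example positions, and the cross-entropy loss are assumptions for illustration; the original post does not show a loss.

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2           # two labels: answer start and answer end
qa_outputs = nn.Linear(hidden_size, num_labels)

x = torch.randn(4, 128, hidden_size)       # stand-in for the Transformer output
logits = qa_outputs(x)                     # (batch, seq_len, 2)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1)    # (batch, seq_len)
end_logits = end_logits.squeeze(-1)        # (batch, seq_len)

# Assumed training objective: cross-entropy over token positions,
# averaged over the start and end predictions.
start_positions = torch.tensor([3, 10, 0, 57])
end_positions = torch.tensor([5, 12, 0, 60])
loss_fn = nn.CrossEntropyLoss()
loss = (loss_fn(start_logits, start_positions) + loss_fn(end_logits, end_positions)) / 2
print(start_logits.shape, end_logits.shape, loss.item())
```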

Note that TransformerEncoderLayer above is a placeholder class, so you have to either implement it yourself or use an open-source package. Since Transformers are well known, you should have no trouble finding an implementation.
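One readily available implementation ships with PyTorch itself. The sketch below is an assumption about which package to use, not the answer's own choice: nn.TransformerEncoder stacks copies of nn.TransformerEncoderLayer and can apply a final normalization, so it replaces both the loop and the closing layer norm. Positional encodings are not included and would still need to be added to the input. The layer count and the random input are made up for illustration.

```python
import torch
import torch.nn as nn

d_model, heads, d_ff, dropout, num_inter_layers = 768, 12, 2048, 0.1, 2

layer = nn.TransformerEncoderLayer(d_model, heads, d_ff, dropout, batch_first=True)
encoder = nn.TransformerEncoder(
    layer,
    num_layers=num_inter_layers,
    norm=nn.LayerNorm(d_model),      # final normalization of the encoder output
    enable_nested_tensor=False,      # keep the padded output shape unchanged
)

bert_output = torch.randn(2, 10, d_model)            # stand-in for BERT vectors
mask = torch.tensor([[1] * 10, [1] * 6 + [0] * 4])   # 1 = real token, 0 = padding

# True in src_key_padding_mask means the position is ignored by attention.
x = encoder(bert_output, src_key_padding_mask=(mask == 0))
print(x.shape)
```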

Solution 1
0 2020-05-07 18:42:44