PyTorch - How to add self-attention to another architecture
I'm a beginner with the PyTorch framework and I'm trying to add multi-headed self-attention on top of another architecture (BERT). This is a simple question, but I'm not familiar with PyTorch:
UPDATE 1
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.d_model = d_model
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x, seq_len = 768, mask = None):
        pos_emb = self.pe[:, :seq_len]
        x = x * mask[:, :, None].float()
        x = x + pos_emb
        return x
The problem with how to add the transformer is in the following class:
from transformers import CamembertModel

class CamemBERTQA(nn.Module):
    def __init__(self, bert_type, hidden_size, num_labels, num_inter_layers=1, heads = 12, do_lower_case = True):
        super(CamemBERTQA, self).__init__()
        self.do_lower_case = do_lower_case
        self.bert_type = bert_type
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.num_inter_layers = num_inter_layers
        self.camembert = CamembertModel.from_pretrained(self.bert_type)
        # ---------------- Transformer ------------------------------------------
        self.d_model = self.hidden_size # 768
        dropout = 0.1
        self.pos_emb = PositionalEncoding(d_model = self.d_model, dropout = dropout)
        self.transformer_inter = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model = self.d_model, nhead = heads, dim_feedforward = 2048, dropout = dropout)
             for _ in range(num_inter_layers)])
        # ---------------- Transformer ------------------------------------------
        self.qa_outputs = nn.Linear(self.hidden_size, self.num_labels)

    def forward(self, input_ids, mask=None):
        bert_output = self.camembert(input_ids = input_ids) # input_ids is a tensor
        # ---------------- Transformer ------------------------------------------
        seq_len = self.hidden_size
        x = self.pos_emb(x = bert_output, seq_len = seq_len, mask = None)
        for i in range(self.num_inter_layers):
            x = self.transformer_inter[i](i, x, x, 1 - mask) # all_tokens * max_tokens * dim
        output = self.layer_norm(x)
        # ---------------- Transformer ------------------------------------------
        sequence_output = output[0]
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
        outputs = (start_logits, end_logits,)
        return x
Thank you so much.
It seems that you are trying to add a Transformer network on top of the BERT component. Note that self-attention is only one part of a Transformer; a Transformer has other components as well. I would recommend using a Transformer (which includes the self-attention component) as an encoder that receives the BERT vectors and transforms them into another representation (in another space).
Try this instead of self.attention = MultiHeadAttention():
self.transformer_inter = nn.ModuleList(
    [TransformerEncoderLayer(d_model, heads, d_ff, dropout)
     for _ in range(num_inter_layers)])
and then in forward(), call self.transformer_inter in a loop, which will give you the representations produced by the Transformer architecture. Like this:
def forward(self, bert_output, mask):
    batch_size, seq_len = bert_output.size(0), bert_output.size(1)
    # Transformer Encoder
    pos_emb = self.pos_emb.pe[:, :seq_len]
    x = bert_output * mask[:, :, None].float()
    x = x + pos_emb
    for i in range(self.num_inter_layers):
        x = self.transformer_inter[i](i, x, x, 1 - mask) # all_tokens * max_tokens * dim
    x = self.layer_norm(x) # Transformer also normalizes the outputs from each layer.
    # x is the encoded vectors by Transformer encoder
    return x
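For readers who want something they can run directly, here is a minimal sketch of the same loop using PyTorch's built-in nn.TransformerEncoderLayer in place of the placeholder class. Several things here are assumptions rather than part of the answer: the class name InterEncoder, the helper sinusoidal_table, the batch-first layout (batch_first=True needs PyTorch 1.9 or later), an even d_model, and a mask that holds 1 for real tokens and 0 for padding. The built-in layer takes a boolean src_key_padding_mask where True marks positions to ignore, so its call signature differs from the (i, x, x, 1 - mask) call above.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_table(max_len, d_model):
    """Hypothetical helper: (1, max_len, d_model) sinusoidal table, d_model assumed even."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe.unsqueeze(0)  # leading batch axis so it broadcasts over the batch


class InterEncoder(nn.Module):
    """Hypothetical name: encoder layers stacked on top of BERT-sized vectors."""

    def __init__(self, d_model=768, heads=12, d_ff=2048, dropout=0.1,
                 num_inter_layers=1, max_len=512):
        super().__init__()
        self.register_buffer("pe", sinusoidal_table(max_len, d_model))
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=d_model, nhead=heads, dim_feedforward=d_ff,
                dropout=dropout, batch_first=True)
            for _ in range(num_inter_layers)
        ])
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, bert_output, mask):
        # bert_output: (batch, seq_len, d_model); mask: (batch, seq_len), 1 = token, 0 = pad
        seq_len = bert_output.size(1)
        x = bert_output * mask[:, :, None].float()  # zero out padded positions
        x = x + self.pe[:, :seq_len]                # add positional information
        padding_mask = mask == 0                    # True where attention should be blocked
        for layer in self.layers:
            x = layer(x, src_key_padding_mask=padding_mask)
        return self.layer_norm(x)


# Usage with small sizes and random data standing in for BERT vectors.
encoder = InterEncoder(d_model=16, heads=4, d_ff=32, num_inter_layers=2)
vectors = torch.randn(2, 5, 16)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
encoded = encoder(vectors, mask)  # shape (2, 5, 16)
```

With a Hugging Face model, the tensor to pass in as bert_output would be the first element of the model's return value (the last hidden state), not the whole output object.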
Then, using an nn.Linear(.) layer, apply another transformation to map hidden_size to the number of labels for your task, which will give you the logits for each label. All of this should be done within the BERT class that you have posted.
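As a rough illustration of that final projection, the sketch below maps encoder outputs to two labels and splits them into start and end logits, the same way the question's code does. The batch size, the sequence length, and the random tensor standing in for the encoder output are assumptions chosen only to make the shapes visible.

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2  # assumed: the 2 labels are span start and span end
qa_outputs = nn.Linear(hidden_size, num_labels)

# Random stand-in for the encoder output: (batch, seq_len, hidden_size).
encoded = torch.randn(4, 32, hidden_size)

logits = qa_outputs(encoded)                        # (4, 32, 2)
start_logits, end_logits = logits.split(1, dim=-1)  # two tensors of shape (4, 32, 1)
start_logits = start_logits.squeeze(-1)             # (4, 32)
end_logits = end_logits.squeeze(-1)                 # (4, 32)
```

The model's forward() would then return these logits, for example as a tuple (start_logits, end_logits), rather than the encoder output x.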
Note that TransformerEncoderLayer is a placeholder class that I used above, so you have to either implement it or use an open source package. As Transformers are quite well known, I think you won't have trouble finding an implementation.
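If you choose to implement the layer yourself, one possible shape for it is sketched below. It is built on nn.MultiheadAttention and shows the components a Transformer encoder layer has besides self-attention: a feed-forward block, residual connections, and layer normalization. The class name SimpleEncoderLayer, the post-norm ordering, and the batch-first layout (PyTorch 1.9 or later) are assumptions made for illustration, not the class the answer had in mind.

```python
import torch
import torch.nn as nn


class SimpleEncoderLayer(nn.Module):
    """Hypothetical minimal encoder layer: self-attention followed by a feed-forward block."""

    def __init__(self, d_model, heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, heads, dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        # x: (batch, seq_len, d_model); padding_mask: (batch, seq_len) bool, True = ignore
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=padding_mask)
        x = self.norm1(x + self.dropout(attn_out))              # residual + norm around attention
        x = self.norm2(x + self.dropout(self.feed_forward(x)))  # residual + norm around feed-forward
        return x


# Usage with small sizes; the last two tokens of the first sequence are padding.
layer = SimpleEncoderLayer(d_model=16, heads=4, d_ff=32)
tokens = torch.randn(2, 5, 16)
padding_mask = torch.tensor([[False, False, False, True, True],
                             [False, False, False, False, False]])
out = layer(tokens, padding_mask)  # shape (2, 5, 16)
```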