
How to solve size mismatch of Multi Head Attention in pytorch?

I am learning how to code Multi Head Attention in pytorch now.

I can't solve the size mismatch problem when the input tensor has 4 dimensions.

I am using the function and class definitions from http://nlp.seas.harvard.edu/2018/04/03/attention.html

Sorry for the inconvenience. Can you give me some advice?


# attention def and class

import copy
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)

    return torch.matmul(p_attn, value), p_attn
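
As a quick shape check (a minimal sketch with arbitrary example sizes, not part of the original post): query, key, and value must share the same last dimension d_k, and the returned attention weights have shape (..., seq_len, seq_len).

# illustrative shape check for attention() (example sizes are assumptions)
q = torch.randn(2, 4, 10, 16)   # (batch, heads, seq_len, d_k)
k = torch.randn(2, 4, 10, 16)
v = torch.randn(2, 4, 10, 16)
out, p_attn = attention(q, k, v)
print(out.size())     # torch.Size([2, 4, 10, 16])
print(p_attn.size())  # torch.Size([2, 4, 10, 10])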

# MultiHead Attention class

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, 
                                 dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
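
For reference (a minimal sketch with arbitrary example sizes, not from the original post), the module expects query, key, and value to each have shape (batch, seq_len, d_model), where d_model matches the size passed to the constructor:

# illustrative shape check for MultiHeadedAttention (example sizes are assumptions)
mha = MultiHeadedAttention(h=8, d_model=512, dropout=0.1)
x = torch.randn(2, 10, 512)     # (batch, seq_len, d_model)
print(mha(x, x, x).size())      # torch.Size([2, 10, 512])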


# create test_4_dim tensor
X=torch.randn(10,5,64,64)
X=X.view(X.shape[0],X.shape[1],X.shape[2]*X.shape[3])
#X:torch.Size([10, 5, 4096])
query_=X.transpose(2,1)
key_=X
value_=X

print("query:",query_.size())
print("key:",key_.size())
print("value:",value_.size())
#query: torch.Size([10, 4096, 5])
#key: torch.Size([10, 5, 4096])
#value: torch.Size([10, 5, 4096])

multihead_testmodel= MultiHeadedAttention(h=4,d_model=4096,dropout=0.1)
#print(multihead_model)

output=multihead_testmodel(query=query_,key=key_,value=value_)
print("model output:",output.size())

# size mismatch, m1: [40960 x 5], m2: [4096 x 4096] at ../aten/src/TH/generic/THTensorMath.cpp:197

In case the tensor size is torch.randn(5,64,64), this code has no error.

X=torch.randn(5,64,64)
#X=X.view(X.shape[0],X.shape[1],X.shape[2]*X.shape[3])

query_=X.transpose(2,1)
key_=X
value_=X

print("query:",query_.size())
print("key:",key_.size())
print("value:",value_.size())

#query: torch.Size([5, 64, 64])
#key: torch.Size([5, 64, 64])
#value: torch.Size([5, 64, 64])

multihead_model= MultiHeadedAttention(h=4,d_model=64,dropout=0.1)
temp_output=multihead_model(query=query_,key=key_,value=value_)
print(temp_output.size())
#torch.Size([5, 64, 64])

Looks like the code expects query, key, and value to have the same dimensions: the first nn.Linear expects the last dimension to be d_model = 4096, but the transposed query has last dimension 5, which is exactly the [40960 x 5] vs [4096 x 4096] mismatch in the error. So if you don't transpose, it fixes the issue:

query_ = X
key_ = X
value_ = X

You're right that there needs to be a transpose for the attention to work, but the code already handles this by calling key.transpose(-2, -1) in the attention implementation.
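
Applying that to the 4-dim example above (a minimal sketch reusing the same X and model from the question), the call runs once query is no longer transposed:

X = torch.randn(10, 5, 64, 64)
X = X.view(X.shape[0], X.shape[1], X.shape[2] * X.shape[3])   # torch.Size([10, 5, 4096])

# all three inputs keep the last dimension equal to d_model (4096)
query_ = X
key_ = X
value_ = X

multihead_testmodel = MultiHeadedAttention(h=4, d_model=4096, dropout=0.1)
output = multihead_testmodel(query=query_, key=key_, value=value_)
print("model output:", output.size())
# model output: torch.Size([10, 5, 4096])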
