
How to add an attention mechanism in keras?

I'm currently using this code that I got from one discussion on GitHub. Here's the code of the attention mechanism:

# Keras 1-style imports (the `merge` function was later replaced by Multiply in Keras 2)
from keras.layers import Input, Embedding, LSTM, Dense, Flatten, Activation, RepeatVector, Permute, Lambda, merge
from keras import backend as K

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=False
    )(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)


sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

Is this the correct way to do it? I was sort of expecting the existence of a time-distributed layer, since the attention mechanism is applied at every time step of the RNN. I need someone to confirm that this implementation (the code) is a correct implementation of an attention mechanism. Thank you.

If you want to have attention along the time dimension, then this part of your code seems correct to me:

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

sent_representation = merge([activations, attention], mode='mul')

You've worked out the attention vector of shape (batch_size, max_length):

attention = Activation('softmax')(attention)
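
For reference, the intermediate shapes work out as follows (a sketch assuming batch-first tensors):

    # Dense(1, activation='tanh')  -> (batch_size, max_length, 1)    one score per timestep
    # Flatten()                    -> (batch_size, max_length)
    # Activation('softmax')        -> (batch_size, max_length)       weights over timesteps sum to 1
    # RepeatVector(units)          -> (batch_size, units, max_length)
    # Permute([2, 1])              -> (batch_size, max_length, units)
    # merge(..., mode='mul')       -> (batch_size, max_length, units) element-wise weighting of the activations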

I've never seen this code before, so I can't say if this one is actually correct or not:

K.sum(xin, axis=-2)
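
A quick sketch with the Keras backend shows what that sum does to the shapes (illustrative sizes only):

    import numpy as np
    from keras import backend as K

    xin = K.constant(np.random.rand(2, 5, 4))    # (batch_size, max_length, units)
    # axis=-2 is the time axis here, so the sum collapses the timesteps
    print(K.eval(K.sum(xin, axis=-2)).shape)     # (2, 4) -> (batch_size, units)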

Further reading (you might have a look):

The attention mechanism pays attention to different parts of the sentence:

activations = LSTM(units, return_sequences=True)(embedded)

And it determines the contribution of each hidden state of that sentence by:

  1. Computing an aggregation of each hidden state: attention = Dense(1, activation='tanh')(activations)
  2. Assigning weights to the different states: attention = Activation('softmax')(attention)

And finally it pays attention to the different states:

sent_representation = merge([activations, attention], mode='mul')

I don't quite understand this part: sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
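
As a sketch of what that line computes (NumPy, with made-up numbers): multiplying the activations by the repeated attention weights and then summing over the time axis gives the usual attention-weighted average of the hidden states.

    import numpy as np

    # 1 sample, 3 timesteps, 2 hidden units (illustrative values)
    activations = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]])   # (1, 3, 2)
    weights = np.array([0.2, 0.3, 0.5])                              # softmax output over timesteps
    attention = np.repeat(weights[None, :, None], 2, axis=2)         # (1, 3, 2), like RepeatVector + Permute

    sent_representation = (activations * attention).sum(axis=-2)     # (1, 2)
    print(sent_representation)   # [[3.6 4.6]] = 0.2*[1,2] + 0.3*[3,4] + 0.5*[5,6]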

To understand more, you can refer to this and this, and also this one gives a good implementation; see if you can understand more on your own.

Recently I was working on applying an attention mechanism on a dense layer, and here is one sample implementation:

from keras.layers import Input, Dense, multiply
from keras.models import Model
from keras import regularizers

def build_model():
    input_dims = train_data_X.shape[1]
    inputs = Input(shape=(input_dims,))
    dense1800 = Dense(1800, activation='relu', kernel_regularizer=regularizers.l2(0.01))(inputs)
    # one attention weight per feature, squashed to [0, 1]
    attention_probs = Dense(1800, activation='sigmoid', name='attention_probs')(dense1800)
    # element-wise gating of the dense features by the attention weights
    attention_mul = multiply([dense1800, attention_probs], name='attention_mul')
    dense7 = Dense(7, kernel_regularizer=regularizers.l2(0.01), activation='softmax')(attention_mul)
    model = Model(inputs=[inputs], outputs=dense7)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model()
model.summary()

model.fit(train_data_X, train_data_Y_, epochs=20, validation_split=0.2, batch_size=600, shuffle=True, verbose=1)
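
Here train_data_X and train_data_Y_ are assumed to be preprocessed NumPy arrays, for example (made-up sizes, only to show the expected shapes):

    import numpy as np
    train_data_X = np.random.rand(1000, 50)                    # (samples, input_dims)
    train_data_Y_ = np.eye(7)[np.random.randint(0, 7, 1000)]   # one-hot labels for the 7 classes

Note that the attention here uses a sigmoid rather than a softmax, so each of the 1800 features gets an independent gate in [0, 1] instead of weights that sum to one across features.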


I think you can try the following code to add a Keras self-attention mechanism to an LSTM network:

    from keras.layers import Input, Embedding, LSTM, Flatten, Dense
    from keras.models import Model
    from keras.optimizers import Adam
    from keras_self_attention import SeqSelfAttention

    inputs = Input(shape=(length,))
    embedding = Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
                          input_length=MAX_SEQUENCE_LENGTH, trainable=False)(inputs)
    lstm = LSTM(num_lstm, return_sequences=True)(embedding)
    # self-attention over the LSTM output sequence
    attn = SeqSelfAttention(attention_activation='sigmoid')(lstm)
    flat = Flatten()(attn)
    dense = Dense(32, activation='relu')(flat)
    outputs = Dense(3, activation='sigmoid')(dense)
    model = Model(inputs=[inputs], outputs=outputs)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val), shuffle=True)
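
SeqSelfAttention comes from the third-party keras-self-attention package (installed with pip install keras-self-attention); by default it returns a sequence with the same time dimension as its input, which is why a Flatten layer is used before the final Dense layers. Also note that sigmoid outputs with binary_crossentropy treat the 3 outputs as independent labels; for mutually exclusive classes, softmax with categorical_crossentropy is the usual choice.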

While many good alternatives are given, I have tried to modify the code you have shared to make it work. I have also answered your other query that has not been addressed so far:

Q1. Is this the correct way to do it? The attention layer itself looks good. No changes needed. The way you have used the output of the attention layer can be slightly simplified and modified to incorporate some recent framework upgrades:

    sent_representation = merge.Multiply()([activations, attention])
    sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)
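
For reference, a complete sketch of the question's model with these two lines slotted in (assuming the Keras 2 functional API, with Multiply from keras.layers in place of the old merge(..., mode='mul'); max_length, vocab_size, embedding_size and units are the question's hyperparameters):

    from keras.layers import (Input, Embedding, LSTM, Dense, Flatten,
                              Activation, RepeatVector, Permute, Multiply, Lambda)
    from keras.models import Model
    from keras import backend as K

    _input = Input(shape=[max_length], dtype='int32')
    embedded = Embedding(input_dim=vocab_size, output_dim=embedding_size,
                         input_length=max_length, trainable=False, mask_zero=False)(_input)

    activations = LSTM(units, return_sequences=True)(embedded)

    # one score per timestep, normalised into attention weights
    attention = Dense(1, activation='tanh')(activations)
    attention = Flatten()(attention)
    attention = Activation('softmax')(attention)
    attention = RepeatVector(units)(attention)
    attention = Permute([2, 1])(attention)

    # weight the hidden states and sum over the time axis
    sent_representation = Multiply()([activations, attention])
    sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)

    probabilities = Dense(3, activation='softmax')(sent_representation)
    model = Model(inputs=_input, outputs=probabilities)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])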

You are now good to go!

Q2. I was sort of expecting the existence of a time-distributed layer, since the attention mechanism is applied at every time step of the RNN.

No, you don't need a time-distributed layer; otherwise the weights would be shared across timesteps, which is not what you want.

You can refer to https://towardsdatascience.com/create-your-own-custom-attention-layer-understand-all-flavours-2201b5e8be9e for other specific details.
