
Keras attention layer over LSTM

I'm using Keras 1.0.1 and I'm trying to add an attention layer on top of an LSTM. This is what I have so far, but it doesn't work.

input_ = Input(shape=(input_length, input_dim))
lstm = GRU(self.HID_DIM, input_dim=input_dim, input_length = input_length, return_sequences=True)(input_)
att = TimeDistributed(Dense(1)(lstm))
att = Reshape((-1, input_length))(att)
att = Activation(activation="softmax")(att)
att = RepeatVector(self.HID_DIM)(att)
merge = Merge([att, lstm], "mul")
hid = Merge("sum")(merge)

last = Dense(self.HID_DIM, activation="relu")(hid)

The network should apply an LSTM over the input sequence. Then each hidden state of the LSTM should be input into a fully connected layer, over which a softmax is applied. The softmax is replicated for each hidden dimension and multiplied by the LSTM hidden states elementwise. Then the resulting vector should be averaged.

EDIT: This compiles, but I'm not sure if it does what I think it should do.

input_ = Input(shape=(input_length, input_dim))
lstm = GRU(self.HID_DIM, input_dim=input_dim, input_length=input_length, return_sequences=True)(input_)
att = TimeDistributed(Dense(1))(lstm)                  # one score per timestep: (batch, input_length, 1)
att = Flatten()(att)                                   # (batch, input_length)
att = Activation(activation="softmax")(att)            # attention weights over the time axis
att = RepeatVector(self.HID_DIM)(att)                  # (batch, HID_DIM, input_length)
att = Permute((2, 1))(att)                             # (batch, input_length, HID_DIM)
mer = merge([att, lstm], "mul")                        # weight the hidden states elementwise
hid = AveragePooling1D(pool_length=input_length)(mer)  # average over time: (batch, 1, HID_DIM)
hid = Flatten()(hid)                                   # (batch, HID_DIM)
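
One quick way to sanity-check whether the graph does what it should is to wrap it in a Model and inspect the layer output shapes. A minimal sketch, assuming Keras 1.x (where Model takes input/output keyword arguments) and that input_, hid, and self.HID_DIM are defined as above:

from keras.models import Model

# Build the model and print the per-layer output shapes;
# hid should come out as (None, self.HID_DIM).
model = Model(input=input_, output=hid)
model.summary()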

Here is an implementation of an attention LSTM with Keras, and an example of its instantiation. I haven't tried it myself, though.

The first piece of code you have shared is incorrect. The second piece of code looks correct except for one thing: do not use TimeDistributed, as the weights will be the same. Use a regular Dense layer with a non-linear activation.


    input_ = Input(shape=(input_length, input_dim))
    lstm = GRU(self.HID_DIM, input_dim=input_dim, input_length=input_length, return_sequences=True)(input_)
    att = Dense(1, activation='tanh')(lstm)        # one attention score per timestep
    att = Flatten()(att)                           # (batch, input_length)
    att = Activation(activation="softmax")(att)    # normalize scores over the time axis
    att = RepeatVector(self.HID_DIM)(att)          # (batch, HID_DIM, input_length)
    att = Permute((2, 1))(att)                     # (batch, input_length, HID_DIM)
    mer = merge([att, lstm], "mul")                # weight the hidden states elementwise

Now you have the weight-adjusted states. How you use them is up to you. Most versions of attention I have seen just sum these over the time axis and then use the output as the context.
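
For example, that summation over the time axis might look like the sketch below in the Keras 1.x functional API, assuming the mer tensor and self.HID_DIM from the snippet above (output_shape is passed explicitly because Keras 1.x Lambda layers cannot always infer it):

    from keras import backend as K
    from keras.layers import Lambda

    # Sum the weight-adjusted states over the time axis to obtain a single
    # context vector of shape (batch, HID_DIM).
    context = Lambda(lambda x: K.sum(x, axis=1),
                     output_shape=(self.HID_DIM,))(mer)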


 