
How to build an attention model with Keras?

I am trying to understand attention models and to build one myself. After many searches I came across this website, which has a simple-looking attention model coded in Keras. But when I try to build the same model on my machine, it throws a multiple-argument error. The error comes from mismatched argument passing to the class Attention: the website's Attention class takes a single constructor argument, yet the code instantiates the attention object with two arguments.

import tensorflow as tf

max_len = 200
rnn_cell_size = 128
vocab_size=250

class Attention(tf.keras.Model):
    def __init__(self, units):
        super(Attention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
    def call(self, features, hidden):
        # features: (batch, time, units) encoder outputs; hidden: (batch, units) last state
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # Bahdanau (additive) score, softmaxed over the time axis
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        # context vector = attention-weighted sum of the encoder outputs
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

sequence_input = tf.keras.layers.Input(shape=(max_len,), dtype='int32')

embedded_sequences = tf.keras.layers.Embedding(vocab_size, 128, input_length=max_len)(sequence_input)

lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(rnn_cell_size,
                         dropout=0.3,
                         return_sequences=True,
                         return_state=True,
                         recurrent_activation='relu',
                         recurrent_initializer='glorot_uniform'),
    name="bi_lstm_0")(embedded_sequences)

lstm, forward_h, forward_c, backward_h, backward_c = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(rnn_cell_size,
                         dropout=0.2,
                         return_sequences=True,
                         return_state=True,
                         recurrent_activation='relu',
                         recurrent_initializer='glorot_uniform'))(lstm)

state_h = tf.keras.layers.Concatenate()([forward_h, backward_h])
state_c = tf.keras.layers.Concatenate()([forward_c, backward_c])

#  PROBLEM IN THIS LINE
context_vector, attention_weights = Attention(lstm, state_h)

output = tf.keras.layers.Dense(1, activation='sigmoid')(context_vector)

model = tf.keras.Model(inputs=sequence_input, outputs=output)

# summarize layers
print(model.summary())

How can I make this model work?

There is a problem with the way you initialize the attention layer and pass it parameters. You need to specify the number of attention units when constructing the layer, and then pass the tensors when calling it:

context_vector, attention_weights = Attention(32)(lstm, state_h)

The result:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 200)          0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 200, 128)     32000       input_1[0][0]                    
__________________________________________________________________________________________________
bi_lstm_0 (Bidirectional)       [(None, 200, 256), ( 263168      embedding[0][0]                  
__________________________________________________________________________________________________
bidirectional (Bidirectional)   [(None, 200, 256), ( 394240      bi_lstm_0[0][0]                  
                                                                 bi_lstm_0[0][1]                  
                                                                 bi_lstm_0[0][2]                  
                                                                 bi_lstm_0[0][3]                  
                                                                 bi_lstm_0[0][4]                  
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 256)          0           bidirectional[0][1]              
                                                                 bidirectional[0][3]              
__________________________________________________________________________________________________
attention (Attention)           [(None, 256), (None, 16481       bidirectional[0][0]              
                                                                 concatenate[0][0]                
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 1)            257         attention[0][0]                  
==================================================================================================
Total params: 706,146
Trainable params: 706,146
Non-trainable params: 0
__________________________________________________________________________________________________
None
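To confirm that the fixed graph also trains end to end, a quick smoke test on random data is enough (the optimizer, batch size and sample count below are arbitrary choices, not part of the original code):

import numpy as np

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# random integer sequences and binary labels, only to check that shapes flow through
x_dummy = np.random.randint(0, vocab_size, size=(64, max_len))
y_dummy = np.random.randint(0, 2, size=(64, 1))
model.fit(x_dummy, y_dummy, batch_size=16, epochs=1)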

Attention layers are now part of the Keras API in TensorFlow (2.1+). Note that the layer outputs a tensor of the same shape as your "query" tensor.

This is how to use Luong-style attention:

query_attention = tf.keras.layers.Attention()([query, value])

And Bahdanau-style attention:

query_attention = tf.keras.layers.AdditiveAttention()([query, value])

The adapted version for this model (the hidden state needs a time axis to act as the 'query', and the layer returns the attention output rather than the weights):

query = tf.keras.layers.Reshape((1, -1))(state_h)
context_vector = tf.keras.layers.Attention()([query, lstm])

Check the TensorFlow documentation for more information: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention https://www.tensorflow.org/api_docs/python/tf/keras/layers/AdditiveAttention
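Putting the built-in layer into a model shaped like the one in the question, a minimal self-contained sketch could look like this (using a single BiLSTM and the Reshape/Flatten wiring; these are my simplifications, not code from the original tutorial):

import tensorflow as tf

max_len, vocab_size, rnn_cell_size = 200, 250, 128

sequence_input = tf.keras.layers.Input(shape=(max_len,), dtype='int32')
embedded = tf.keras.layers.Embedding(vocab_size, 128)(sequence_input)

# Bidirectional LSTM that returns both the full sequence and the final states
lstm, forward_h, forward_c, backward_h, backward_c = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(rnn_cell_size, return_sequences=True, return_state=True))(embedded)
state_h = tf.keras.layers.Concatenate()([forward_h, backward_h])       # (batch, 256)

# query must be rank 3: (batch, 1, units); values are all encoder outputs: (batch, max_len, units)
query = tf.keras.layers.Reshape((1, 2 * rnn_cell_size))(state_h)
context_vector = tf.keras.layers.Attention()([query, lstm])            # (batch, 1, 256), same shape as the query
context_vector = tf.keras.layers.Flatten()(context_vector)             # (batch, 256)

output = tf.keras.layers.Dense(1, activation='sigmoid')(context_vector)
model = tf.keras.Model(sequence_input, output)
model.summary()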

To answer Arman's specific question: these layers use the post-2018 semantics of queries, values and keys. To map the semantics back to Bahdanau's or Luong's papers, you can consider the 'query' to be the last decoder hidden state, and the 'values' to be the set of encoder outputs, i.e. all the hidden states of the encoder. The 'query' 'attends' to all the 'values'.

Whichever code or library you are using, note that the 'query' is expanded over the time axis to prepare it for the addition that follows. The tensor being expanded will always be the last hidden state of the RNN, and the other tensor will always be the values to be attended to, i.e. all the hidden states on the encoder side. This simple check of the code tells you what 'query' and 'values' map to, regardless of the library or code you are using.
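As a concrete example of that check, with random tensors standing in for the model's activations (the shapes match the question's model: 200 timesteps, 256 bidirectional units):

import tensorflow as tf

features = tf.random.normal((4, 200, 256))        # all encoder hidden states ('values')
hidden = tf.random.normal((4, 256))                # last hidden state ('query')

hidden_with_time_axis = tf.expand_dims(hidden, 1)  # (4, 1, 256): expanded over the time axis
score_input = features + hidden_with_time_axis     # broadcasts to (4, 200, 256) for the additive score
print(hidden_with_time_axis.shape, score_input.shape)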

You can refer to https://towardsdatascience.com/create-your-own-custom-attention-layer-understand-all-flavours-2201b5e8be9e to write your own custom attention layer in fewer than six lines of code.
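For reference, a Bahdanau-style layer in that spirit is just a small tf.keras.layers.Layer subclass. The sketch below mirrors the class from the question rather than the code in that article, and the name SmallAttention is only illustrative:

class SmallAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # additive score over the time axis, then a weighted sum of the features
        score = tf.nn.tanh(self.W1(features) + self.W2(tf.expand_dims(hidden, 1)))
        weights = tf.nn.softmax(self.V(score), axis=1)
        return tf.reduce_sum(weights * features, axis=1), weights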
