How to implement hierarchical Transformer for document classification in Keras?

A hierarchical attention mechanism for document classification was presented by Yang et al.: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf

Its implementation is available at https://github.com/ShawnyXiao/TextClassification-Keras

An example of document classification with a Transformer is also available at https://keras.io/examples/nlp/text_classification_with_transformer

However, it is not hierarchical.

I have searched a lot but could not find any implementation of a hierarchical Transformer. Does anyone know how to implement a hierarchical Transformer for document classification in Keras?

My implementation is as follows. Note that it extends Nandan's document classification example: https://keras.io/examples/nlp/text_classification_with_transformer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical


class MultiHeadSelfAttention(layers.Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = layers.Dense(embed_dim)
        self.key_dense = layers.Dense(embed_dim)
        self.value_dense = layers.Dense(embed_dim)
        self.combine_heads = layers.Dense(embed_dim)

    def attention(self, query, key, value):
        score = tf.matmul(query, key, transpose_b=True)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        # inputs.shape = (batch_size, seq_len, embed_dim)
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)  # (batch_size, seq_len, embed_dim)
        key = self.key_dense(inputs)  # (batch_size, seq_len, embed_dim)
        value = self.value_dense(inputs)  # (batch_size, seq_len, embed_dim)
        query = self.separate_heads(
            query, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        key = self.separate_heads(
            key, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        value = self.separate_heads(
            value, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        attention, weights = self.attention(query, key, value)
        attention = tf.transpose(
            attention, perm=[0, 2, 1, 3]
        )  # (batch_size, seq_len, num_heads, projection_dim)
        concat_attention = tf.reshape(
            attention, (batch_size, -1, self.embed_dim)
        )  # (batch_size, seq_len, embed_dim)
        output = self.combine_heads(
            concat_attention
        )  # (batch_size, seq_len, embed_dim)
        return output

    def compute_output_shape(self, input_shape):
        # it does not change the shape of its input
        return input_shape


class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate, name=None):
        super(TransformerBlock, self).__init__(name=name)
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim), ]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

    def compute_output_shape(self, input_shape):
        # it does not change the shape of its input
        return input_shape


class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim, name=None):
        super(TokenAndPositionEmbedding, self).__init__(name=name)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

    def compute_output_shape(self, input_shape):
        # it changes the shape from (batch_size, maxlen) to (batch_size, maxlen, embed_dim)
        return input_shape + (self.pos_emb.output_dim,)



# Lower level (produce a representation of each sentence):

embed_dim = 100  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 64  # Hidden layer size in feed forward network inside transformer
L1_dense_units = 100  # Size of the sentence-level representations output by the word-level model
dropout_rate = 0.1
vocab_size = 1000
class_number = 5
max_docs = 10000
max_sentences = 15
max_words = 60

word_input = layers.Input(shape=(max_words,), name='word_input')
word_embedding = TokenAndPositionEmbedding(maxlen=max_words, vocab_size=vocab_size,
                                           embed_dim=embed_dim, name='word_embedding')(word_input)
word_transformer = TransformerBlock(embed_dim=embed_dim, num_heads=num_heads, ff_dim=ff_dim,
                                    dropout_rate=dropout_rate, name='word_transformer')(word_embedding)
word_pool = layers.GlobalAveragePooling1D(name='word_pooling')(word_transformer)
word_drop = layers.Dropout(dropout_rate, name='word_drop')(word_pool)
word_dense = layers.Dense(L1_dense_units, activation="relu", name='word_dense')(word_drop)
word_encoder = keras.Model(word_input, word_dense)

word_encoder.summary()

# =========================================================================
# Upper level (produce a representation of each document):

L2_dense_units = 100

sentence_input = layers.Input(shape=(max_sentences, max_words), name='sentence_input')

sentence_encoder = tf.keras.layers.TimeDistributed(word_encoder, name='sentence_encoder')(sentence_input)

sentence_transformer = TransformerBlock(embed_dim=L1_dense_units, num_heads=num_heads, ff_dim=ff_dim,
                               dropout_rate=dropout_rate, name='sentence_transformer')(sentence_encoder)
sentence_pool = layers.GlobalAveragePooling1D(name='sentence_pooling')(sentence_transformer)
sentence_out = layers.Dropout(dropout_rate)(sentence_pool)
preds = layers.Dense(class_number, activation='softmax', name='sentence_output')(sentence_out)

model = keras.Model(sentence_input, preds)
model.summary()

The summary of the model is as follows:

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 word_input (InputLayer)     [(None, 60)]              0         
                                                                 
 word_embedding (TokenAndPos  (None, 60, 100)          106000    
 itionEmbedding)                                                 
                                                                 
 word_transformer (Transform  (None, 60, 100)          53764     
 erBlock)                                                        
                                                                 
 word_pooling (GlobalAverage  (None, 100)              0         
 Pooling1D)                                                      
                                                                 
 word_drop (Dropout)         (None, 100)               0         
                                                                 
 word_dense (Dense)          (None, 100)               10100     
                                                                 
=================================================================
Total params: 169,864
Trainable params: 169,864
Non-trainable params: 0
_________________________________________________________________
Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 sentence_input (InputLayer)  [(None, 15, 60)]         0         
                                                                 
 sentence_encoder (TimeDistr  (None, 15, 100)          169864    
 ibuted)                                                         
                                                                 
 sentence_transformer (Trans  (None, 15, 100)          53764     
 formerBlock)                                                    
                                                                 
 sentence_pooling (GlobalAve  (None, 100)              0         
 ragePooling1D)                                                  
                                                                 
 dropout_9 (Dropout)         (None, 100)               0         
                                                                 
 sentence_output (Dense)     (None, 5)                 505       
                                                                 
=================================================================
Total params: 224,133
Trainable params: 224,133
Non-trainable params: 0
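
For reference, here is a minimal way to compile and train the model on random dummy data (just a sketch to show the expected input and label shapes, and where the to_categorical import above comes in):

# Minimal training sketch with random dummy data (shapes only, not real data).
import numpy as np

X = np.random.randint(0, vocab_size, size=(max_docs, max_sentences, max_words))
y = to_categorical(np.random.randint(0, class_number, size=(max_docs,)), num_classes=class_number)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, batch_size=32, epochs=1, validation_split=0.1)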

Everything works, and you can copy and paste this code into Colab to see the model summaries. But my problem is positional encoding at the sentence level: how do I apply positional encoding at the sentence level?
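
One direction I have considered (just a sketch, not verified) is to add a learned position embedding over the sentence axis before the sentence-level Transformer, analogous to TokenAndPositionEmbedding but without the token embedding. The SentencePositionEmbedding layer below is my own hypothetical helper, not part of the Keras example:

class SentencePositionEmbedding(layers.Layer):
    # Hypothetical helper (sketch only): adds a learned position embedding over the
    # sentence axis of an already-encoded input of shape (batch, max_sentences, dim).
    def __init__(self, maxlen, embed_dim, name=None):
        super(SentencePositionEmbedding, self).__init__(name=name)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        return x + self.pos_emb(positions)


# The sentence-level Transformer would then consume the position-augmented encodings:
sentence_position = SentencePositionEmbedding(maxlen=max_sentences, embed_dim=L1_dense_units,
                                              name='sentence_position')(sentence_encoder)
sentence_transformer = TransformerBlock(embed_dim=L1_dense_units, num_heads=num_heads, ff_dim=ff_dim,
                                        dropout_rate=dropout_rate,
                                        name='sentence_transformer')(sentence_position)

Would this be a reasonable way to apply positional encoding at the sentence level?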

The implementation is recursive in the sense that you treat the average of the outputs of transformer x as the input to transformer x+1.

So let's say your data is structured as (batch, chapter, paragraph, sentence, token).

After the first transformation you end up with per-token embeddings, (batch, chapter, paragraph, sentence, token, embedding); averaging over the token axis then gives (batch, chapter, paragraph, sentence_embedding_in).

Apply another transformation (over the sentence axis) and get (batch, chapter, paragraph, sentence_embedding_out).

Average again and get (batch, chapter, paragraph_embedding). Rinse and repeat.
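
Concretely, each level can be packaged as a small Keras model that contextualizes a sequence and then averages it, and the level above applies that model to every one of its units (e.g. via TimeDistributed, as in your code). A minimal sketch reusing the TransformerBlock from your question; the function name and dimensions are illustrative:

def make_level_encoder(seq_len, dim, name):
    # Sketch only: contextualize a sequence of `seq_len` units of size `dim`
    # with a TransformerBlock, then average them into one `dim`-sized vector
    # that becomes a single input unit for the next level up.
    inp = layers.Input(shape=(seq_len, dim))
    x = TransformerBlock(embed_dim=dim, num_heads=num_heads, ff_dim=ff_dim,
                         dropout_rate=dropout_rate)(inp)
    out = layers.GlobalAveragePooling1D()(x)
    return keras.Model(inp, out, name=name)

# The word-level encoder is applied per sentence, the sentence-level encoder per
# paragraph, and so on; each extra level of nesting (paragraphs, chapters) wraps
# the encoder below it in another TimeDistributed layer.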

The implementation of the paper is actually in a different repository: https://github.com/ematvey/hierarchical-attention-networks

They actually do something different from what I have described: they apply Transformers at the bottom and an RNN at the top. In theory you could do the opposite, or apply an RNN at each layer (which would be really slow). As far as the implementation is concerned, you can abstract away from that detail; the principle remains the same: you apply a transformation, average the outputs, and feed the result into the next higher-level "layer" (or "module", in torch lingo).
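
For instance, keeping the word-level Transformer from your model but putting an RNN on top would look roughly like this (a sketch, not the exact architecture of that repository):

# Sketch: Transformer at the word level (word_encoder as defined above),
# bidirectional GRU instead of a Transformer at the sentence level.
sentence_input = layers.Input(shape=(max_sentences, max_words), name='sentence_input')
sentence_encoder = layers.TimeDistributed(word_encoder, name='sentence_encoder')(sentence_input)
sentence_rnn = layers.Bidirectional(layers.GRU(L2_dense_units), name='sentence_rnn')(sentence_encoder)
sentence_out = layers.Dropout(dropout_rate)(sentence_rnn)
preds = layers.Dense(class_number, activation='softmax', name='sentence_output')(sentence_out)
rnn_model = keras.Model(sentence_input, preds)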
