
Regarding Text Autoencoders in Keras for topic modeling

Introduction: I have trained autoencoders (vanilla and variational) in Keras on MNIST images and observed how well the latent representations in the bottleneck layer cluster them.

Goal: I would like to do the same for short texts, tweets in particular! I want to cluster them by their semantics using pre-trained GloVe embeddings.

My plan is to build a CNN encoder and a CNN decoder first, and to move on to LSTM/GRU later.

Question: What should the correct loss be, and how do I implement it in Keras?

This is what my Keras model looks like:

INPUT_TWEET (word indices) >> EMBEDDING LAYER >> CNN_ENCODER >> BOTTLENECK >> CNN_DECODER >> OUTPUT_TWEET (word indices)

Layer (type)                 Output Shape              Param #   
-----------------------------------------------------------------
Input_Layer (InputLayer)     (None, 64)                0         
embedding_1 (Embedding)      (None, 64, 200)           3299400   
enc_DO_0_layer (Dropout)     (None, 64, 200)           0         
enc_C_1 (Conv1D)             (None, 64, 16)            9616      
enc_MP_1 (MaxPooling1D)      (None, 32, 16)            0         
enc_C_2 (Conv1D)             (None, 32, 8)             392       
enc_MP_2 (MaxPooling1D)      (None, 16, 8)             0         
enc_C_3 (Conv1D)             (None, 16, 8)             200       
enc_MP_3 (MaxPooling1D)      (None, 8, 8)              0         
***bottleneck (Flatten)***   (None, 64)                0         
reshape_2 (Reshape)          (None, 8, 8)              0         
dec_C_1 (Conv1D)             (None, 8, 8)              200       
dec_UpS_1 (UpSampling1D)     (None, 16, 8)             0         
dec_C_2 (Conv1D)             (None, 16, 8)             200       
dec_UpS_2 (UpSampling1D)     (None, 32, 8)             0         
dec_C_3 (Conv1D)             (None, 32, 16)            400       
dec_UpS_3 (UpSampling1D)     (None, 64, 16)            0         
conv1d_2 (Conv1D)            (None, 64, 200)           9800      
dense_2 (Dense)              (None, 64, 1)             201       
flatten_2 (Flatten)          (None, 64)                0         
-----------------------------------------------------------------

This is obviously wrong, because it tries to minimize the MSE loss between the input and output word indices; I think the loss should instead be computed in the embedding space, i.e. between the output of embedding_1 and the output of conv1d_2.
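(Not code from the original question, just a minimal sketch of that idea: reuse sequence_input, decoded and tweet_word_indexes from the code further down, and assume the GloVe weights are available as a NumPy array embedding_matrix of shape (vocab_size, 200). The model is then trained to reconstruct the embedded sequences rather than the index arrays.)

import numpy as np
from keras.models import Model

# Targets in embedding space: look up the GloVe vector for every word index.
embedded_targets = embedding_matrix[tweet_word_indexes]   # (n_tweets, 64, 200)

# Cut the model off at the (64, 200) `decoded` layer and regress the embeddings.
emb_autoencoder = Model(sequence_input, decoded)
emb_autoencoder.compile(optimizer='adam', loss='mean_squared_error')
emb_autoencoder.fit(tweet_word_indexes, embedded_targets,
                    epochs=10, batch_size=128, validation_split=0.2)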

Now, how do I do that properly? Does it make sense? Is there a way to do this in Keras? Please check my code below:

The code:

# Encoder: word indices -> GloVe embeddings -> stacked Conv1D / MaxPooling1D.
# MAX_SEQUENCE_LENGTH is 64 and embedding_layer is the pre-trained GloVe
# Embedding layer, both defined earlier.
from keras.layers import Input, Dropout, Conv1D, MaxPooling1D, Flatten, Reshape, UpSampling1D, Dense
from keras.models import Model

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32',name="Input_Layer")
embedded_sequences = embedding_layer(sequence_input)
embedded_sequences1 = Dropout(0.5, name="enc_DO_0_layer")(embedded_sequences)

x = Conv1D(filters=16, kernel_size=3, activation='relu', padding='same',name="enc_C_1")(embedded_sequences1)
x = MaxPooling1D(pool_size=2, padding='same',name='enc_MP_1')(x)
x = Conv1D(filters=8, kernel_size=3, activation='relu', padding='same',name="enc_C_2")(x)
x = MaxPooling1D(pool_size=2, padding='same',name="enc_MP_2")(x)
x = Conv1D(filters=8, kernel_size=3, activation='relu', padding='same',name="enc_C_3")(x)
x = MaxPooling1D(pool_size=2, padding='same',name="enc_MP_3")(x)

encoded = Flatten(name="bottleneck")(x)   # 64-dim latent representation
x = Reshape((8, 8))(encoded)

x = Conv1D(filters=8, kernel_size=3, activation='relu', padding='same',name="dec_C_1")(x)
x = UpSampling1D(2,name="dec_UpS_1")(x)
x = Conv1D(8, 3, activation='relu', padding='same',name="dec_C_2")(x)
x = UpSampling1D(2,name="dec_UpS_2")(x)
x = Conv1D(16, 3, activation='relu',padding='same',name="dec_C_3")(x)
x = UpSampling1D(2,name="dec_UpS_3")(x)
decoded = Conv1D(200, 3, activation='relu', padding='same')(x)  # back to (64, 200)
y = Dense(1)(decoded)   # collapse each position to a single value (the word index)
y = Flatten()(y)

autoencoder = Model(sequence_input, y)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

autoencoder.fit(x = tweet_word_indexes ,y = tweet_word_indexes,
            epochs=10,
            batch_size=128,
            validation_split=0.2)

This is not what I want it to do:

Obviously it just tries to reconstruct the (zero-padded) array of word indices, because of the bad choice of loss.

Input  = [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1641 13 2 309 932 1 10 5 6]  
Output = [ -0.31552997 -0.53272009 -0.60824025 -1.14802313 -1.14597917 -1.08642125 -1.10040164 -1.19442761 -1.19560885 -1.19008029 -1.19456315 -1.2288748 -1.22721946 -1.20107424 -1.20624077 -1.24017036 -1.24014354 -1.2400831 -1.24004364 -1.23963416 -1.23968709 -1.24039733 -1.24027216 -1.23946059 -1.23946059 -1.23946059 -1.23946059 -1.23946059 -1.23946059 -1.23946059 -1.23946059 -1.23946059 -1.23946059 -1.14516866 -1.20557368 -1.5288837 -1.48179781 -1.05906188 -1.17691648 -1.94568193 -1.85741842 -1.30418646 -0.83358657 -1.61638248 -1.17812908 0.53077424 0.79578459 -0.40937367 0.35088596 1.29912627 -5.49394751 -27.1003418 -1.06875408 33.78763962 109.41391754 242.43798828 251.05577087 300.13430786 267.90420532 178.17596436 132.06596375 60.63394928 82.10819244 91.18526459]

Question: Does this make sense to you? What should the correct loss be, and how do I implement it in Keras?

First, you should not try to get indices at the end of the model (indices are not differentiable and do not follow a logical, continuous path).

You should probably end the model with one-hot encoded words instead, and then use 'softmax' together with 'categorical_crossentropy'. (Not sure whether that is the best solution, though.)

The last layer should be a Dense(dicWordCount). You can then convert the indices into one-hot vectors and pass those as the output:

import numpy as np

# inputIndices has shape (tweetCount, length); it's your array of word indices.
oneHotOutput = np.zeros((tweetCount, length, dicWordCount))
auxIndices = np.arange(length)

for i, tweet in zip(range(tweetCount), inputIndices):
    oneHotOutput[i][auxIndices, tweet] = 1

Create the model, then: model.fit(inputIndices, oneHotOutput, ...)
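(For concreteness, a hedged sketch of what that model tail could look like, reusing x, the output of dec_UpS_3, and sequence_input from the question's code; dicWordCount is the vocabulary size as above.)

from keras.layers import Conv1D, Dense
from keras.models import Model

h = Conv1D(200, 3, activation='relu', padding='same')(x)   # (None, 64, 200)
probs = Dense(dicWordCount, activation='softmax')(h)       # (None, 64, dicWordCount)

autoencoder = Model(sequence_input, probs)
autoencoder.compile(optimizer='adam', loss='categorical_crossentropy')
autoencoder.fit(inputIndices, oneHotOutput,
                epochs=10, batch_size=128, validation_split=0.2)

A Dense layer applied to a 3D tensor acts on the last axis, so this predicts a distribution over the vocabulary for each of the 64 positions.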
