
How to have an LSTM autoencoder model predict over the whole vocabulary while representing words as embeddings

I have been working on an LSTM autoencoder model and have created several versions of it.

1. Create the model using pre-trained word embeddings: in this scenario, I used the weights of already-trained GloVe vectors as the weights of the features (text data). This is the structure:

inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
encoded = Lambda(rev_entropy)(encoded)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="sgd", loss='mse')
autoencoder.summary()
checkpoint = ModelCheckpoint(filepath='checkpoint/{epoch}.hdf5')
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, validation_data=test_gen, validation_steps=num_test_steps, callbacks=[checkpoint])
2. In the second scenario, I implemented the word-embedding layer inside the model itself:

This is the structure:

inputs = Input(shape=(SEQUENCE_LEN,), name="input")
embedding = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE, input_length=SEQUENCE_LEN, trainable=False)(inputs)
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedding)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = LSTM(EMBED_SIZE, return_sequences=True)(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="sgd", loss='categorical_crossentropy')
autoencoder.summary()
checkpoint = ModelCheckpoint(filepath=os.path.join('Data/', "simple_ae_to_compare"))
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, validation_data=test_gen, validation_steps=num_test_steps, callbacks=[checkpoint])
3. In the third scenario, I did not use any embedding technique but used one-hot encoding for the features. This is the structure of the model:

inputs = Input(shape=(SEQUENCE_LEN, VOCAB_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE, kernel_initializer="glorot_normal"), merge_mode="sum", name="encoder_lstm")(inputs)
encoded = Lambda(score_cooccurance, name='Modified_layer')(encoded)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = LSTM(VOCAB_SIZE, return_sequences=True)(decoded)
autoencoder = Model(inputs, decoded)
sgd = optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
autoencoder.compile(optimizer=sgd, loss='categorical_crossentropy')
autoencoder.summary()
checkpoint = ModelCheckpoint(filepath='checkpoint/50/{epoch}.hdf5')
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, callbacks=[checkpoint])

As you can see, in the first and second models EMBED_SIZE in the decoder is the number of neurons in that layer, which causes the output shape of the encoder layer to become [LATENT_SIZE, EMBED_SIZE].

In the third model, the output shape of the encoder is [LATENT_SIZE, VOCAB_SIZE].

Now my question:

Is it possible to change the structure of the model in such a way that I have an embedding representing my words to the model, and at the same time have VOCAB_SIZE in the decoder layer?

I need the output shape of the encoder layer to be [LATENT_SIZE, VOCAB_SIZE], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason.

I would appreciate it if you could share your ideas with me. One idea could be adding more layers; just keep in mind that, at any cost, I don't want to have EMBED_SIZE in the last layer.

Your questions:

Is it possible to change the structure of the model in such a way that I have an embedding representing my words to the model, and at the same time have VOCAB_SIZE in the decoder layer?

I like to use the TensorFlow Transformer model as a reference: https://github.com/tensorflow/models/tree/master/official/transformer

In language translation tasks the model input tends to be a token index, which is then put through an embedding lookup, resulting in a shape of (sequence_length, embedding_dims); the encoder itself works on this shape. The decoder output also tends to have the shape (sequence_length, embedding_dims). The model above, for instance, then transforms the decoder output into logits by taking a dot product between the output and the embedding vectors (a rough sketch of that projection is shown below). This is the transformation they use: https://github.com/tensorflow/models/blob/master/official/transformer/model/embedding_layer.py#L94
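Below is a minimal sketch of that projection, assuming a decoder output of shape (batch, sequence_length, embedding_dims) and an embedding matrix of shape (vocab_size, embedding_dims); the function name embedding_to_logits is hypothetical and not part of the linked code, it just mirrors the reshape-and-matmul trick used there.

import tensorflow as tf

def embedding_to_logits(decoder_output, embedding_matrix):
    # decoder_output:   (batch, sequence_length, embedding_dims)
    # embedding_matrix: (vocab_size, embedding_dims) -- the same weights used for the input lookup
    batch_size = tf.shape(decoder_output)[0]
    seq_len = tf.shape(decoder_output)[1]
    vocab_size = tf.shape(embedding_matrix)[0]
    embed_dims = tf.shape(decoder_output)[2]

    # Flatten to (batch * sequence_length, embedding_dims), take the dot product with
    # every embedding vector, then restore the sequence axis.
    flat = tf.reshape(decoder_output, [-1, embed_dims])
    logits = tf.matmul(flat, embedding_matrix, transpose_b=True)
    return tf.reshape(logits, [batch_size, seq_len, vocab_size])  # (batch, sequence_length, vocab_size)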

I would recommend an approach similar to the language translation models:

  • pre-stage:
    • input_shape=(sequence_length, 1), i.e. a token_index in [0, vocab_size)
  • encoder:
    • input_shape=(sequence_length, embedding_dims)
    • output_shape=(latent_dims)
  • decoder:
    • input_shape=(latent_dims)
    • output_shape=(sequence_length, embedding_dims)

Pre-processing converts token indexes into embedding_dims. This can be used to generate both the encoder input as well as the decoder targets.

Post-processing converts embedding_dims back to logits (in the vocab_index space). A minimal end-to-end sketch of this layout is shown below.
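Here is a minimal end-to-end sketch in Keras, assuming the constants SEQUENCE_LEN, VOCAB_SIZE, EMBED_SIZE and LATENT_SIZE from your question (the values below are placeholders) and using a softmax projection layer in place of the tied-embedding trick; it is only an illustration of the recommended shapes, not a drop-in replacement for your model.

from keras.layers import Input, Embedding, Bidirectional, LSTM, RepeatVector, TimeDistributed, Dense
from keras.models import Model

SEQUENCE_LEN, VOCAB_SIZE, EMBED_SIZE, LATENT_SIZE = 20, 5000, 100, 64  # placeholder sizes

# pre-stage: token indices -> embedding_dims
inputs = Input(shape=(SEQUENCE_LEN,), name="input")  # integer token indices
embedded = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE, name="embedding")(inputs)

# encoder: (sequence_length, embedding_dims) -> (latent_dims)
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedded)

# decoder: (latent_dims) -> (sequence_length, embedding_dims)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = LSTM(EMBED_SIZE, return_sequences=True, name="decoder_lstm")(decoded)

# post-processing: embedding_dims -> distribution over the vocabulary
outputs = TimeDistributed(Dense(VOCAB_SIZE, activation="softmax"), name="to_vocab")(decoded)

autoencoder = Model(inputs, outputs)
# A sparse loss lets the targets stay as integer token indices, so no one-hot encoding is needed.
autoencoder.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
autoencoder.summary()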

I need the output shape of the encoder layer to be [LATENT_SIZE, VOCAB_SIZE], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason.

That doesn't sound right. Typically what one is trying to achieve with an autoencoder is to have an embedding vector for the sentence, so the output of the encoder is typically [latent_dims]. The output of the decoder needs to be translatable into [sequence_length, vocab_index (1)], which is typically done by converting from embedding space to logits and then taking the argmax to convert to a token index, as in the snippet below.
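A hypothetical decoding step, assuming the autoencoder sketched above and a batch of integer token sequences called token_batch:

import numpy as np

probs = autoencoder.predict(token_batch)   # (batch_size, SEQUENCE_LEN, VOCAB_SIZE)
token_ids = np.argmax(probs, axis=-1)      # (batch_size, SEQUENCE_LEN) predicted token indices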
