
Keras LSTM Language Model using Embeddings

I am building a language model using Keras.

Basically, my vocabulary size N is ~30,000. I have already trained word2vec on the corpus, so I use those embeddings, followed by an LSTM, and then I predict the next word with a fully connected layer followed by a softmax. My model is written as below:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation

EMBEDDING_DIM = 256

# Frozen embedding layer initialised with the pre-trained word2vec vectors
embedding_layer = Embedding(N, EMBEDDING_DIM, weights=[embeddings],
                            trainable=False)

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(EMBEDDING_DIM))      # returns only the last hidden state by default
model.add(Dense(N))                 # project to vocabulary size
model.add(Activation('softmax'))    # distribution over the next word

model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

I have two questions:

  1. In this case, can you confirm that we only use the last hidden state of the LSTM (which is fed into the fully connected layer and softmax), and that there is no max/mean-pooling over the successive hidden states of the LSTM (as is done for sentiment analysis here: http://deeplearning.net/tutorial/lstm.html)?

  2. What do you think of, instead of connecting the last hidden state of the LSTM to a big fully connected layer of size N (30,000), connecting it to a layer of size EMBEDDING_DIM and predicting the embedding of the next word rather than the word itself? In that case we would replace the loss by something like MSE, reducing training time and, mainly, "helping" the model, because the vocabulary is big and the embeddings could also be useful at the output end of the network.

Thanks!

I can only answer the first question with certainty:

Yes, the output of the LSTM layer is the last hidden state. It only returns all the hidden states if you pass it the parameter return_sequences=True, which is set to False by default.
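
For reference, here is a minimal sketch of what the mean-pooling alternative mentioned in the question could look like, reusing N, EMBEDDING_DIM and embeddings from the code above and using GlobalAveragePooling1D as the pooling step (this is one possible way to pool, not the asker's model):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation, GlobalAveragePooling1D

# Mean-pooling variant: keep every hidden state and average them over time,
# instead of feeding only the last hidden state to the softmax.
pooled = Sequential()
pooled.add(Embedding(N, EMBEDDING_DIM, weights=[embeddings], trainable=False))
pooled.add(LSTM(EMBEDDING_DIM, return_sequences=True))  # (batch, timesteps, EMBEDDING_DIM)
pooled.add(GlobalAveragePooling1D())                    # mean over the time axis
pooled.add(Dense(N))
pooled.add(Activation('softmax'))

pooled.compile(loss="categorical_crossentropy", optimizer="rmsprop")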

For the second question, I can only say that I have tried predicting the embedding of the next word instead of its one-hot representation, but it gave me bad results. Words are still categorical variables, even if we can somehow approximate them by a continuous representation. This is the reason people have put so much effort into developing hierarchical softmax.
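
For completeness, a hedged sketch of the regression variant discussed in the second question, assuming the training targets are the word2vec vectors of the next word and MSE is used as the loss (as noted above, in my experience this performed worse than a softmax over the vocabulary):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Regression head: predict the EMBEDDING_DIM-dimensional word2vec vector of the
# next word instead of a probability distribution over the N-word vocabulary.
reg_model = Sequential()
reg_model.add(Embedding(N, EMBEDDING_DIM, weights=[embeddings], trainable=False))
reg_model.add(LSTM(EMBEDDING_DIM))
reg_model.add(Dense(EMBEDDING_DIM))   # output lives in embedding space

# Targets would be embeddings[next_word_index]; at prediction time the output
# vector is mapped back to a word by nearest-neighbour search in the embedding table.
reg_model.compile(loss="mse", optimizer="rmsprop")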
