How to deal with large vocab_size when training a Language Model in Keras?
I want to train a language model in Keras, following this tutorial: https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/
My input has these properties:
- Number of lines: 4823744
- Maximum line length: 20
- Vocabulary Size: 790609
- Total Sequences: 2172328
- Max Sequence Length: 11
As you can see from these lines:
from tensorflow.keras.preprocessing.text import Tokenizer

num_words = 50
tokenizer = Tokenizer(num_words=num_words, lower=True)
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
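(Note for readers: in Keras, `num_words` only filters during `texts_to_sequences`; `tokenizer.word_index` still contains every word seen, which is why `vocab_size` comes out as 790K.) A minimal pure-Python sketch of the intended top-`num_words` capping, using a hypothetical `build_vocab` helper rather than the Keras API:

```python
from collections import Counter

def build_vocab(lines, num_words):
    # Count word frequencies over all lines (like Tokenizer.fit_on_texts).
    counts = Counter(w for line in lines for w in line.lower().split())
    # Keep only the most frequent words; index 0 is reserved for padding,
    # as in Keras, so real word indices start at 1.
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(num_words - 1))}

lines = ["the cat sat", "the dog sat", "a bird flew"]
word_index = build_vocab(lines, num_words=4)
vocab_size = len(word_index) + 1  # 4, never the full corpus vocabulary
```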
I'm using the tokenizer with num_words=50, but vocab_size is taken from tokenizer.word_index, so it is still the full size (790K).
Therefore this line:
y = to_categorical(y, num_classes=vocab_size)
causes a memory error.
This is the model definition:
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
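One way to avoid the `to_categorical` memory blow-up entirely is `sparse_categorical_crossentropy`, which takes integer word indices as targets instead of one-hot rows. A sketch under the assumption of a capped 10K vocabulary (layer sizes mirror the question's model):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 10000      # capped vocabulary, not the full 790K
seq_length = 10         # max_length - 1 in the question

model = Sequential()
model.add(Embedding(vocab_size, 10))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
# y stays a 1-D array of integer word ids; no to_categorical call,
# so no (num_sequences x vocab_size) one-hot matrix in memory.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

probs = model.predict(np.zeros((2, seq_length), dtype='int32'), verbose=0)
```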
How can I deal with this?
I do want a word-level model, not a character-level one. And I do want to keep at least the 10K most common words.
I thought about filtering words beforehand, but that may cause the language model to learn false sequences.
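(One standard workaround: rather than dropping rare words, map them to a single out-of-vocabulary token so every n-gram window keeps its true shape; Keras's `Tokenizer(oov_token=...)` does this. The idea itself is a few lines, shown here with a hypothetical `cap_vocabulary` helper:)

```python
from collections import Counter

UNK = "<unk>"

def cap_vocabulary(lines, num_words):
    # Replace every word outside the top-num_words list with <unk>
    # instead of deleting it, so word positions (and thus the n-gram
    # sequences the model trains on) stay intact.
    counts = Counter(w for line in lines for w in line.split())
    keep = {w for w, _ in counts.most_common(num_words)}
    return [[w if w in keep else UNK for w in line.split()] for line in lines]

lines = ["the cat sat on the mat", "the aardvark sat"]
capped = cap_vocabulary(lines, num_words=3)
```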
How can I solve this?
Thanks
FastText is a better way to compute embeddings for large vocabularies: it does not need a dictionary entry for every word.
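(What makes this work: fastText represents a word as a bag of character n-grams plus the word itself, so a vector exists even for words outside any fixed dictionary. A sketch of just the subword-extraction step; the real library sums learned vectors for these n-grams:)

```python
def char_ngrams(word, n_min=3, n_max=6):
    # fastText wraps the word in boundary markers so prefixes and
    # suffixes get distinct n-grams, then slides a window of each size.
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, min(n_max, len(w)) + 1)
            for i in range(len(w) - n + 1)]

grams = char_ngrams("cat")
```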