How to deal with large vocab_size when training a Language Model in Keras?
I want to train a language model in Keras, following this tutorial: https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/
My input has these properties:
- Number of lines: 4823744
- Maximum line length: 20
- Vocabulary Size: 790609
- Total Sequences: 2172328
- Max Sequence Length: 11
As you can see from these lines:
from tensorflow.keras.preprocessing.text import Tokenizer

num_words = 50
tokenizer = Tokenizer(num_words=num_words, lower=True)
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
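(Note for readers: in Keras, `num_words` only filters during `texts_to_sequences`; `tokenizer.word_index` still contains every word seen, which is why `vocab_size` comes out as 790K.) A minimal pure-Python sketch of the intended top-`num_words` capping, using a hypothetical `build_vocab` helper rather than the Keras API:

```python
from collections import Counter

def build_vocab(lines, num_words):
    # Count word frequencies over all lines (like Tokenizer.fit_on_texts).
    counts = Counter(w for line in lines for w in line.lower().split())
    # Keep only the most frequent words; index 0 is reserved for padding,
    # as in Keras, so real word indices start at 1.
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(num_words - 1))}

lines = ["the cat sat", "the dog sat", "a bird flew"]
word_index = build_vocab(lines, num_words=4)
vocab_size = len(word_index) + 1  # 4, never the full corpus vocabulary
```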
I'm using the tokenizer with num_words=50, but vocab_size is taken from tokenizer.word_index, so it is still the full size (790K).
Therefore this line:
y = to_categorical(y, num_classes=vocab_size)
causes a memory error.
This is the model definition:
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
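One way to avoid the `to_categorical` memory blow-up entirely is `sparse_categorical_crossentropy`, which takes integer word indices as targets instead of one-hot rows. A sketch under the assumption of a capped 10K vocabulary (layer sizes mirror the question's model):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 10000      # capped vocabulary, not the full 790K
seq_length = 10         # max_length - 1 in the question

model = Sequential()
model.add(Embedding(vocab_size, 10))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
# y stays a 1-D array of integer word ids; no to_categorical call,
# so no (num_sequences x vocab_size) one-hot matrix in memory.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

probs = model.predict(np.zeros((2, seq_length), dtype='int32'), verbose=0)
```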
How can I deal with this?
I do want a word-level model, not a character-level one. And I do want to keep at least the 10K most common words.
I thought about filtering words beforehand, but that may cause the language model to learn false sequences.
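(One standard workaround: rather than dropping rare words, map them to a single out-of-vocabulary token so every n-gram window keeps its true shape; Keras's `Tokenizer(oov_token=...)` does this. The idea itself is a few lines, shown here with a hypothetical `cap_vocabulary` helper:)

```python
from collections import Counter

UNK = "<unk>"

def cap_vocabulary(lines, num_words):
    # Replace every word outside the top-num_words list with <unk>
    # instead of deleting it, so word positions (and thus the n-gram
    # sequences the model trains on) stay intact.
    counts = Counter(w for line in lines for w in line.split())
    keep = {w for w, _ in counts.most_common(num_words)}
    return [[w if w in keep else UNK for w in line.split()] for line in lines]

lines = ["the cat sat on the mat", "the aardvark sat"]
capped = cap_vocabulary(lines, num_words=3)
```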
How can I solve this?
Thanks
FastText is a better way to compute embeddings for large vocabularies: it does not need a dictionary entry for every word.
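(What makes this work: fastText represents a word as a bag of character n-grams plus the word itself, so a vector exists even for words outside any fixed dictionary. A sketch of just the subword-extraction step; the real library sums learned vectors for these n-grams:)

```python
def char_ngrams(word, n_min=3, n_max=6):
    # fastText wraps the word in boundary markers so prefixes and
    # suffixes get distinct n-grams, then slides a window of each size.
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, min(n_max, len(w)) + 1)
            for i in range(len(w) - n + 1)]

grams = char_ngrams("cat")
```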