
How to deal with large vocab_size when training a Language Model in Keras?

I want to train a language model in Keras, following this tutorial: https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/

My input is composed of:

lines num: 4823744
maximum line: 20
Vocabulary Size: 790609
Total Sequences: 2172328
Max Sequence Length: 11

As you can see from these lines:

from keras.preprocessing.text import Tokenizer

num_words = 50
tokenizer = Tokenizer(num_words=num_words, lower=True)
tokenizer.fit_on_texts([data])
# determine the vocabulary size
# note: word_index contains every word seen during fitting,
# not just the top num_words
vocab_size = len(tokenizer.word_index) + 1

I'm using the tokenizer with num_words=50, but vocab_size is taken from tokenizer.word_index, which still has the full size (790K). num_words only limits the indices that texts_to_sequences emits; it does not truncate word_index.

Therefore this line:

# with vocab_size of ~790K this allocates an enormous one-hot matrix
y = to_categorical(y, num_classes=vocab_size)

causes a memory error.
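A common workaround here (a minimal sketch, not from the tutorial) is to derive vocab_size from num_words instead of from word_index, so the one-hot targets and the output layer match the truncated vocabulary:

# texts_to_sequences only ever emits indices below num_words,
# so the label space can safely be capped at num_words
vocab_size = min(num_words, len(tokenizer.word_index) + 1)
y = to_categorical(y, num_classes=vocab_size)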

This is the model definition:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
# output layer: one softmax unit per word, so it scales with vocab_size
model.add(Dense(vocab_size, activation='softmax'))
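If the full vocabulary has to stay, one way to sidestep the one-hot blow-up entirely (a sketch, assuming X holds the input sequences; this is not part of the original tutorial) is to train with sparse_categorical_crossentropy, which takes integer word indices as targets:

# keep y as integer word indices instead of one-hot vectors;
# sparse_categorical_crossentropy computes the same loss without
# materializing a (num_sequences, vocab_size) matrix in memory
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, verbose=2)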

How can I deal with it?

I do want a word-level model, not a character-level one, and I want to keep at least the 10K most common words.

I thought about filtering words beforehand, but that may cause the language model to learn false sequences, since dropping words splices together tokens that were never adjacent. One middle ground is sketched below.
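A sketch of that middle ground, using the Tokenizer's standard oov_token argument with the 10K size mentioned above: every rare word is replaced in place by a single unknown token rather than dropped, which preserves the word order of the training sequences.

from keras.preprocessing.text import Tokenizer

# keep the 10,000 most frequent words; any rarer word is mapped to
# the '<unk>' index instead of being silently removed, so neighboring
# words are not spliced into sequences that never occurred
tokenizer = Tokenizer(num_words=10000, oov_token='<unk>', lower=True)
tokenizer.fit_on_texts([data])
sequences = tokenizer.texts_to_sequences([data])
vocab_size = min(10000, len(tokenizer.word_index) + 1)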

How can I solve it?

Thanks

Fasttext is a better way to compute embeddings for large vocabularies: it does not need a dictionary entry for every word, because it builds word vectors from character n-grams.
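A minimal sketch of that idea, assuming the gensim library and a sentences iterable of tokenized lines (neither appears in the original answer):

from gensim.models import FastText

# train subword-aware embeddings; because vectors are composed from
# character n-grams, even words outside the training vocabulary get
# a usable embedding instead of a KeyError
ft = FastText(sentences=sentences, vector_size=100, window=5,
              min_count=5, epochs=5)

vec = ft.wv['keras']           # in-vocabulary lookup
oov = ft.wv['kerasification']  # composed from n-grams, no dictionary entry

The resulting vectors could seed the Embedding layer's weights; the softmax output layer still needs one of the tricks above (a capped vocabulary or a sparse loss).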
