Using pre-trained word embeddings in a keras model?
I was following this github code from the Keras team on how to use pre-trained word embeddings. I was able to understand most of it, but I have a doubt regarding vector sizes. I was hoping someone could help me out.
First we define Tokenizer(num_words=MAX_NUM_WORDS). According to the Keras docs for Tokenizer(), the num_words argument only considers MAX_NUM_WORDS - 1 words, so if MAX_NUM_WORDS=20000 I'll have around 19999 words.
num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words - 1 words will be kept.
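For reference, here is a minimal sketch of that num_words - 1 behaviour on a toy corpus (assuming a standalone Keras install; tensorflow.keras behaves the same):

from keras.preprocessing.text import Tokenizer

texts = ["the cat sat on the mat", "the dog sat on the log"]
t = Tokenizer(num_words=3)  # keeps only the most common 3 - 1 = 2 words
t.fit_on_texts(texts)

print(t.word_index)                 # the full index is built for every word
print(t.texts_to_sequences(texts))  # but only indices 1 and 2 survive here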
Next in the code we prepare an Embedding Matrix based on GloVe vectors. When doing that we consider a matrix of size (20001, 100):

np.zeros((MAX_NUM_WORDS+1, 100))

I couldn't get why we consider a matrix of 20001 rows if we have only 19999 words in our vocabulary.
Then we pass num_words to the Embedding layer. According to the Embedding layer docs for the input_dim argument:

input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
embedding_layer = Embedding(input_dim=num_words,
                            output_dim=EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
Here our vocabulary size will be 19999 according to the Tokenizer() function, right? So why are we passing 20001 as input_dim?
Here's a small snippet of the code taken from that github link.
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from keras.initializers import Constant

MAX_NUM_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100

# texts and embeddings_index are defined earlier in the example
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# prepare embedding matrix
num_words = MAX_NUM_WORDS + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words without a GloVe vector stay all-zeros
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
For the embedding, input_dim (num_words in the code below) is the size of the vocabulary. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary is 11 words. That is the reason 1 is added to the min of len(word_index) and MAX_NUM_WORDS.
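As a minimal sketch of that rule (with hypothetical toy values, assuming standalone Keras), indices 0 through 10 require input_dim=11:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# data is integer encoded with values 0..10, so input_dim = max index + 1 = 11
model = Sequential([Embedding(input_dim=11, output_dim=4, input_length=3)])
batch = np.array([[0, 5, 10]])     # the largest index, 10, is still in range
print(model.predict(batch).shape)  # (1, 3, 4): one 4-d vector per index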
The embedding matrix will have the dimensions of vocabulary size by vector length:

num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
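To make that +1 concrete, a quick sketch with a hypothetical tiny vocabulary: min() avoids allocating rows the corpus can never use, and the +1 reserves row 0, which the Tokenizer never assigns to any word:

MAX_NUM_WORDS = 20000
word_index = {'the': 1, 'earth': 2, 'is': 3}  # hypothetical 3-word vocabulary
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
print(num_words)  # 4 -> rows 0..3, with row 0 left all-zeros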
I have created a simple tokenizer to explain this.
t = Tokenizer(num_words=5)
fit_text = ["The earth is an awesome place live"]
t.fit_on_texts(fit_text)
word_index = t.word_index
print('word_index : ',word_index)
print('len word_index : ',len(t.word_index))
word_index : {'the': 1, 'earth': 2, 'is': 3, 'an': 4, 'awesome': 5, 'place': 6, 'live': 7}
len word_index : 7
In the case below, you cover a vocabulary of size 4 only, because Tokenizer indexing starts from 1.
embedding_matrix = np.zeros((5, 10))
embedding_matrix
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
for word, i in word_index.items():
    if i < 5:
        embedding_matrix[i] = [0,1,0,0,0,0,0,0,0,0]
print (embedding_matrix)
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
In the case below, you need to add 1 (5 + 1) to cover the vocabulary of size 5, so that index 0 is also covered.
embedding_matrix = np.zeros((6, 10))
for word, i in word_index.items():
    if i < 6:
        embedding_matrix[i] = [0,1,0,0,0,0,0,0,0,0]
print (embedding_matrix)
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
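To tie this back to the layer, a hedged sketch reusing the 6-row matrix from above (assuming standalone Keras): the matrix covers indices 0..5, so input_dim must be 6:

from keras.layers import Embedding
from keras.initializers import Constant

# 6 rows cover indices 0..5; row 0 is the reserved all-zeros row
embedding_layer = Embedding(input_dim=6,
                            output_dim=10,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)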
I think your doubt is valid. The change was made in this commit of the code to keep the word with index = MAX_NUM_WORDS. Before that, there was a commit on Tokenizer to make it keep num_words words instead of num_words - 1 words. But this change to Tokenizer was reverted afterwards. So I guess the author of the example update might have assumed that Tokenizer kept num_words words when the update was committed.