
Using pre-trained word embeddings in a keras model?

I was following this github code from the Keras team on how to use pre-trained word embeddings. I was able to understand most of it, but I have a doubt regarding the vector sizes. I was hoping someone could help me out.

First we define Tokenizer(num_words=MAX_NUM_WORDS).

According to the Keras docs, the Tokenizer() num_words argument only keeps MAX_NUM_WORDS - 1 words, so if MAX_NUM_WORDS=20000 I'll have around 19999 words.

num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
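For example, a quick check with a toy corpus shows what the docs mean (a minimal sketch I put together, assuming the keras.preprocessing Tokenizer API used in that example; the word counts are made up):

from keras.preprocessing.text import Tokenizer

texts = ["a a a b b c"]                 # "a" is most frequent, then "b", then "c"
t = Tokenizer(num_words=3)              # keeps only the num_words - 1 = 2 most common words
t.fit_on_texts(texts)

print(t.word_index)                     # the full index is still built: {'a': 1, 'b': 2, 'c': 3}
print(t.texts_to_sequences(texts))      # but only indices 1 and 2 survive: [[1, 1, 1, 2, 2]]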

Next in the code we prepare an embedding matrix based on the GloVe vectors. When doing that, we are considering a matrix of size (20001, 100): np.zeros((MAX_NUM_WORDS+1, 100)). I couldn't get why we consider a matrix of 20001 rows if we have only 19999 words in our vocabulary.
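(For context, the embeddings_index used in the snippet further down is the word-to-vector dict that the example builds from the downloaded GloVe file; roughly something like this, where GLOVE_DIR is just a placeholder for wherever the vectors live:)

import os
import numpy as np

GLOVE_DIR = 'glove.6B'  # placeholder path to the downloaded GloVe files

# build a dict mapping each GloVe word to its 100-dimensional vector
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')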

Then we also pass num_words to the Embedding layer. The Embedding layer docs for the input_dim argument say:

input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.

embedding_layer = Embedding(input_dim=num_words,
                            output_dim=EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Here our vocabulary size will be 19999 according to the Tokenizer() function, right? So why are we passing 20001 as input_dim?

Here's a small snippet of the code taken from that github link.

MAX_NUM_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# prepare embedding matrix
num_words = MAX_NUM_WORDS + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
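    # skips only i > MAX_NUM_WORDS, so indices 1 .. MAX_NUM_WORDS (inclusive) get a row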
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

For the Embedding layer, input_dim (num_words in the code below) is the size of the vocabulary. For example, if your data is integer-encoded to values between 0 and 10, then the size of the vocabulary is 11 words. That is the reason 1 is added to the min of len(word_index) and MAX_NUM_WORDS.
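A minimal sketch of that point (my own toy example, not from the linked code, assuming the Keras 2-style API with input_length):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# data integer-encoded to values 0..10 -> vocabulary size 11,
# because valid indices for an Embedding layer are 0 .. input_dim - 1
model = Sequential([Embedding(input_dim=11, output_dim=4, input_length=3)])
sample = np.array([[0, 5, 10]])        # 10 is the largest index this layer can look up
print(model.predict(sample).shape)     # (1, 3, 4)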

The embedding matrix will have the dimensions (vocabulary size, vector length):

embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
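To make the arithmetic concrete (the counts here are purely illustrative):

MAX_NUM_WORDS = 20000

# small corpus: only 7 distinct words -> rows 0..7, i.e. 8 rows
num_words = min(MAX_NUM_WORDS, 7) + 1            # 8

# large corpus: say 300000 distinct words (made-up count) -> cap at MAX_NUM_WORDS,
# plus 1 for index 0, giving the (20001, EMBEDDING_DIM) matrix from the question
num_words = min(MAX_NUM_WORDS, 300000) + 1       # 20001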

I have created a simple tokenizer to explain this.

t  = Tokenizer(num_words=5)
fit_text = ["The earth is an awesome place live"]
t.fit_on_texts(fit_text)
word_index = t.word_index
print('word_index : ', word_index)
print('len word_index : ', len(t.word_index))
word_index :  {'the': 1, 'earth': 2, 'is': 3, 'an': 4, 'awesome': 5, 'place': 6, 'live': 7}
len word_index :  7

In the case below, you cover a vocabulary of only 4 words, because tokenizer indexing starts from 1 and index 0 is never assigned.

embedding_matrix = np.zeros((5, 10))
embedding_matrix
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

for word, i in word_index.items():
    if i < 5:       
        embedding_matrix[i] = [0,1,0,0,0,0,0,0,0,0]

print (embedding_matrix)
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]

In the case below, you need to add 1 (5 + 1) to cover a vocabulary of size 5, because index 0 must also be accounted for.

embedding_matrix = np.zeros((6, 10))
for word, i in word_index.items():
    if i < 6:       
        embedding_matrix[i] = [0,1,0,0,0,0,0,0,0,0]

print (embedding_matrix)

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]

I think your doubt is valid. The change was made in this commit of the code to keep the word with index = MAX_NUM_WORDS. Before that, there was a commit on Tokenizer to make it keep num_words words instead of num_words - 1 words. But that change to Tokenizer was reverted afterwards. So I guess the author of the example update might have assumed that Tokenizer kept num_words words when the update was committed.
