
How to properly use get_keras_embedding() in Gensim's Word2Vec?

I am trying to build a translation network using an embedding and an RNN. I have trained a Gensim Word2Vec model, and it learns word associations pretty well. However, I can't get my head around how to properly add that layer to a Keras model. (And how to do an 'inverse embedding' for the output, but that is another question that has already been answered: by default you can't.)

In Word2Vec, when you input a string, e.g. model.wv['hello'], you get a vector representation of the word. However, I believe that the keras.layers.Embedding layer returned by Word2Vec's get_keras_embedding() takes one-hot/tokenized input instead of string input, and the documentation provides no explanation of what the appropriate input is. I cannot figure out how to obtain the one-hot/tokenized indices of the vocabulary that are in 1-to-1 correspondence with the Embedding layer's input.
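To illustrate the mismatch, here is a minimal sketch of the two interfaces (assuming the 300-dimensional model from the code below; 'hello' is just a placeholder word assumed to be in the vocabulary):

vec = word2vec_model.wv['hello']   # numpy array of shape (300,), looked up by string
# The Keras layer, by contrast, maps integer word indices to vectors,
# so each word has to be converted to an index somehow before being fed in.
embedding_layer = word2vec_model.wv.get_keras_embedding(train_embeddings=False)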

More elaboration below:

Currently my workaround is to apply the embedding outside Keras before feeding the data to the network. Is there any detriment to doing this? I will set the embedding to non-trainable anyway. So far I have noticed that memory use is extremely inefficient (around 50 GB, even before declaring the Keras model, for a collection of 64-word-long sentences) because the padded inputs and the weights have to be loaded outside the model. Maybe a generator can help.
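If I keep that workaround, I imagine a generator that embeds one batch at a time could cut the memory footprint. A minimal sketch, assuming the Keras Sequence API, where sentences and targets are my own (hypothetical) tokenized inputs and pre-built target arrays:

import numpy as np
from keras.utils import Sequence

class EmbeddedBatchGenerator(Sequence):
    # Embeds one batch at a time instead of materialising every padded
    # 300-dimensional vector for the whole corpus up front.
    def __init__(self, sentences, targets, wv, batch_size=64, max_len=64):
        self.sentences = sentences   # tokenized source sentences (hypothetical)
        self.targets = targets       # matching target arrays (hypothetical)
        self.wv = wv                 # word2vec_model.wv
        self.batch_size = batch_size
        self.max_len = max_len

    def __len__(self):
        return int(np.ceil(len(self.sentences) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.sentences[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.zeros((len(batch), self.max_len, self.wv.vector_size))
        for i, sentence in enumerate(batch):
            for j, word in enumerate(sentence[:self.max_len]):
                if word in self.wv.vocab:
                    x[i, j] = self.wv[word]
        y = self.targets[idx * self.batch_size:(idx + 1) * self.batch_size]
        return x, y

# keras_model.fit_generator(EmbeddedBatchGenerator(sentences, targets, word2vec_model.wv), epochs=100)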

The following is my code. Inputs are padded to 64 words long. The Word2Vec embedding has 300 dimensions. There are probably a lot of mistakes here due to repeated experimentation trying to make the embedding work. Suggestions are welcome.

import gensim
word2vec_model = gensim.models.Word2Vec.load("word2vec.model")
from keras.models import Sequential
from keras.layers import Embedding, GRU, Input, Flatten, Dense, TimeDistributed, Activation, PReLU, RepeatVector, Bidirectional, Dropout
from keras.optimizers import Adam, Adadelta
from keras.callbacks import ModelCheckpoint
from keras.losses import sparse_categorical_crossentropy, mean_squared_error, cosine_proximity

keras_model = Sequential()
keras_model.add(word2vec_model.wv.get_keras_embedding(train_embeddings=False))
keras_model.add(Bidirectional(GRU(300, return_sequences=True, dropout=0.1, recurrent_dropout=0.1, activation='tanh')))
keras_model.add(TimeDistributed(Dense(600, activation='tanh')))
# keras_model.add(PReLU())
# ^ For some reason I get an error when I add Activation 'outside':
# int() argument must be a string, a bytes-like object or a number, not 'NoneType'
# But keras_model.add(Activation('relu')) works.
keras_model.add(Dense(source_arr.shape[1] * source_arr.shape[2]))
# size = max-output-sentence-length * embedding-dimensions to learn the embedding vector and find the nearest word in word2vec_model.wv.similar_by_vector() afterwards.
# Alternatively one can use Dense(vocab_size) and train the network to output one-hot categorical words instead.
# Remember to change Keras loss to sparse_categorical_crossentropy.
# But this won't benefit from Word2Vec.

keras_model.compile(loss=mean_squared_error,
              optimizer=Adadelta(),
              metrics=['mean_absolute_error'])
keras_model.summary()
_________________________________________________________________ 
Layer (type)                 Output Shape              Param #   
================================================================= 
embedding_19 (Embedding)     (None, None, 300)         8219700   
_________________________________________________________________ 
bidirectional_17 (Bidirectio (None, None, 600)         1081800   
_________________________________________________________________ 
activation_4 (Activation)    (None, None, 600)         0         
_________________________________________________________________ 
time_distributed_17 (TimeDis (None, None, 600)         360600    
_________________________________________________________________ 
dense_24 (Dense)             (None, None, 19200)       11539200  
================================================================= 
Total params: 21,201,300 
Trainable params: 12,981,600 
Non-trainable params: 8,219,700
_________________________________________________________________
filepath="best-weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [checkpoint]
keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)

Which throws an error when I try to fit the model with text:

Train on 8000 samples, validate on 2000 samples
Epoch 1/100

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-865f8b75fbc3> in <module>()
      2 checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
      3 callbacks_list = [checkpoint]
----> 4 keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)

~/virtualenv/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1040                                         initial_epoch=initial_epoch,
   1041                                         steps_per_epoch=steps_per_epoch,
-> 1042                                         validation_steps=validation_steps)
   1043 
   1044     def evaluate(self, x=None, y=None,

~/virtualenv/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    197                     ins_batch[i] = ins_batch[i].toarray()
    198 
--> 199                 outs = f(ins_batch)
    200                 if not isinstance(outs, list):
    201                     outs = [outs]

~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2659                 return self._legacy_call(inputs)
   2660 
-> 2661             return self._call(inputs)
   2662         else:
   2663             if py_any(is_tensor(x) for x in inputs):

~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2612                 array_vals.append(
   2613                     np.asarray(value,
-> 2614                                dtype=tensor.dtype.base_dtype.name))
   2615         if self.feed_dict:
   2616             for key in sorted(self.feed_dict.keys()):

~/virtualenv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    490 
    491     """
--> 492     return array(a, dtype, copy=False, order=order)
    493 
    494 

ValueError: could not convert string to float: 'hello'

The following is an excerpt from Rajmak demonstrating how to use a Tokenizer to convert words into the input of a Keras Embedding layer.

tokenizer = Tokenizer(num_words=MAX_NB_WORDS) 
tokenizer.fit_on_texts(all_texts) 
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
……
indices = np.arange(data.shape[0]) # get sequence of row index 
np.random.shuffle(indices) # shuffle the row indexes 
data = data[indices] # shuffle data/product-titles/x-axis
……
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0]) 
x_train = data[:-nb_validation_samples]
……
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

A Keras embedding layer can be obtained with Gensim Word2Vec's word2vec.get_keras_embedding(train_embeddings=False) method or constructed as shown below. The null word embeddings indicate the number of words not found in our pre-trained vectors (in this case Google News); these could be words unique to brands in this context.

from keras.layers import Embedding
word_index = tokenizer.word_index
nb_words = min(MAX_NB_WORDS, len(word_index))+1

embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

embedding_layer = Embedding(embedding_matrix.shape[0], # or len(word_index) + 1
                            embedding_matrix.shape[1], # or EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Flatten
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation

model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(300, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(150, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(75, 3, padding='valid',activation='relu',strides=2))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(150,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='sigmoid'))

model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])

model.summary()

model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
score = model.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Here the embedding_layer is explicitly created using:

for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)

However, if we use get_keras_embedding(), the embedding matrix is already constructed and fixed. I do not know how each word_index in the Tokenizer can be coerced to match the corresponding word in the input of get_keras_embedding()'s Keras embedding.

So, what is the proper way to use Word2Vec's get_keras_embedding() in Keras?

So I've found the solution. The tokenized index of a word can be found in word2vec_model.wv.vocab[word].index, and the converse can be obtained from word2vec_model.wv.index2word[word_index]. get_keras_embedding() takes the former as input.
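For example, the two mappings invert each other (a quick sanity check; 'hello' is just a placeholder word assumed to be in the vocabulary):

idx = word2vec_model.wv.vocab['hello'].index          # word -> integer index
assert word2vec_model.wv.index2word[idx] == 'hello'   # integer index -> word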

I do the conversion as follows:

import numpy

# Map each word to its Word2Vec vocabulary index, which is what the
# Embedding layer from get_keras_embedding() expects as input.
source_word_indices = []
for i in range(len(array_of_word_lists)):
    source_word_indices.append([])
    for j in range(len(array_of_word_lists[i])):
        word = array_of_word_lists[i][j]
        if word in word2vec_model.wv.vocab:
            word_index = word2vec_model.wv.vocab[word].index
            source_word_indices[i].append(word_index)
        else:
            # Do something. For example, leave it blank or replace with padding character's index.
            source_word_indices[i].append(padding_index)
source = numpy.array(source_word_indices)

Then finally: keras_model.fit(source, ...
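And for the output side mentioned in the model comments above, each predicted 300-dimensional vector can be mapped back to the nearest vocabulary word with similar_by_vector(). A rough sketch (the reshape here is an assumption about my own output shape):

prediction = keras_model.predict(source[:1]).reshape(-1, 300)  # one sentence, one vector per word
decoded_words = [word2vec_model.wv.similar_by_vector(vec, topn=1)[0][0] for vec in prediction]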
