Injecting pre-trained word2vec vectors into TensorFlow seq2seq
I was trying to inject pretrained word2vec vectors into an existing TensorFlow seq2seq model.
Following this answer, I produced the code below. But it doesn't seem to improve performance as it should, although the values in the variable are updated.
As far as I understand, the problem might be that EmbeddingWrapper or embedding_attention_decoder creates its embeddings independently of the vocabulary order?
What would be the best way to load pretrained vectors into a TensorFlow model?
import sys

import tensorflow as tf
from tensorflow.python.platform import gfile
import word2vec

SOURCE_EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
TARGET_EMBEDDING_KEY = "embedding_attention_seq2seq/embedding_attention_decoder/embedding"


def inject_pretrained_word2vec(session, word2vec_path, input_size, dict_dir,
                               source_vocab_size, target_vocab_size):
    word2vec_model = word2vec.load(word2vec_path, encoding="latin-1")
    print("w2v model created!")
    session.run(tf.initialize_all_variables())

    assign_w2v_pretrained_vectors(session, word2vec_model, SOURCE_EMBEDDING_KEY,
                                  source_vocab_path, source_vocab_size)
    assign_w2v_pretrained_vectors(session, word2vec_model, TARGET_EMBEDDING_KEY,
                                  target_vocab_path, target_vocab_size)


def assign_w2v_pretrained_vectors(session, word2vec_model, embedding_key,
                                  vocab_path, vocab_size):
    vectors_variable = [v for v in tf.trainable_variables() if embedding_key in v.name]
    if len(vectors_variable) != 1:
        print("Word vector variable not found or too many. key: " + embedding_key)
        print("Existing embedding trainable variables:")
        print([v.name for v in tf.trainable_variables() if "embedding" in v.name])
        sys.exit(1)

    vectors_variable = vectors_variable[0]
    vectors = vectors_variable.eval()

    with gfile.GFile(vocab_path, mode="r") as vocab_file:
        counter = 0
        while counter < vocab_size:
            vocab_w = vocab_file.readline().replace("\n", "")
            # For each word in the vocabulary, check whether a w2v vector
            # exists and inject it; otherwise leave the initialized value.
            if vocab_w in word2vec_model:
                w2w_word_vector = word2vec_model.get_vector(vocab_w)
                vectors[counter] = w2w_word_vector
            counter += 1

    session.run([vectors_variable.initializer],
                {vectors_variable.initializer.inputs[1]: vectors})
I am not familiar with the seq2seq example, but in general you can use the following code snippet to inject your embeddings:
Where you build your graph:
with tf.device("/cpu:0"):
    embedding = tf.get_variable("embedding", [vocabulary_size, embedding_size])
    inputs = tf.nn.embedding_lookup(embedding, input_data)
When you execute (after building your graph and before starting the training), just assign your saved embeddings to the embedding variable:
session.run(tf.assign(embedding, embeddings_that_you_want_to_use))
The idea is that the embedding_lookup will replace the input_data values with the corresponding rows of the embedding variable.
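The lookup itself is just a row gather. In NumPy terms (a sketch of the behavior, not TensorFlow's actual implementation):

```python
import numpy as np

# Row i holds the vector for word id i.
embedding = np.array([[0.0, 0.0],
                      [1.0, 2.0],
                      [3.0, 4.0]], dtype=np.float32)
input_data = np.array([2, 1, 1])  # a batch of word ids

# tf.nn.embedding_lookup(embedding, input_data) behaves like fancy indexing:
inputs = embedding[input_data]
print(inputs)
# [[3. 4.]
#  [1. 2.]
#  [1. 2.]]
```

Because the lookup is pure indexing, whatever you assign into the embedding variable beforehand is exactly what the model sees for each word id.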