
Injecting pre-trained word2vec vectors into TensorFlow seq2seq

I am trying to inject pre-trained word2vec vectors into an existing TensorFlow seq2seq model.

Following this answer, I produced the code below. It does not seem to improve performance, however, even though the values in the variable are updated.

As I understand it, the problem might be that EmbeddingWrapper or embedding_attention_decoder creates its embedding independently of the vocabulary order?

What would be the best way to load pre-trained vectors into the TensorFlow model?

import sys

import tensorflow as tf
import word2vec
from tensorflow.python.platform import gfile

SOURCE_EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
TARGET_EMBEDDING_KEY = "embedding_attention_seq2seq/embedding_attention_decoder/embedding"


def inject_pretrained_word2vec(session, word2vec_path, source_vocab_path, target_vocab_path,
                               source_vocab_size, target_vocab_size):
    word2vec_model = word2vec.load(word2vec_path, encoding="latin-1")
    print("w2v model created!")
    session.run(tf.initialize_all_variables())

    assign_w2v_pretrained_vectors(session, word2vec_model, SOURCE_EMBEDDING_KEY, source_vocab_path, source_vocab_size)
    assign_w2v_pretrained_vectors(session, word2vec_model, TARGET_EMBEDDING_KEY, target_vocab_path, target_vocab_size)


def assign_w2v_pretrained_vectors(session, word2vec_model, embedding_key, vocab_path, vocab_size):
    vectors_variable = [v for v in tf.trainable_variables() if embedding_key in v.name]
    if len(vectors_variable) != 1:
        print("Word vector variable not found or too many. key: " + embedding_key)
        print("Existing embedding trainable variables:")
        print([v.name for v in tf.trainable_variables() if "embedding" in v.name])
        sys.exit(1)

    vectors_variable = vectors_variable[0]
    # Read out the current (randomly initialized) embedding matrix.
    vectors = vectors_variable.eval()

    with gfile.GFile(vocab_path, mode="r") as vocab_file:
        counter = 0
        while counter < vocab_size:
            vocab_w = vocab_file.readline().replace("\n", "")
            # For each word in the vocabulary, inject the w2v vector if one
            # exists; otherwise keep the randomly initialized row.
            if vocab_w in word2vec_model:
                vectors[counter] = word2vec_model.get_vector(vocab_w)
            counter += 1

    # Re-run the variable's initializer, feeding the patched matrix in place
    # of the original initial value.
    session.run([vectors_variable.initializer],
                {vectors_variable.initializer.inputs[1]: vectors})
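
For reference, a minimal sketch of how the injection function above might be driven; create_seq2seq_model and all paths and sizes below are hypothetical placeholders, not part of the original code:

# Hypothetical driver; names, paths, and sizes are illustrative assumptions.
with tf.Session() as session:
    # The seq2seq graph must be built first, so that the embedding variables
    # exist under the SOURCE_EMBEDDING_KEY / TARGET_EMBEDDING_KEY scopes.
    model = create_seq2seq_model()  # stand-in for your model construction

    inject_pretrained_word2vec(session,
                               word2vec_path="data/word2vec.bin",
                               source_vocab_path="data/vocab.source",
                               target_vocab_path="data/vocab.target",
                               source_vocab_size=40000,
                               target_vocab_size=40000)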

I am not familiar with the seq2seq example, but in general you can use the following snippet to inject your embeddings:

Where you build the graph:

with tf.device("/cpu:0"):
  embedding = tf.get_variable("embedding", [vocabulary_size, embedding_size])      
  inputs = tf.nn.embedding_lookup(embedding, input_data)

At execution time (after building the graph and before starting the training), just assign your saved embeddings to the embedding variable:

session.run(tf.assign(embedding, embeddings_that_you_want_to_use))

The idea is that embedding_lookup will replace the input_data values with those present in the embedding variable.
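
Putting both steps together, here is a minimal runnable sketch of this pattern (TensorFlow 1.x API). The placeholder-based variant below is a common refinement that avoids baking the large matrix into the graph definition as a constant; vocabulary_size, embedding_size, and the random pretrained matrix are assumed stand-ins:

import numpy as np
import tensorflow as tf

vocabulary_size = 10000  # assumed for illustration
embedding_size = 300     # assumed for illustration

# Graph construction: keep the (large) embedding variable on the CPU.
with tf.device("/cpu:0"):
    embedding = tf.get_variable("embedding", [vocabulary_size, embedding_size])
    input_data = tf.placeholder(tf.int32, [None, None])
    inputs = tf.nn.embedding_lookup(embedding, input_data)

# Feed the pretrained matrix through a placeholder rather than a constant,
# so the weights are not serialized into the graph definition.
embedding_placeholder = tf.placeholder(tf.float32, [vocabulary_size, embedding_size])
embedding_init = embedding.assign(embedding_placeholder)

# Stand-in for real word2vec vectors, with row i aligned to vocabulary id i.
pretrained = np.random.rand(vocabulary_size, embedding_size)

with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    session.run(embedding_init, feed_dict={embedding_placeholder: pretrained})

Note that row i of the pretrained matrix must correspond to the word whose id is i in the vocabulary, which is exactly the alignment the question's loop over the vocabulary file establishes.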
