

How to import word2vec into TensorFlow Seq2Seq model?

I am playing with the TensorFlow sequence-to-sequence translation model. I was wondering if I could import my own word2vec into this model, rather than using the original 'dense representation' mentioned in the tutorial.

From my point of view, it looks like TensorFlow is using a one-hot representation for the seq2seq model. Firstly, for the function tf.nn.seq2seq.embedding_attention_seq2seq, the encoder's input is a tokenized symbol, e.g. 'a' would be '4' and 'dog' would be '15715', etc., and it requires num_encoder_symbols. So I think it wants me to provide the position of each word and the total number of words, and then the function can represent the word in a one-hot representation. I am still learning the source code, but it is hard to understand.
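For illustration, a minimal sketch of what I mean by the token-ID input (the vocabulary and the numbers here are made up, not taken from the tutorial):

    # Toy vocabulary: each word is just an integer ID (the real one is much larger).
    vocab = {"_PAD": 0, "_GO": 1, "_EOS": 2, "_UNK": 3, "a": 4, "dog": 15715}
    num_encoder_symbols = 40000  # total vocabulary size (made-up number here)

    # The encoder inputs are lists of these IDs, not dense vectors.
    sentence = ["a", "dog"]
    encoder_input_ids = [vocab.get(w, vocab["_UNK"]) for w in sentence]
    print(encoder_input_ids)  # [4, 15715]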

Could anyone give me an idea on the above problem?

The seq2seq embedding_* functions indeed create embedding matrices very similar to those from word2vec. They are a variable named something like this:

EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"

Knowing this, you can just modify this variable. I mean, get your word2vec vectors in some format, say a text file. Assuming you have your vocabulary in model.vocab, you can then assign the read vectors in the way illustrated by the snippet below (it's just a snippet, you'll have to change it to make it work, but I hope it shows the idea).

    import sys

    import numpy as np
    import tensorflow as tf
    from tensorflow.python.platform import gfile

    # model.vocab, vec_size, FLAGS.word_vector_file and session are assumed
    # to come from the surrounding training script.
    vectors_variable = [v for v in tf.trainable_variables()
                        if EMBEDDING_KEY in v.name]
    if len(vectors_variable) != 1:
      print("Word vector variable not found or too many.")
      sys.exit(1)
    vectors_variable = vectors_variable[0]
    vectors = vectors_variable.eval()
    print("Setting word vectors from %s" % FLAGS.word_vector_file)
    with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
      # Lines have the format: dog 0.045123 -0.61323 0.413667 ...
      for line in f:
        line_parts = line.split()
        # The first part is the word.
        word = line_parts[0]
        if word in model.vocab:
          # The remaining parts are the components of the vector.
          word_vector = np.array([float(x) for x in line_parts[1:]])
          if len(word_vector) != vec_size:
            print("Warn: word '%s', expected vector size %d, found %d"
                  % (word, vec_size, len(word_vector)))
          else:
            vectors[model.vocab[word]] = word_vector
    # Assign the modified matrix to the embedding variable in the graph by
    # feeding it through the variable's initializer op.
    session.run([vectors_variable.initializer],
                {vectors_variable.initializer.inputs[1]: vectors})
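If feeding the initializer op feels fragile, a plain assign should work as well (just a sketch, using the same session and vectors as above):

    # Overwrite the embedding variable with the modified matrix in one go.
    session.run(tf.assign(vectors_variable, vectors))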

I guess with the scope style, which Matthew mentioned, you can get the variable:

    with tf.variable_scope("embedding_attention_seq2seq"):
        with tf.variable_scope("RNN"):
            with tf.variable_scope("EmbeddingWrapper", reuse=True):
                # shape is [vocab_size, embedding_size]; set trainable= as needed.
                embedding = tf.get_variable("embedding",
                                            shape=[num_symbols, embedding_size])

Also, I would imagine you would want to inject embeddings into the decoder as well; the key (or scope) for it would be something like:

"embedding_attention_seq2seq/embedding_attention_decoder/embedding" “embedding_attention_seq2seq / embedding_attention_decoder /嵌入”


Thanks for your answer, Lukasz!

I was wondering, what exactly does model.vocab[word] in the code snippet stand for? Just the position of the word in the vocabulary?

In that case, wouldn't it be faster to iterate through the vocabulary and inject w2v vectors only for the words that exist in the w2v model?
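For example, something along these lines (just a sketch; w2v_model stands for whatever word2vec lookup you have, e.g. a dict mapping word to vector):

    # Iterate over the (usually smaller) model vocabulary and copy over
    # only the vectors that also exist in the word2vec model.
    for word, idx in model.vocab.items():
      if word in w2v_model:
        vectors[idx] = w2v_model[word]
    session.run(tf.assign(vectors_variable, vectors))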
