
Using a pre-trained word embedding (word2vec or GloVe) in TensorFlow

I've recently reviewed an interesting implementation for convolutional text classification. However, all the TensorFlow code I've reviewed uses random (not pre-trained) embedding vectors like the following:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

Does anybody know how to use the results of a word2vec or GloVe pre-trained word embedding instead of a random one?

There are a few ways that you can use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns, and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().
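For example, a minimal sketch of building such a NumPy array from a GloVe text file could look like this (the file path, vocabulary list and dimensions below are placeholders, not part of the original answer):

import numpy as np

glove_path = "glove.6B.100d.txt"   # hypothetical path to a GloVe text file
vocab = ["the", "cat", "sat"]      # your model's vocabulary, in row order
embedding_dim = 100

# parse lines of the form "word v1 v2 ... vN" into a dict of vectors
glove = {}
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

# rows follow the vocabulary order; out-of-vocabulary words keep random vectors
embedding = np.random.uniform(-1.0, 1.0, (len(vocab), embedding_dim)).astype(np.float32)
for i, word in enumerate(vocab):
    if word in glove:
        embedding[i] = glove[word]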

  1. Simply create W as a tf.constant() that takes embedding as its value:

     W = tf.constant(embedding, name="W")

    This is the easiest approach, but it is not memory efficient, because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.

  2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():

     W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                     trainable=False, name="W")
     embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
     embedding_init = W.assign(embedding_placeholder)
     # ...
     sess = tf.Session()
     sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

    This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix at once (one for the NumPy array, and one for the tf.Variable). Note that I've assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.

  3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:

     W = tf.Variable(...)
     embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
     # ...
     sess = tf.Session()
     embedding_saver.restore(sess, "checkpoint_filename.ckpt")

I use this approach to load and share embeddings:

W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)
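For instance, a minimal TF 1.x sketch of sharing such a variable between two parts of a graph could look like the following (the scope name, the placeholders and the random stand-in matrix are illustrative assumptions, not part of the original answer):

import numpy as np
import tensorflow as tf

embedding = np.random.rand(10000, 100).astype(np.float32)  # stand-in for real pre-trained vectors

def embed(ids, reuse):
    # tf.get_variable returns the same "shared/W" variable when reuse=True
    with tf.variable_scope("shared", reuse=reuse):
        W = tf.get_variable(name="W", shape=embedding.shape,
                            initializer=tf.constant_initializer(embedding),
                            trainable=False)
    return tf.nn.embedding_lookup(W, ids)

encoder_ids = tf.placeholder(tf.int32, shape=[None, None])
decoder_ids = tf.placeholder(tf.int32, shape=[None, None])
encoder_emb = embed(encoder_ids, reuse=False)
decoder_emb = embed(decoder_ids, reuse=True)   # looks up the same shared W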

2.0 Compatible Answer: There are many pre-trained embeddings, which were developed by Google and which have been open-sourced.

Some of them are Universal Sentence Encoder (USE), ELMo, BERT, etc., and it is very easy to reuse them in your code.

Code to reuse the pre-trained embedding, Universal Sentence Encoder, is shown below:

  !pip install "tensorflow_hub>=0.6.0"
  !pip install "tensorflow>=2.0.0"

  import tensorflow as tf
  import tensorflow_hub as hub

  module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
  embed = hub.KerasLayer(module_url)
  embeddings = embed(["A long sentence.", "single-word",
                      "http://example.com"])
  print(embeddings.shape)  # (3, 512)

For more information on the pre-trained embeddings developed and open-sourced by Google, refer to the TF Hub Link.

The answer of @mrry is not right, because it causes the embedding weights to be overwritten each time the network is run, so if you are following a minibatch approach to train your network, you end up overwriting the embedding weights. So, in my point of view, the right way to use pre-trained embeddings is:

embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix)))

With TensorFlow version 2, it is quite easy if you use the Embedding layer:

X = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=300,
                              input_length=Length_of_input_sequences,
                              embeddings_initializer=tf.keras.initializers.Constant(matrix_of_pretrained_weights)
                              )(ur_inp)
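A minimal end-to-end sketch of this approach (the vocabulary size, sequence length and random stand-in matrix below are illustrative assumptions):

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim, seq_len = 10000, 300, 20
pretrained_matrix = np.random.rand(vocab_size, embedding_dim)  # stand-in for word2vec/GloVe rows

inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embedding_dim,
                              embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
                              trainable=False  # freeze the pre-trained vectors
                              )(inputs)
model = tf.keras.Model(inputs, x)
print(model(np.zeros((2, seq_len), dtype="int32")).shape)  # (2, 20, 300)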

I was also facing an embedding issue, so I wrote a detailed tutorial with a dataset. Here I would like to add what I tried; you can also try this method:

import tensorflow as tf
import numpy as np

tf.reset_default_graph()

input_x = tf.placeholder(tf.int32, shape=[None, None])

# you have to edit the shape according to your vocabulary and embedding size;
# word_embedding is the pre-trained embedding matrix loaded from the dataset
Word_embedding = tf.get_variable(name="W", shape=[400000, 100],
                                 initializer=tf.constant_initializer(np.array(word_embedding)),
                                 trainable=False)
embedding_lookup = tf.nn.embedding_lookup(Word_embedding, input_x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for ii in final_:   # final_: your batches of word-id sequences
        print(sess.run(embedding_lookup, feed_dict={input_x: [ii]}))

Here is the working, detailed tutorial IPython example; if you want to understand it from scratch, take a look.
