Using pre-trained word embeddings (word2vec or GloVe) in TensorFlow

Question

I've recently reviewed an interesting implementation for convolutional text classification. However, all the TensorFlow code I've reviewed uses random (not pre-trained) embedding vectors, like the following:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

Does anybody know how to use the results of word2vec or a GloVe pre-trained word embedding instead of a random one?

Answer 1

There are a few ways that you can use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns, and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().

1. Simply create W as a tf.constant() that takes embedding as its value:
```
W = tf.constant(embedding, name="W")
```
This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.

2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():
```
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")

embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)

# ...
sess = tf.Session()

sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
```
This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I've assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.

3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:
```
W = tf.Variable(...)

embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})

# ...
sess = tf.Session()
embedding_saver.restore(sess, "checkpoint_filename.ckpt")
```
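
The options above all start from a NumPy array called embedding, but the answer does not show how to build one. As a minimal sketch, assuming the vectors come in the GloVe plain-text layout (one token per line, followed by its space-separated components), the array and a token-to-row mapping could be built roughly like this. The helper name load_glove_text and the three-line sample are made up for illustration; a real file would be opened with open(...) instead of io.StringIO.

```python
import io

import numpy as np


def load_glove_text(handle):
    """Parse GloVe-style text into (vocab, matrix).

    vocab maps each token to its row index in the returned matrix.
    """
    vocab = {}
    vectors = []
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vocab[parts[0]] = len(vectors)
        vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return vocab, np.stack(vectors)


# Hypothetical three-token sample standing in for a real GloVe file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\nsat 0.7 0.8 0.9\n")
vocab, embedding = load_glove_text(sample)
vocab_size, embedding_dim = embedding.shape
print(vocab_size, embedding_dim)
```

The row indices stored in vocab are the ids that would later be passed to tf.nn.embedding_lookup(), so the same mapping has to be used when converting input text to ids.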

Answer 2

I use this method to load and share embeddings.

W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)
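
The call above relies on the TensorFlow 1.x API (tf.get_variable). Under TensorFlow 2.x with eager execution, a roughly equivalent frozen lookup table might be sketched as follows. The small random matrix and the token ids are placeholders invented for illustration, not data from this thread.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for a real pre-trained matrix: 5 tokens, 3 dimensions.
embedding = np.random.rand(5, 3).astype(np.float32)

# The variable is initialized once from the array; trainable=False keeps
# optimizers from updating it.
W = tf.Variable(embedding, trainable=False, name="W")

# Hypothetical batch of token ids with shape [batch, sequence_length].
token_ids = tf.constant([[0, 2], [4, 1]])
looked_up = tf.nn.embedding_lookup(W, token_ids)

print(looked_up.shape)
```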

Answer 3

2.0 Compatible Answer: There are many pre-trained embeddings that were developed by Google and have been open-sourced.

Some of them are Universal Sentence Encoder (USE), ELMo, BERT, etc., and it is very easy to reuse them in your code.

Code to reuse the pre-trained embedding Universal Sentence Encoder is shown below:

  !pip install "tensorflow_hub>=0.6.0"
  !pip install "tensorflow>=2.0.0"

  import tensorflow as tf
  import tensorflow_hub as hub

  module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
  embed = hub.KerasLayer(module_url)
  embeddings = embed(["A long sentence.", "single-word",
                      "http://example.com"])
  print(embeddings.shape)  # (3, 512): USE returns 512-dimensional vectors

For more information on the pre-trained embeddings developed and open-sourced by Google, refer to the TF Hub link.

Answer 4

The answer of @mrry is not right, because it causes the embedding weights to be overwritten each time the network is run, so if you are following a minibatch approach to train your network, you are overwriting the weights of the embeddings. So, from my point of view, the right way to load pre-trained embeddings is:

embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix)))

Answer 5

With TensorFlow version 2 it's quite easy if you use the Embedding layer:

X=tf.keras.layers.Embedding(input_dim=vocab_size,
                            output_dim=300,
                            input_length=Length_of_input_sequences,
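                            # Wrap the matrix in an initializer; a raw array is not accepted here
                            embeddings_initializer=tf.keras.initializers.Constant(matrix_of_pretrained_weights)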
                            )(ur_inp)
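
A self-contained sketch of this approach, assuming a small made-up matrix in place of real pre-trained weights. Here the array is wrapped in tf.keras.initializers.Constant, since embeddings_initializer takes an initializer rather than a raw array, and trainable=False is an added assumption to keep the vectors frozen during training.

```python
import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 6, 3

# Hypothetical "pre-trained" weights; a real matrix would be loaded from disk.
pretrained = np.arange(vocab_size * embedding_dim, dtype=np.float32).reshape(
    vocab_size, embedding_dim
)

layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=False,  # drop this to fine-tune the vectors instead
)

# Hypothetical batch of two padded sequences of token ids.
token_ids = tf.constant([[0, 1, 2, 3], [5, 4, 0, 0]])
vectors = layer(token_ids)

print(vectors.shape)
```

Calling the layer once builds it, after which layer.get_weights()[0] should match the matrix that was passed in.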

Answer 6

I was also facing an embedding issue, so I wrote a detailed tutorial with a dataset. Here I would like to add what I tried; you can also try this method:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

# Assumed to be defined elsewhere:
#   word_embedding: pre-trained matrix with 400000 rows and 100 columns
#   final_: iterable of token-id sequences to look up

tf.reset_default_graph()

input_x = tf.placeholder(tf.int32, shape=[None, None])

# You have to edit the shape according to your embedding size
Word_embedding = tf.get_variable(name="W", shape=[400000, 100], initializer=tf.constant_initializer(np.array(word_embedding)), trainable=False)
embedding_lookup = tf.nn.embedding_lookup(Word_embedding, input_x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for ii in final_:
        print(sess.run(embedding_lookup, feed_dict={input_x: [ii]}))

Here is a working, detailed IPython tutorial example; take a look if you want to understand it from scratch.

6 answers

Answer 1: 131, accepted, 2016-02-28 20:59:12
Answer 2: 33, 2016-04-27 03:58:22
Answer 3: 10, 2020-01-08 12:23:02
Answer 4: 6, 2016-10-24 09:25:01
Answer 5: 5, 2020-02-21 01:29:59
Answer 6: 3, 2018-04-11 15:59:13
