Using a pre-trained word embedding (word2vec or GloVe) in TensorFlow
I've recently reviewed an interesting implementation for convolutional text classification. However, all the TensorFlow code I've reviewed uses random (not pre-trained) embedding vectors, like the following:
with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
Does anybody know how to use the results of word2vec or a GloVe pre-trained word embedding instead of a random one?
There are a few ways to use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns, and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().
1. Simply create W as a tf.constant() that takes embedding as its value:
W = tf.constant(embedding, name="W")
This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.
2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")

embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)

# ...
sess = tf.Session()

sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I've assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.
3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:
W = tf.Variable(...)

embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})

# ...
sess = tf.Session()
embedding_saver.restore(sess, "checkpoint_filename.ckpt")
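All three options above assume the pre-trained vectors are already sitting in a NumPy array called embedding. As a rough sketch of how that array might be built, the snippet below parses text in the GloVe format (one token per line, followed by its space-separated float components). The helper name load_glove_text and the tiny inline sample are assumptions made for illustration; for a real file you would pass an open file handle instead of the StringIO object.

```python
import io

import numpy as np


def load_glove_text(lines):
    """Parse GloVe-style text lines into (vocab, embedding).

    Hypothetical helper: `vocab` maps each token to its row index and
    `embedding` is a float32 array of shape [vocab_size, embedding_dim].
    """
    vocab = {}
    rows = []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, values = parts[0], parts[1:]
        vocab[token] = len(rows)
        rows.append(np.asarray(values, dtype=np.float32))
    return vocab, np.vstack(rows)


# Made-up sample standing in for a real GloVe file.
sample = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "sat 0.9 1.0 1.1 1.2\n"
)
vocab, embedding = load_glove_text(sample)
vocab_size, embedding_dim = embedding.shape
print(vocab_size, embedding_dim)  # 3 4
```

The vocab dictionary is what you would use to turn tokens into the integer ids that are fed to tf.nn.embedding_lookup().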
I use this method to load and share the embedding.
W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)
2.0-compatible answer: There are many pre-trained embeddings that were developed by Google and have been open-sourced. Some of them are Universal Sentence Encoder (USE), ELMo, BERT, etc., and it is very easy to reuse them in your code.
Code to reuse a pre-trained embedding, the Universal Sentence Encoder, is shown below:
!pip install "tensorflow_hub>=0.6.0"
!pip install "tensorflow>=2.0.0"
import tensorflow as tf
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.KerasLayer(module_url)
embeddings = embed(["A long sentence.", "single-word",
                    "http://example.com"])
print(embeddings.shape)  # (3, 512)
For more information about the pre-trained embeddings developed and open-sourced by Google, refer to TF Hub.
The answer from @mrry is not right because it causes the embedding weights to be overwritten each time the network is run, so if you are following a minibatch approach to train your network, you are overwriting the weights of the embeddings. So, from my point of view, the right way to use pre-trained embeddings is:
embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix)))
With TensorFlow version 2 it's quite easy if you use the Embedding layer:
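X = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=300,
    input_length=Length_of_input_sequences,
    # A raw matrix is not a valid initializer, so wrap it in Constant.
    embeddings_initializer=tf.keras.initializers.Constant(matrix_of_pretrained_weights),
)(ur_inp)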
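The matrix handed to the Embedding layer has to be ordered so that row i holds the vector for the token with id i. A minimal sketch of that alignment step follows. The names word_index, pretrained and build_embedding_matrix are hypothetical, id 0 is assumed to be reserved for padding, and tokens without a pre-trained vector are simply left as zero rows.

```python
import numpy as np


def build_embedding_matrix(word_index, pretrained, embedding_dim):
    """Return a [len(word_index) + 1, embedding_dim] float32 matrix.

    Hypothetical helper: `word_index` maps token -> integer id (ids start
    at 1, id 0 is assumed to be padding) and `pretrained` maps
    token -> vector.
    """
    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype=np.float32)
    for token, idx in word_index.items():
        vector = pretrained.get(token)
        if vector is not None:
            matrix[idx] = vector  # tokens with no vector stay all-zero
    return matrix


# Made-up vocabulary and vectors for illustration only.
word_index = {"the": 1, "cat": 2, "zyzzyva": 3}
pretrained = {
    "the": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "cat": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}
matrix_of_pretrained_weights = build_embedding_matrix(word_index, pretrained, 3)
print(matrix_of_pretrained_weights.shape)  # (4, 3)
```

With a matrix built this way, input_dim would be its number of rows and output_dim its number of columns.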
I was also facing an embedding issue, so I wrote a detailed tutorial with a dataset. Here I would like to add what I tried. You can also try this method:
import numpy as np
import tensorflow as tf

tf.reset_default_graph()

input_x = tf.placeholder(tf.int32, shape=[None, None])

# You have to edit the shape according to your embedding size.
# word_embedding is assumed to be the pre-trained matrix, loaded beforehand.
Word_embedding = tf.get_variable(name="W", shape=[400000, 100], initializer=tf.constant_initializer(np.array(word_embedding)), trainable=False)
embedding_lookup = tf.nn.embedding_lookup(Word_embedding, input_x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # final_ is assumed to hold the sequences of token ids to look up.
    for ii in final_:
        print(sess.run(embedding_lookup, feed_dict={input_x: [ii]}))
Here is a detailed, working tutorial as an IPython example; take a look if you want to understand it from scratch.
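For intuition about what the lookup returns: on a single, unsharded matrix, tf.nn.embedding_lookup behaves like row indexing. The NumPy sketch below uses a made-up 5 x 3 matrix and made-up ids; it only illustrates the indexing and is not TensorFlow code.

```python
import numpy as np

# Made-up embedding matrix: 5 tokens, 3 dimensions each.
W = np.arange(15, dtype=np.float32).reshape(5, 3)

# A batch of two sequences, each holding three token ids.
input_x = np.array([[0, 2, 4],
                    [1, 1, 3]])

# Integer-array indexing gathers one row of W per id.
embedded = W[input_x]
print(embedded.shape)  # (2, 3, 3)
print(embedded[0, 1])  # [6. 7. 8.]
```

The result has the shape of the id array with one extra trailing axis of size embedding_dim, which is why the question's code can expand it with tf.expand_dims before the convolution.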