gensim Doc2Vec vs tensorflow Doc2Vec
I'm trying to compare my implementation of Doc2Vec (via tensorflow) and gensim's implementation. At least visually, the gensim one seems to be performing better.

I ran the following code to train the gensim model, and the code below that for the tensorflow model. My questions are as follows:
Does the window=5 parameter in gensim mean that I am using two words on either side to predict the middle one, or is it 5 on either side?

model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
model.build_vocab(corpus)
epochs = 100
for i in range(epochs):
    model.train(corpus)
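For context, gensim's Doc2Vec expects the training corpus to be an iterable of TaggedDocument objects. Below is a minimal sketch of how such a corpus might be built; the variable names (raw_docs, corpus) and the sample sentences are illustrative, not taken from the notebook.

from gensim.models.doc2vec import TaggedDocument

# Hypothetical raw documents; in the notebook these come from the actual dataset.
raw_docs = [
    "the quick brown fox jumps over the lazy dog",
    "doc2vec learns a vector for every document",
]

# Each document is tokenized and tagged with an id; gensim later uses the tag
# to look up the learned document vector (model.docvecs[tag]).
corpus = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(raw_docs)]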
import numpy as np
import tensorflow as tf

batch_size = 512
embedding_size = 100  # Dimension of the embedding vector.
num_sampled = 10      # Number of negative examples to sample.

# vocabulary_size, context_window and len_docs are assumed to be defined earlier in the notebook.
graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size // context_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // context_window, 1])

    # The variables
    word_embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    doc_embeddings = tf.Variable(tf.random_uniform([len_docs, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, (context_window + 1) * embedding_size],
                                                      stddev=1.0 / np.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    ###########################
    # Model.
    ###########################
    # Look up embeddings for inputs and stack words side by side.
    embed_words = tf.reshape(tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
                             shape=[batch_size // context_window, -1])
    embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
    embed = tf.concat(1, [embed_words, embed_docs])

    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                                                     train_labels, num_sampled, vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
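The code above only defines the graph. For completeness, here is a minimal sketch of the kind of training loop it implies. The generate_batch helper and num_steps value are hypothetical stand-ins for whatever the notebook referenced below actually does; the feed shapes follow the placeholders defined above.

num_steps = 10000  # hypothetical number of training steps

with tf.Session(graph=graph) as session:
    # On TensorFlow versions older than 0.12, use tf.initialize_all_variables() instead.
    session.run(tf.global_variables_initializer())

    for step in range(num_steps):
        # generate_batch is a hypothetical helper that yields:
        #   word_batch:  int word indices, shape [batch_size]
        #   doc_batch:   int doc ids,      shape [batch_size // context_window]
        #   label_batch: int word indices, shape [batch_size // context_window, 1]
        word_batch, doc_batch, label_batch = generate_batch(batch_size, context_window)

        feed_dict = {train_word_dataset: word_batch,
                     train_doc_dataset: doc_batch,
                     train_labels: label_batch}

        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)

        if step % 1000 == 0:
            print('step %d, loss %.4f' % (step, loss_val))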
Check out the jupyter notebook here (I have both models working and tested in it). It still feels like the gensim model is performing better in this initial analysis.
Old question, but an answer would be useful for future visitors. So here are some of my thoughts.
There are some problems in the tensorflow implementation:

- window is the 1-side size, so window=5 would be 5*2+1 = 11 words.
- batch_size would be the number of documents. So the train_word_dataset shape would be batch_size * context_window, while the train_doc_dataset and train_labels shapes would be batch_size.
- sampled_softmax_loss is not negative_sampling_loss. They are two different approximations of softmax_loss; a sketch of the closer negative-sampling analogue in tensorflow follows this list.
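To illustrate that last point, here is a minimal sketch of how the loss line in the question's graph could be swapped for tf.nn.nce_loss, which, like gensim's negative sampling, turns training into distinguishing true context words from sampled noise words, whereas sampled_softmax_loss approximates the full softmax. This is an illustrative substitution, not the OP's code.

# Alternative to the sampled softmax line, using noise-contrastive estimation,
# which is much closer to gensim's negative sampling.
# Note: the argument order shown matches the pre-1.0 tensorflow API used in the
# question (inputs before labels); on tensorflow >= 1.0 it is
# nce_loss(weights, biases, labels, inputs, num_sampled, num_classes).
loss = tf.reduce_mean(tf.nn.nce_loss(softmax_weights, softmax_biases, embed,
                                     train_labels, num_sampled, vocabulary_size))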
So for the OP's listed questions:
1. doc2vec in tensorflow is working and correct in its own way, but it is different from both the gensim implementation and the paper.

2. window is the 1-side size, as said above. If the document size is less than the context size, the smaller one is used.

3. There are many reasons why the gensim implementation is faster. First, gensim was optimized heavily; all operations are faster than naive Python operations, especially data I/O. Second, some preprocessing steps such as min_count filtering in gensim reduce the dataset size. More importantly, gensim uses negative_sampling_loss, which is much faster than sampled_softmax_loss; I guess this is the main reason.