gensim Doc2Vec vs tensorflow Doc2Vec

I'm trying to compare my implementation of Doc2Vec (via tf) and gensim's implementation. At least visually, the gensim one seems to be performing better.

I ran the following code to train the gensim model, and the code below that for the tensorflow model. My questions are as follows:

  1. Is my tf implementation of Doc2Vec correct? Basically, is it supposed to concatenate the word vectors and the document vector to predict the middle word in a certain context?
  2. Does the window=5 parameter in gensim mean that I am using two words on either side to predict the middle one? Or is it 5 on either side? The thing is, there are quite a few documents shorter than length 10.
  3. Any insights as to why gensim is performing better? Is my model any different from how they implement it?
  4. Considering that this is effectively a matrix factorisation problem, why is the TF model even getting an answer? There are infinitely many solutions since it is a rank-deficient problem. <- This last question is simply a bonus.

Gensim

from gensim.models.doc2vec import Doc2Vec

# PV-DM with concatenated context vectors; `corpus` is an iterable of
# TaggedDocument objects and `cores` is the number of worker threads.
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
model.build_vocab(corpus)
epochs = 100
for i in range(epochs):
    model.train(corpus)
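
For reference, here is a minimal sketch of how corpus might be built, assuming the raw text already lives in a list of token lists (the name docs below is hypothetical):

from gensim.models.doc2vec import TaggedDocument

docs = [['the', 'quick', 'brown', 'fox'], ['another', 'short', 'document']]  # hypothetical tokenised documents
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)]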

TF

import numpy as np
import tensorflow as tf  # written against the pre-1.0 TensorFlow API used below

batch_size = 512
embedding_size = 100 # Dimension of the embedding vector.
num_sampled = 10 # Number of negative examples to sample.
# vocabulary_size, len_docs and context_window are assumed to be defined earlier.


graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size // context_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // context_window, 1])

    # The variables   
    word_embeddings =  tf.Variable(tf.random_uniform([vocabulary_size,embedding_size],-1.0,1.0))
    doc_embeddings = tf.Variable(tf.random_uniform([len_docs,embedding_size],-1.0,1.0))
    softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, (context_window+1)*embedding_size],
                             stddev=1.0 / np.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    ###########################
    # Model.
    ###########################
    # Look up embeddings for inputs and stack words side by side
    embed_words = tf.reshape(tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
                             shape=[batch_size // context_window, -1])
    embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
    embed = tf.concat(1, [embed_words, embed_docs])  # pre-1.0 API: the axis argument comes first
    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                                   train_labels, num_sampled, vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
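
A minimal sketch of how this graph might be driven, assuming a pre-1.0 TensorFlow session API (matching the code above) and a hypothetical generate_batch helper that yields word indices, document indices and target labels in the placeholder shapes:

num_steps = 100001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()  # pre-1.0 name; later versions use tf.global_variables_initializer()
    for step in range(num_steps):
        batch_words, batch_docs, batch_labels = generate_batch(batch_size, context_window)
        feed_dict = {train_word_dataset: batch_words,
                     train_doc_dataset: batch_docs,
                     train_labels: batch_labels}
        _, batch_loss = session.run([optimizer, loss], feed_dict=feed_dict)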

Update:

Check out the jupyter notebook here (I have both models working and tested there). It still feels like the gensim model is performing better in this initial analysis.

Old question, but an answer would be useful for future visitors. So here are some of my thoughts.

There are some problems in the tensorflow implementation:

  • window is the one-sided size, so window=5 means 5*2+1 = 11 words in total.
  • Note that with the PV-DM version of doc2vec, the batch_size would be the number of documents. So the train_word_dataset shape would be batch_size * context_window, while the train_doc_dataset and train_labels shapes would be batch_size.
  • More importantly, sampled_softmax_loss is not negative_sampling_loss. They are two different approximations of softmax_loss (see the sketch after this list).
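
As a rough sketch of that last point (an assumption about the closest TensorFlow equivalent, not a claim about gensim's exact behaviour): tf.nn.nce_loss is generally the nearer built-in to word2vec-style negative sampling, and could replace the sampled-softmax loss in the graph above while reusing the same weight and bias variables:

# Hedged sketch: swap the sampled-softmax loss for NCE, which approximates
# negative sampling (still not identical to gensim's implementation).
# Argument order follows the pre-1.0 API used above; in TF >= 1.0 the labels
# come before the inputs.
loss = tf.reduce_mean(tf.nn.nce_loss(softmax_weights, softmax_biases, embed,
                                     train_labels, num_sampled, vocabulary_size))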

So for the OP's listed questions:

  1. This implementation of doc2vec in tensorflow works and is correct in its own way, but it is different from both the gensim implementation and the paper.
  2. window is the one-sided size, as said above. If the document is shorter than the context size, then the smaller of the two is used (see the illustration after this list).
  3. There are many reasons why the gensim implementation is faster. First, gensim is heavily optimized; all operations are faster than naive Python operations, especially the data I/O. Second, some preprocessing steps, such as min_count filtering in gensim, reduce the dataset size. More importantly, gensim uses negative-sampling loss, which is much faster than sampled_softmax_loss; I guess this is the main reason.
  4. Is it easier to find something when there are many of them? Just kidding ;-)
    It's true that there are many solutions to this non-convex optimization problem, so the model will just find a local optimum. Interestingly, in neural networks most local optima are "good enough". It has been observed that stochastic gradient descent seems to find better local optima than large-batch gradient descent, although this is still an open question in current research.
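
To make point 2 concrete, here is a small illustration (plain Python, not gensim's internal code) of what a one-sided window of 5 means for the word at position i of a tokenised document:

window = 5
words = ['w%d' % k for k in range(20)]  # a hypothetical 20-token document
i = 7
# Up to `window` tokens on each side of the centre word are used as context.
context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
print(len(context))  # 10 here, i.e. up to 2*window context words around the predicted centre word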
