
Gensim equivalent of training steps

Does gensim Word2Vec have an option that is the equivalent of "training steps" in the TensorFlow word2vec example here: Word2Vec Basic? If not, what default value does gensim use? Is the gensim parameter iter related to training steps?

The TensorFlow script includes this section.

with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print('Initialized')

    average_loss = 0
    for step in xrange(num_steps):
        batch_inputs, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run())
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step ', step, ': ', average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in xrange(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s,' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()

In the TensorFlow example, if I perform t-SNE on the embeddings and plot with matplotlib, the plot looks more reasonable to me when the number of steps is high. I am using a small corpus of 1,200 emails. One way it looks more reasonable is that numbers are clustered together. I would like to attain the same apparent level of quality using gensim.

Yes, the Word2Vec class constructor has an iter argument:

iter = number of iterations (epochs) over the corpus. Default is 5.

Also, if you call the Word2Vec.train() method directly, you can pass in an epochs argument that has the same meaning.

The number of actual training steps is deduced from epochs, but depends on other parameters such as corpus size, window size, and batch size. If you're just looking to improve the quality of the embedding vectors, increasing iter is the right approach.
