Gensim equivalent of training steps

Does gensim Word2Vec have an option that is the equivalent of "training steps" in the TensorFlow word2vec example here: Word2Vec Basic? If not, what default value does gensim use? Is the gensim parameter iter related to training steps?

The TensorFlow script includes this section.

with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print('Initialized')

    average_loss = 0
    for step in xrange(num_steps):
        batch_inputs, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the optimizer op (including
        # it in the list of returned values for session.run()).
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step ', step, ': ', average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in xrange(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s,' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()

In the TensorFlow example, if I run t-SNE on the embeddings and plot the result with matplotlib, the plot looks more reasonable to me when the number of steps is high: for example, numbers are clustered together. I am using a small corpus of 1,200 emails. I would like to attain the same apparent level of quality using gensim.
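For reference, the plot can be produced like this (a sketch, assuming a trained gensim model and the gensim 3.x attribute names; gensim 4.x uses index_to_key instead of index2word):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = model.wv.index2word[:200]   # the 200 most frequent words
vectors = model.wv[words]           # one embedding row per word

# Project the embeddings to 2-D for visual inspection.
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)

plt.figure(figsize=(12, 12))
for word, (x, y) in zip(words, coords):
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y))
plt.show()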

Yes, the Word2Vec class constructor has an iter argument:

iter = number of iterations (epochs) over the corpus. Default is 5.
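For a small corpus you can raise iter well above the default. A minimal sketch, assuming sentences is a hypothetical iterable of tokenized emails (one list of tokens per email) and the gensim 3.x API:

from gensim.models import Word2Vec

# 'sentences' is a placeholder for your tokenized corpus, e.g.
# [['meeting', 'at', 'ten'], ['re', 'invoice', 'attached'], ...]
model = Word2Vec(sentences, size=100, window=5, min_count=2, iter=50)
# In gensim 4.x, size= and iter= were renamed to vector_size= and epochs=.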

Also, if you call the Word2Vec.train() method directly, you can pass an epochs argument that has the same meaning.
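The explicit two-step form looks like this (a sketch under the same assumptions: gensim 3.x, sentences as above):

model = Word2Vec(size=100, window=5, min_count=2)  # no sentences: nothing trained yet
model.build_vocab(sentences)                       # one pass to collect the vocabulary
model.train(sentences, total_examples=model.corpus_count, epochs=50)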

The number of actual training steps is deduced from epochs, but also depends on other parameters like corpus size, window size, and batch size. If you're just looking to improve the quality of the embedding vectors, increasing iter is the right approach.
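To relate the two notions roughly: one epoch covers every training example once, while a TensorFlow step consumes one batch, so the number of steps per epoch is approximately the number of (center, context) pairs divided by the batch size. A back-of-the-envelope sketch with made-up numbers:

n_pairs = 500000     # hypothetical number of skip-gram pairs in the corpus
batch_size = 128     # as in the TensorFlow example
epochs = 5           # gensim's default iter
print(epochs * n_pairs // batch_size)  # roughly 19531 equivalent TensorFlow steps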
