
Gensim 3.8.3 Word2Vec is not updating the weights/parameters on a toy dataset

I am trying to train a word2vec model on a simple toy dataset of 4 sentences. The Word2Vec configuration that I need is:

  • Skip-gram model
  • no negative sampling
  • no hierarchical softmax
  • no removal or down-scaling of frequent words
  • vector size of words is 2
  • window size 4, i.e. all the words in a sentence are considered context words of each other
  • epochs can be varied from 1 to 500

The problem that I am facing is: no matter how I change the above parameters, the word vectors are not being updated/learned. The word vectors for epochs=1 and epochs=500 are the same.

from gensim.models import Word2Vec
import numpy as np
import matplotlib.pyplot as plt
import nltk  # requires the 'punkt' tokenizer data: nltk.download('punkt')

# toy dataset with 4 sentences
sents = ['what is the time',
         'what is the day',
         'what time is the meeting',
         'cancel the meeting']

sents = [nltk.word_tokenize(string) for string in sents]

# model initialization and training
model = Word2Vec(alpha=0.5, min_alpha=0.25, min_count=0, size=2, window=4,
                 workers=1, sg=1, hs=0, negative=0, sample=0, seed=42)

model.build_vocab(sents)
model.train(sents, total_examples=4, epochs=500)

# getting word vectors into an array
vocab = list(model.wv.vocab.keys())
vocab_vectors = model.wv[vocab]
print(vocab)
print(vocab_vectors)

# plotting word vectors
plt.scatter(vocab_vectors[:, 0], vocab_vectors[:, 1], c="blue")
for i, word in enumerate(vocab):
    plt.annotate(word, (vocab_vectors[i, 0], vocab_vectors[i, 1]))
plt.show()

The output of print(vocab) is as below:

['what', 'is', 'time', 'cancel', 'the', 'meeting', 'day']

The output of print(vocab_vectors) is as below:

[[ 0.08136337 -0.05059118]
 [ 0.06549312 -0.22880174]
 [-0.08925873 -0.124718  ]
 [ 0.05645624 -0.03120007]
 [ 0.15067646 -0.14344342]
 [-0.12645201  0.06202405]
 [-0.22905378 -0.01489289]]

The plotted 2D vectors (scatter plot of the word vectors, image omitted).

Why do I think the vectors are not being learned? I am changing the epochs value to 1, 10, 50, 500... and running the whole code to check the output for each run. For epochs = any value in <1, 10, 50, 500>, the output (vocab, vocab_vectors, and the plot) is the same across all runs.

By providing the parameters negative=0, hs=0, you've disabled both training modes, and no training is happening.

You should either leave the default non-zero negative value in place, or enable the non-default hierarchical-softmax mode while disabling negative sampling (with hs=1, negative=0).
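For example, a minimal sketch of the second option (hierarchical softmax enabled, negative sampling disabled), reusing the sents list and toy settings from the question and leaving alpha / min_alpha at their defaults:

# skip-gram with hierarchical softmax: hs=1, negative=0 (sketch, not the full script)
model = Word2Vec(min_count=0, size=2, window=4, workers=1,
                 sg=1, hs=1, negative=0, sample=0, seed=42)
model.build_vocab(sents)
model.train(sents, total_examples=len(sents), epochs=500)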

Other thoughts:

  • Enabling logging at the INFO level is often helpful, and might have shown progress output which better hinted to you that no real training was happening (see the snippet after this list).
  • Still, with a tiny toy dataset, the biggest hint that all training was disabled (suspiciously instant completion of training) is nearly indistinguishable from a tiny amount of training. Generally, lots of things will be weird or disappointing with tiny datasets (& tiny vector sizes), as word2vec's usual benefits really depend on large amounts of text.
  • Lowering min_count is usually a bad idea with any realistic dataset, as word2vec needs multiple varied examples of a word's usage to train useful vectors, and it's usually better to ignore rare words than mix their incomplete info in.
  • Changing the default alpha / min_alpha is also usually a bad idea, though perhaps here you were just trying extreme values to trigger any change.
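For the logging suggestion above, a typical setup is just the standard-library logging module configured at the INFO level before building and training the model; gensim then reports vocabulary-building and per-epoch training progress:

import logging

# show gensim's vocabulary-building and training progress messages
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)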
