I am trying to train a Word2Vec model on a simple toy dataset of 4 sentences; the Word2Vec configuration I need is the one shown in the initialization code below.
The problem I am facing is: no matter how I change those parameters, the word vectors are not being updated/learned. The word vectors for epochs=1 and epochs=500 are the same.
from gensim.models import Word2Vec
import numpy as np
import matplotlib.pyplot as plt
import nltk
# nltk.download('punkt')  # needed once, for word_tokenize
# toy dataset with 4 sentences
sents = ['what is the time',
         'what is the day',
         'what time is the meeting',
         'cancel the meeting']
sents = [nltk.word_tokenize(string) for string in sents]
# model initialization and training
model = Word2Vec(alpha=0.5, min_alpha=0.25, min_count=0, size=2, window=4,
                 workers=1, sg=1, hs=0, negative=0, sample=0, seed=42)
model.build_vocab(sents)
model.train(sents, total_examples=4, epochs=500)
# getting word vectors into array
vocab = list(model.wv.vocab.keys())  # materialize keys so they print and index as a list
vocab_vectors = model.wv[vocab]
print(vocab)
print(vocab_vectors)
#plotting word vectors
plt.scatter(vocab_vectors[:,0], vocab_vectors[:,1], c ="blue")
for i, word in enumerate(vocab):
    plt.annotate(word, (vocab_vectors[i,0], vocab_vectors[i,1]))
plt.show()
The output of print(vocab) is as below:
['what', 'is', 'time', 'cancel', 'the', 'meeting', 'day']
The output of print(vocab_vectors) is as below:
[[ 0.08136337 -0.05059118]
[ 0.06549312 -0.22880174]
[-0.08925873 -0.124718 ]
[ 0.05645624 -0.03120007]
[ 0.15067646 -0.14344342]
[-0.12645201 0.06202405]
[-0.22905378 -0.01489289]]
Why do I think the vectors are not being learned? I am changing the epochs value to 1, 10, 50, and 500, and running the whole code again to check the output of each run. For every one of these epochs values, the output (vocab, vocab_vectors, and the plot) is identical.
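To double-check within a single run (a minimal sanity check, re-using model, sents, and np from the code above with the same gensim 3.x API), the vectors can be snapshotted right after build_vocab() and compared after train():
before = model.wv.vectors.copy()  # randomly initialized vectors, pre-training
model.train(sents, total_examples=len(sents), epochs=500)
# If training had any effect, at least some values would differ;
# here this should print True, i.e. not a single value was updated.
print(np.allclose(before, model.wv.vectors))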
By providing the parameters negative=0, hs=0, you've disabled both training modes, and no training is happening. You should either leave the default non-zero negative value in place, or enable the non-default hierarchical-softmax mode while disabling negative-sampling (with hs=1, negative=0).
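For example, a minimal sketch of either fix, re-using sents from the question (min_count=1 is assumed here so the one-off toy words are still kept):
# Option 1: keep the default negative-sampling mode (negative=5 is gensim's default)
model = Word2Vec(min_count=1, size=2, window=4, workers=1, sg=1,
                 hs=0, negative=5, seed=42)
# Option 2: hierarchical softmax instead, with negative sampling disabled
# model = Word2Vec(min_count=1, size=2, window=4, workers=1, sg=1,
#                  hs=1, negative=0, seed=42)
model.build_vocab(sents)
model.train(sents, total_examples=len(sents), epochs=500)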
Other thoughts:
Using min_count=0 is usually a bad idea with any realistic dataset, as word2vec needs multiple varied examples of a word's usage to train useful vectors – and it's usually better to ignore rare words than mix their incomplete info in.
Raising alpha/min_alpha to unusual values is also usually a bad idea – though perhaps here you were just trying extreme values to trigger any change.
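As a rough sketch, on a realistic corpus (sentences below is a placeholder for your real tokenized data, not defined above) the stock defaults are the safer starting point:
# Stock gensim 3.x defaults: min_count=5 drops rare words, and the learning
# rate decays from alpha=0.025 to min_alpha=0.0001 over training.
model = Word2Vec(sentences, size=100, window=5, sg=1)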