I construct sentences using three words, "1", "2", and "3", in different ways, and observe that the word vectors are unchanged for each of these words.
The different sentence sets are:
Type 1: [["1", "2"], ["1", "3"]]
Type 2: [["1", "2", "3"]]
Type 3: [["1", "2"], ["3"]]
I am training a Word2Vec model as follows:
from gensim.models import Word2Vec

model = Word2Vec(sentences, min_count=1, size=2)  # note: size= was renamed vector_size= in gensim 4.0+
print (model.wv.most_similar("1"))
print (model.wv.most_similar("2"))
print (model.wv.most_similar("3"))
print (model.wv['1'])
print (model.wv['2'])
print (model.wv['3'])
And the results are the same when I change the sentence type:
[('3', 0.5377859473228455), ('2', -0.5831003785133362)]
[('1', -0.5831003189086914), ('3', -0.9985027313232422)]
[('1', 0.5377858281135559), ('2', -0.9985026717185974)]
[-0.24893647 -0.24495095]
[ 0.19231372 -0.03319569]
[-0.22207274 0.05098101]
Also, when I rename word "1" to, say, "101", the results change:
[('3', 0.5407046675682068), ('2', -0.5859125256538391)]
[('101', -0.5859125256538391), ('3', -0.9985027313232422)]
[('101', 0.540704607963562), ('2', -0.9985026717185974)]
[-0.05898098 -0.0576357 ]
[ 0.19231372 -0.03319569]
[-0.22207274 0.05098101]
I wanted to know:
Why didn't the results change when I changed the sentences?
Why did the results change when I just renamed the word?
Word2Vec as an algorithm requires large, varied datasets to train word-vectors into meaningful arrangements.
You won't get sensible results, or learn much about the algorithm's behavior or benefits, with toy-sized contrived training data.
Also note: Word2Vec uses random initialization and random sampling as part of its process, so even a run on the exact same data can have different results from run to run. However, with a realistic amount and variety of data, each run should result in a model that's about as useful as any other run (even though many of the exact positions/relative rankings may vary).
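The effect of seeding can be sketched without gensim at all. This is a hypothetical illustration using only Python's random module, not gensim's actual initialization code (gensim does accept a seed parameter, but full run-to-run determinism there also requires a single worker thread):

```python
import random

def init_vector(seed=None, dims=2):
    """Toy stand-in for random vector initialization: with a fixed seed
    the same 'random' initial values recur; without one they differ."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(dims)]

print(init_vector(seed=42) == init_vector(seed=42))  # True: same seed, same init
```

Unseeded runs (seed=None) draw from system entropy, which is why repeated trainings on identical toy data still land the words in different positions.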
A nested list-of-lists, like your 1st "sentence" ([["1", "2"], ["1", "3"]]), isn't valid training input for Word2Vec. Each "sentence" should be a simple list of string tokens (words).
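The required shape can be checked with a small sketch. The helper below is a hypothetical illustration, not part of gensim's API: the corpus is an iterable of sentences, and each sentence must be a flat list of string tokens, never a nested list or a bare string.

```python
def is_valid_corpus(sentences):
    """Hypothetical check: every sentence is a flat list of string tokens."""
    return all(
        isinstance(sent, list) and all(isinstance(tok, str) for tok in sent)
        for sent in sentences
    )

print(is_valid_corpus([["1", "2", "3"]]))              # True: one flat token list
print(is_valid_corpus([[["1", "2"], ["1", "3"]]]))     # False: nested list as a "sentence"
print(is_valid_corpus(["1 2 3"]))                      # False: bare string, not tokenized
```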
I suggest you experiment with true natural-language training data, with the quantity of words, and variety of contrasting usages, that appear in real texts. I recommend training data that has at least a few thousand unique words where each word has 5 or more diverse usage examples.
There's actually a small such corpus bundled inside gensim, as an aid to its self-testing code and intro tutorials. It's called the 'Lee Corpus', and is an old research corpus of about 300 short news articles of a couple hundred words each. You can see an example of its use at the "Training Your Own Model" section of the gensim Word2Vec tutorial:
https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#training-your-own-model