
Understanding gensim Word2Vec most_similar results for 3 words

I construct sentences using three words, "1", "2", and "3", in different ways, and observe that the word vectors for each of these words are unchanged.

Here are the different sentence sets:

Type 1: [["1", "2"], ["1", "3"]]

Type 2: [["1", "2", "3"]]

Type 3: [["1", "2"], ["3"]]

I am training a Word2Vec model as follows (size is the pre-4.0 name of the vector_size parameter):

from gensim.models import Word2Vec

model = Word2Vec(sentences, min_count=1, size=2)
print(model.wv.most_similar("1"))
print(model.wv.most_similar("2"))
print(model.wv.most_similar("3"))
print(model.wv['1'])
print(model.wv['2'])
print(model.wv['3'])

The results are the same regardless of which sentence type I train on:

[('3', 0.5377859473228455), ('2', -0.5831003785133362)]
[('1', -0.5831003189086914), ('3', -0.9985027313232422)]
[('1', 0.5377858281135559), ('2', -0.9985026717185974)]
[-0.24893647 -0.24495095]
[ 0.19231372 -0.03319569]
[-0.22207274  0.05098101]

Also, when I rename word "1" to, say, "101", the results change:

[('3', 0.5407046675682068), ('2', -0.5859125256538391)]
[('101', -0.5859125256538391), ('3', -0.9985027313232422)]
[('101', 0.540704607963562), ('2', -0.9985026717185974)]
[-0.05898098 -0.0576357 ]
[ 0.19231372 -0.03319569]
[-0.22207274  0.05098101]

I wanted to know:

  1. Why didn't the results change when I changed the sentences?

  2. Why did the results change when I just renamed a word?

Word2Vec as an algorithm requires large, varied datasets to train word-vectors into meaningful arrangements.

You won't get sensible results, or learn much about the algorithm's behavior or benefits, with toy-sized contrived training data.

Also note:

  • Word2Vec uses random initialization and random-sampling as part of its process, so even a run on the exact same data can have different results from run-to-run. However, with a realistic amount and variety of data, each run should result in a model that's about-as-useful as any other run (even though many of the exact positions/relative-rankings may vary).

  • A nested list-of-lists, like your 1st "sentence" ([["1", "2"], ["1", "3"]]), isn't valid training input for Word2Vec. Each "sentence" should be a simple list of string-tokens (words).

I suggest you experiment with true natural-language training data, with the quantity of words, and variety of contrasting usages, that appear in real texts. I recommend training data that has at least a few thousand unique words where each word has 5 or more diverse usage examples.

There's actually a small such corpus bundled inside gensim, as an aid to its self-testing code and intro tutorials. It's called the 'Lee Corpus', and is an old research corpus of about 300 short news articles of a couple hundred words each. You can see an example of its use at the "Training Your Own Model" section of the gensim Word2Vec tutorial:

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#training-your-own-model
