
Why can't I reproduce word2vec results using gensim?

I am not able to reproduce the word2vec results using Gensim, and some of the results do not make sense. Gensim is an open-source toolkit intended for handling large text collections using efficient online algorithms; it includes a Python implementation of Google's word2vec algorithm.

I am following an online tutorial and am not able to reproduce its results. The most similar words for (positive=['woman', 'king'], negative=['man']) were supposed to be 'wenceslaus' and 'queen'. Instead, I got 'eleonore' and 'iv'. The most similar word for 'fast' was 'slow', and for 'quick' it was 'mitsumi'.

Any insights? Below are my code and results:

>>> from gensim.models import word2vec

>>> import logging

>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> sentences = word2vec.Text8Corpus('\\tmp\\text8')

>>> model = word2vec.Word2Vec(sentences, size=200)

>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=2)

out[63]: [(u'eleonore', 0.5138808...), (u'iv',0.510519325...)]

>>> model.most_similar(positive=['fast'])

Out[64]: [(u'slow', 0.48932...), (u'paced', 0.46925...)...]

>>> model.most_similar(positive=['quick'],topn=1)

out[65]: [(u'mitsumi', 0.48545..)]

Your results do make sense.

word2vec has several sources of randomness (random vector initialization, threading, etc.), so it is not strange that you don't get exactly the same results as the tutorial.
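If repeatable runs matter, the two sources of randomness named above can be pinned down. The following is only a sketch: `reproducible_settings` and `train_reproducibly` are hypothetical helper names, while `seed` and `workers` are parameters of gensim's `Word2Vec`. It assumes gensim is installed, and on Python 3 you would also need to set the `PYTHONHASHSEED` environment variable before starting the interpreter.

```python
def reproducible_settings(seed=1):
    """Keyword arguments meant to make a Word2Vec run repeatable."""
    # seed fixes the random initialization of the word vectors;
    # workers=1 removes ordering jitter from multi-threaded training.
    return {"seed": seed, "workers": 1}


def train_reproducibly(sentences, seed=1, **kwargs):
    """Hypothetical helper: train Word2Vec with a fixed seed and one thread."""
    # Imported lazily so that this sketch can be loaded without gensim.
    from gensim.models import word2vec
    settings = reproducible_settings(seed)
    return word2vec.Word2Vec(sentences, **settings, **kwargs)
```

With the question's setup this would be called as `train_reproducibly(sentences, size=200)`. Note that newer gensim releases (4.0 and later) rename `size` to `vector_size`. Even with these settings, the vectors will still differ from a tutorial that used another seed or another gensim version.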

Also, "eleonore" is the name of a princess and "iv" is a Roman numeral; both terms are related to the desired "queen". When in doubt about the results, try inspecting the text itself:

>>> import nltk
>>> with open('/tmp/text8', 'r') as f:
...     text = nltk.Text(f.read().split())
...
>>> text.concordance('eleonore')

Displaying 6 of 6 matches:
en the one eight year old princess eleonore of portugal whose dowry helped him
nglish historian one six five five eleonore gonzaga wife of ferdinand ii holy 
riage in one six zero three was to eleonore of hohenzollern born one five eigh
frederick duke of prussia and mary eleonore of kleve children of joachim frede
ive child of joachim frederick and eleonore of hohenzollern marie eleonore bor
and eleonore of hohenzollern marie eleonore born two two march one six zero se
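It also helps to keep in mind roughly how the analogy query is scored: the unit-normalized vectors of the positive words are added, those of the negative words are subtracted, and every other word is ranked by cosine similarity to the result. The sketch below reimplements that idea in plain Python on made-up three-dimensional vectors. The function names and all numbers are hypothetical and are not taken from the text8 model.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += x
    for word in negative:
        for i, x in enumerate(unit(vectors[word])):
            query[i] -= x
    query = unit(query)
    # The query words themselves are left out of the ranking.
    exclude = set(positive) | set(negative)
    scores = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors (axes loosely: royalty, female, male).
toy_vectors = {
    "king": [1.0, 0.0, 1.0],
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 0.0],
    "queen": [1.0, 1.0, 0.0],
    "eleonore": [1.0, 0.9, 0.0],
    "apple": [0.0, 0.1, 0.1],
}

ranking = rank_analogy(toy_vectors, ["woman", "king"], ["man"])
```

In this toy data "queen" and "eleonore" end up with nearly equal scores, so a small change in either vector, such as the one a different random initialization produces, would be enough to swap their order.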

If you are still dissatisfied with your results, however, here are some things you might want to try:

  1. Try multiple runs. They will all result in different vectors. (Not a smart approach, though.)
  2. Try a larger topn and observe more than just one or two similar terms. "eleonore" or "iv" might be close competitors with "queen".

     >>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=20)
     [('iii', 0.51035475730896),
      ('vii', 0.5096821188926697),
      ('frederick', 0.5058648586273193),
      ('son', 0.5021922588348389),
      ('wenceslaus', 0.500456690788269),
      ('eleonore', 0.49771684408187866),
      ('iv', 0.4948177933692932),
      ('henry', 0.49309787154197693),
      ('viii', 0.4924878478050232),
      ('sigismund', 0.49033164978027344),
      ('letsie', 0.4879177212715149),
      ('wladislaus', 0.4867924451828003),
      ('boleslaus', 0.47995251417160034),
      ('dagobert', 0.4767090082168579),
      ('corvinus', 0.476703941822052),
      ('abdicates', 0.47494029998779297),
      ('jadwiga', 0.4712049961090088),
      ('eldest', 0.4683353900909424),
      ('anjou', 0.46781229972839355),
      ('queen', 0.46647682785987854)]
  3. Try adjusting the min_count of words. This will help you remove uncommon and seemingly "noisy" words. (The default min_count is 5.)

     >>> model = word2vec.Word2Vec(sentences, size=200, min_count=30)
     >>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=20)
     [('queen', 0.5332179665565491),
      ('son', 0.5205873250961304),
      ('daughter', 0.49179190397262573),
      ('henry', 0.4898293614387512),
      ('antipope', 0.4872135818004608),
      ('eldest', 0.48199930787086487),
      ('viii', 0.47991085052490234),
      ('matilda', 0.4746955633163452),
      ('iii', 0.4663817882537842),
      ('duke', 0.46338942646980286),
      ('jadwiga', 0.4630076289176941),
      ('vii', 0.45885157585144043),
      ('aquitaine', 0.45757925510406494),
      ('vasa', 0.45703941583633423),
      ('pretender', 0.4559580683708191),
      ('reigned', 0.4528595805168152),
      ('marries', 0.4490123391151428),
      ('philip', 0.44660788774490356),
      ('anne', 0.4405106008052826),
      ('princess', 0.43850386142730713)]
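To see what min_count does to the vocabulary, the pruning step can be imitated with a plain word count: words whose total frequency is below the threshold are dropped before training. This is a minimal sketch of that idea, not gensim's own code. `prune_vocabulary` and the toy sentences are hypothetical.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count=5):
    """Return {word: count} for words seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Made-up toy corpus: "eleonore" occurs only once.
toy_sentences = [
    ["king", "queen"],
    ["king", "queen"],
    ["king", "eleonore"],
]

kept = prune_vocabulary(toy_sentences, min_count=2)
```

Here "eleonore" falls below the threshold and disappears from the vocabulary, which is consistent with it being absent from the min_count=30 results above: the concordance found only six occurrences of it in text8.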
