Why I cannot reproduce word2vec results using gensim

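I am unable to reproduce word2vec results using Gensim, and some of the results do not make sense. Gensim is an open-source toolkit designed for processing large text collections with efficient online algorithms; it includes a Python implementation of Google's word2vec algorithm.

I am following an online tutorial and cannot replicate its results. The most similar words for (positive=['woman', 'king'], negative=['man']) should be "wenceslaus" and "queen". Instead, I get "eleonore" and "iv". The most similar word for "fast" is "slow", and the most similar word for "quick" is "mitsumi".

Any insights? My code and results are below: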

>>> from gensim.models import word2vec
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> sentences = word2vec.Text8Corpus('\\tmp\\text8')
>>> model = word2vec.Word2Vec(sentences, size=200)
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=2)
Out[63]: [(u'eleonore', 0.5138808...), (u'iv', 0.510519325...)]
>>> model.most_similar(positive=['fast'])
Out[64]: [(u'slow', 0.48932...), (u'paced', 0.46925...)...]
>>> model.most_similar(positive=['quick'], topn=1)
Out[65]: [(u'mitsumi', 0.48545..)]

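Your results actually make sense.

word2vec is randomized in several ways (random vector initialization, threading, and so on), so it is not surprising that you do not get exactly the same results as the tutorial.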
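If run-to-run variation is the main concern, something along the following lines may help. This is a minimal sketch that assumes the `seed` and `workers` parameters of gensim's `Word2Vec`; `REPRODUCIBLE_KWARGS` and `train_reproducible` are names made up for this illustration. The gensim documentation also notes that on Python 3 the `PYTHONHASHSEED` environment variable has to be fixed for fully deterministic runs.

```python
# Hypothetical settings, not part of gensim: they pin down the two
# sources of randomness named above.
REPRODUCIBLE_KWARGS = {
    "seed": 1,      # fixes the random vector initialization
    "workers": 1,   # a single thread avoids scheduling-dependent ordering
}


def train_reproducible(corpus_path, size=200):
    """Train on a text8-style corpus with a fixed seed and one worker."""
    # Imported here so the settings above can be inspected without gensim.
    from gensim.models import word2vec

    sentences = word2vec.Text8Corpus(corpus_path)
    # `size` matches the gensim release used in the question; gensim 4.x
    # renamed this parameter to `vector_size`.
    return word2vec.Word2Vec(sentences, size=size, **REPRODUCIBLE_KWARGS)
```

Even with these settings the vectors will probably not match the tutorial's, since its seed and thread count are unknown. The point is only that two of your own runs become comparable, at the cost of slower single-threaded training.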

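Also, "eleonore" is the name of a princess and "iv" is a Roman numeral; both terms are related to the desired "queen". If in doubt about the results, try inspecting the text itself: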

>>> import nltk
>>> with open('/tmp/text8', 'r') as f:
...     text = nltk.Text(f.read().split())
...
>>> text.concordance('eleonore')

Displaying 6 of 6 matches:
en the one eight year old princess eleonore of portugal whose dowry helped him
nglish historian one six five five eleonore gonzaga wife of ferdinand ii holy 
riage in one six zero three was to eleonore of hohenzollern born one five eigh
frederick duke of prussia and mary eleonore of kleve children of joachim frede
ive child of joachim frederick and eleonore of hohenzollern marie eleonore bor
and eleonore of hohenzollern marie eleonore born two two march one six zero se
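The concordance reports only 6 matches, which suggests that "eleonore" is a rare word in this corpus. To check the frequency of any word without nltk, the tokens can be counted directly. This is a minimal sketch: `word_counts` is a hypothetical helper, and it assumes the corpus is plain whitespace-separated text, as in the snippet above.

```python
from collections import Counter


def word_counts(path):
    """Count whitespace-separated tokens in a text file."""
    with open(path, "r") as f:
        return Counter(f.read().split())


# Example usage (path taken from the snippet above):
# counts = word_counts('/tmp/text8')
# counts['eleonore'], counts['queen']
```

Comparing the counts of "eleonore" and "queen" shows how much training data each vector was built from.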

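However, if you are still not satisfied with the results, you may want to try the following:

  1. Try multiple runs. Each run will produce different vectors. (Not a clever approach, though.)
  2. Try a larger topn and look at more than the top one or two similar terms. "eleonore" or "iv" may be competing closely with "queen".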

     >>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=20)
     [('iii', 0.51035475730896),
      ('vii', 0.5096821188926697),
      ('frederick', 0.5058648586273193),
      ('son', 0.5021922588348389),
      ('wenceslaus', 0.500456690788269),
      ('eleonore', 0.49771684408187866),
      ('iv', 0.4948177933692932),
      ('henry', 0.49309787154197693),
      ('viii', 0.4924878478050232),
      ('sigismund', 0.49033164978027344),
      ('letsie', 0.4879177212715149),
      ('wladislaus', 0.4867924451828003),
      ('boleslaus', 0.47995251417160034),
      ('dagobert', 0.4767090082168579),
      ('corvinus', 0.476703941822052),
      ('abdicates', 0.47494029998779297),
      ('jadwiga', 0.4712049961090088),
      ('eldest', 0.4683353900909424),
      ('anjou', 0.46781229972839355),
      ('queen', 0.46647682785987854)]
  3. Try adjusting min_count. This helps remove infrequent and seemingly "noisy" words. (The default min_count is 5.)

     >>> model = word2vec.Word2Vec(sentences, size=200, min_count=30)
     >>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=20)
     [('queen', 0.5332179665565491),
      ('son', 0.5205873250961304),
      ('daughter', 0.49179190397262573),
      ('henry', 0.4898293614387512),
      ('antipope', 0.4872135818004608),
      ('eldest', 0.48199930787086487),
      ('viii', 0.47991085052490234),
      ('matilda', 0.4746955633163452),
      ('iii', 0.4663817882537842),
      ('duke', 0.46338942646980286),
      ('jadwiga', 0.4630076289176941),
      ('vii', 0.45885157585144043),
      ('aquitaine', 0.45757925510406494),
      ('vasa', 0.45703941583633423),
      ('pretender', 0.4559580683708191),
      ('reigned', 0.4528595805168152),
      ('marries', 0.4490123391151428),
      ('philip', 0.44660788774490356),
      ('anne', 0.4405106008052826),
      ('princess', 0.43850386142730713)]
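When comparing runs or min_count settings, it can be easier to look up the rank of the expected word than to scan a long list by eye. This is a minimal sketch: `rank_of` is a hypothetical helper that works on any list of (word, score) pairs such as the one most_similar returns, and the sample pairs are the first three from the min_count=30 output above.

```python
def rank_of(word, results):
    """Return the 1-based rank of `word` in (word, score) pairs, or None."""
    for rank, (candidate, _score) in enumerate(results, start=1):
        if candidate == word:
            return rank
    return None


# First three pairs of the min_count=30 output shown above.
results = [('queen', 0.5332179665565491),
           ('son', 0.5205873250961304),
           ('daughter', 0.49179190397262573)]

print(rank_of('queen', results))     # 1
print(rank_of('princess', results))  # None: not among these three pairs
```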
