How to extract word vectors from Google's pre-trained word2vec model?

Question

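The GoogleNews-vectors-negative300.bin file contains 3 million word vectors, each with 300 dimensions. I think (but am not sure) that this file is loaded when the following line is executed: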

from gensim.models.keyedvectors import KeyedVectors

I want to get the word vectors for the words in a list that I supply myself. Here is my code for doing so:

import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'ele', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'landuse', 'layer', 'leisure', 'man', 'manmade', 'market', 'marketplace', 'maxheight', 'name', 'natural', 'noexit', 'oneway', 'park', 'parking', 'pgs', 'place', 'worship', 'playground', 'police', 'police station', '', 'post', 'post box or mail', 'power', 'powerstation', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'toilets', 'tourism', 'unknown', 'vehicle', 'vending', 'vending machine', 'village', 'wall', 'waste', 'water', 'waterway', 'worship'];

model = gensim.models.KeyedVectors.load_word2vec_format(words, binary=True)

M = len(words)
count = 0
for i in range(1,M):
    wi = id2word[words[i]]
    if wi in word2vec.vocab:
        vector[:,count] = model[:,i]
        count = count+1

f = open('word_vectors.csv', 'w')
print(vector, file=f)
f.close()

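However, when I run the code, it just freezes the system. Is that because the entire binary file is loaded before my words are looked up? If so, how can I work around it? I suspected this when I got the following warning, which is why I use the warnings package to suppress it: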

c:\Python35\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

The error it gives is:

Traceback (most recent call last):
  File "word2vec.py", line 18, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format(topic, binary=True) 
  File "c:\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 196, in load_word2vec_format
    with utils.smart_open(fname) as fin:
  File "c:\Python35\lib\site-packages\smart_open\smart_open_lib.py", line 208, in smart_open
    raise TypeError('don\'t know how to handle uri %s' % repr(uri))
TypeError: don't know how to handle uri [['access'], ['aeroway'], ['airport'], ['amenity'], ['area'], ['atm'], ['barrier'], ['bay'], ['bench'], ['boundary'], ['bridge'], ['building'], ['bus'], ['cafe'], ['car'], ['coast'], ['continue'], ['created'], ['defibrillator'], ['drinking'], ['ele'], ['embankment'], ['entrance'], ['ferry'], ['foot'], ['fountain'], ['fuel'], ['gate'], ['golf'], ['gps'], ['grave'], ['highway'], ['horse'], ['hospital'], ['house'], ['landuse'], ['layer'], ['leisure'], ['man'], ['manmade'], ['market'], ['marketplace'], ['maxheight'], ['name'], ['natural'], ['noexit'], ['oneway'], ['park'], ['parking'], ['pgs'], ['place'], ['worship'], ['playground'], ['police'], ['police station'], [''], ['post'], ['post box or mail'], ['power'], ['powerstation'], ['private'], ['public'], ['railway'], ['ref'], ['residential'], ['restaurant'], ['road'], ['route'], ['school'], ['shelter'], ['shop'], ['source'], ['sport'], ['toilet'], ['toilets'], ['tourism'], ['unknown'], ['vehicle'], ['vending'], ['vending machine'], ['village'], ['wall'], ['waste'], ['water'], ['waterway'], ['worship']]

I guess this means the program cannot search for the words in the binary file. So how do I fix it?

Answer 1

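Use the following code to extract word vectors from Google's pre-trained word2vec model. Note that load_word2vec_format expects the path to the model file as its first argument; passing the list of words instead is what raised the TypeError.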

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

# Importing the class does not load the trained model.
from gensim.models.keyedvectors import KeyedVectors

# Path to the downloaded binary; adjust it to where the file lives on your machine.
path_to_model = 'GoogleNews-vectors-negative300.bin'

words = ['access', 'aeroway', 'airport']

# This call is what actually loads the model into memory.
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)

# Look up the vector for a single word.
print(model[words[0]])  # vector for 'access'

Resulting vector:

[ -8.74023438e-02  -1.86523438e-01 .. ]
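The question also asks for the vectors of a whole list to be written to a CSV file. A minimal sketch of that step follows, assuming only that the model supports `word in model` and `model[word]`, which gensim's KeyedVectors does. The helper name `export_word_vectors` and the toy dictionary standing in for the real model are illustrative, not part of gensim. Entries such as 'police station' or the empty string are simply skipped; the GoogleNews vocabulary appears to join multi-word phrases with underscores, so a spaced phrase is unlikely to be found as written.

```python
import csv

def export_word_vectors(model, words, out_path):
    """Write one CSV row per in-vocabulary word: the word, then its vector values.

    `model` is anything that supports `word in model` and `model[word]`.
    Returns the list of words that were skipped because they were not found.
    """
    missing = []
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for word in words:
            if word in model:
                writer.writerow([word] + [float(x) for x in model[word]])
            else:
                missing.append(word)
    return missing

# Toy stand-in for the real model so the sketch runs without the large download.
toy_model = {"access": [0.1, -0.2, 0.3], "airport": [0.4, 0.5, -0.6]}
skipped = export_word_vectors(
    toy_model, ["access", "police station", "", "airport"], "word_vectors.csv"
)
print(skipped)  # ['police station', '']
```

With the real model loaded as in the answer, the same call would be `export_word_vectors(model, words, 'word_vectors.csv')`.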

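The system freezes because the model is too large to load comfortably into memory. Try a machine with more memory, or limit how much of the model is loaded.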
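A rough back-of-the-envelope estimate shows why the full load is heavy. It assumes about 3 million vectors of 300 dimensions stored as 4-byte floats, and it ignores the vocabulary and any other overhead, so real memory use will be higher. The helper name is illustrative, and the 20,000-vector figure is just an example of a limited load.

```python
def estimated_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vector matrix alone, in bytes (no vocabulary overhead)."""
    return num_vectors * dimensions * bytes_per_value

full = estimated_vector_bytes(3000000, 300)    # the whole file
limited = estimated_vector_bytes(20000, 300)   # only the first 20,000 vectors

print("{:.1f} GB".format(full / 1e9))     # 3.6 GB
print("{:.1f} MB".format(limited / 1e6))  # 24.0 MB
```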

Limiting the model size at load time

model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)
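The `limit` argument reads only the first N vectors in file order, so some of the requested words may no longer be available after a limited load. The sketch below is one way to check that. The helper names `load_limited` and `coverage` are hypothetical, and the toy dictionary only imitates a limited vocabulary so the example runs without the model file. Which words survive a given limit depends on the file's ordering, which is usually by descending frequency but is not verified here.

```python
def load_limited(path_to_model, limit):
    """Load only the first `limit` vectors from a binary word2vec file."""
    # Imported lazily so the rest of this sketch runs without gensim installed.
    from gensim.models.keyedvectors import KeyedVectors
    return KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=limit)

def coverage(model, words):
    """Return (found, missing) for `words`, using only `word in model`."""
    found = [w for w in words if w in model]
    missing = [w for w in words if w not in model]
    return found, missing

# Toy stand-in for a limited model: pretend only these words made the cut.
limited_vocab = {"access": None, "airport": None, "road": None}
found, missing = coverage(limited_vocab, ["access", "aeroway", "airport", "maxheight"])
print(found)    # ['access', 'airport']
print(missing)  # ['aeroway', 'maxheight']

# With the real file (the path is wherever you downloaded it):
# model = load_limited("GoogleNews-vectors-negative300.bin", limit=20000)
# found, missing = coverage(model, words)
```

If too many words end up in `missing`, raise the limit as far as the available memory allows.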

Answer 1: score 6, accepted, 2017-06-22 09:01:44
