How to extract a word vector from the Google pre-trained model for word2vec?
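The file GoogleNews-vectors-negative300.bin contains about 3 million word vectors. I think (though I'm not sure) that this file is loaded when the following line is executed: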
from gensim.models.keyedvectors import KeyedVectors
I want to extract the vectors for the words in a list, words, that I supply myself. Here is my code for doing this:
import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models.keyedvectors import KeyedVectors
words = ['access', 'aeroway', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'ele', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'landuse', 'layer', 'leisure', 'man', 'manmade', 'market', 'marketplace', 'maxheight', 'name', 'natural', 'noexit', 'oneway', 'park', 'parking', 'pgs', 'place', 'worship', 'playground', 'police', 'police station', '', 'post', 'post box or mail', 'power', 'powerstation', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'toilets', 'tourism', 'unknown', 'vehicle', 'vending', 'vending machine', 'village', 'wall', 'waste', 'water', 'waterway', 'worship'];
model = gensim.models.KeyedVectors.load_word2vec_format(words, binary=True)
M = len(words)
count = 0
for i in range(1,M):
    wi = id2word[words[i]]
    if wi in word2vec.vocab:
        vector[:,count] = model[:,i]
        count = count+1
f = open('word_vectors.csv', 'w')
print(vector, file=f)
f.close()
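However, when I run the code, it just freezes the system. Is that because the whole binary file is loaded before the words in words are looked up? If so, how can I get around this? I suspected this after receiving the following warning, which is why I use the warnings package to suppress it: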
c:\Python35\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
The error it gives is:
Traceback (most recent call last):
File "word2vec.py", line 18, in <module>
model = gensim.models.KeyedVectors.load_word2vec_format(topic, binary=True)
File "c:\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 196, in load_word2vec_format
with utils.smart_open(fname) as fin:
File "c:\Python35\lib\site-packages\smart_open\smart_open_lib.py", line 208, in smart_open
raise TypeError('don\'t know how to handle uri %s' % repr(uri))
TypeError: don't know how to handle uri [['access'], ['aeroway'], ['airport'], ['amenity'], ['area'], ['atm'], ['barrier'], ['bay'], ['bench'], ['boundary'], ['bridge'], ['building'], ['bus'], ['cafe'], ['car'], ['coast'], ['continue'], ['created'], ['defibrillator'], ['drinking'], ['ele'], ['embankment'], ['entrance'], ['ferry'], ['foot'], ['fountain'], ['fuel'], ['gate'], ['golf'], ['gps'], ['grave'], ['highway'], ['horse'], ['hospital'], ['house'], ['landuse'], ['layer'], ['leisure'], ['man'], ['manmade'], ['market'], ['marketplace'], ['maxheight'], ['name'], ['natural'], ['noexit'], ['oneway'], ['park'], ['parking'], ['pgs'], ['place'], ['worship'], ['playground'], ['police'], ['police station'], [''], ['post'], ['post box or mail'], ['power'], ['powerstation'], ['private'], ['public'], ['railway'], ['ref'], ['residential'], ['restaurant'], ['road'], ['route'], ['school'], ['shelter'], ['shop'], ['source'], ['sport'], ['toilet'], ['toilets'], ['tourism'], ['unknown'], ['vehicle'], ['vending'], ['vending machine'], ['village'], ['wall'], ['waste'], ['water'], ['waterway'], ['worship']]
I guess this means the program cannot search for the words in the binary file. So how do I solve this?
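Use the following code to extract word vectors from Google's pre-trained word2vec model. Note that load_word2vec_format expects the path to the model file; passing the word list instead is what raises the TypeError above.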
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
# importing the class does not load the trained model
from gensim.models.keyedvectors import KeyedVectors
words = ['access', 'aeroway', 'airport']
# path to the downloaded binary; adjust it to where the file lives on your machine
path_to_model = 'GoogleNews-vectors-negative300.bin'
# this call is what actually loads the model
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
# look up a vector by the word itself, not by its position in the list
print(model[words[0]])  # vector for 'access'
Resulting vector:
[ -8.74023438e-02 -1.86523438e-01 .. ]
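To cover the original goal of saving vectors for a whole list of words, the lookup can be wrapped in a small loop that skips words the model does not know. This is only a sketch: the helper names collect_vectors, write_vectors_csv and load_google_vectors are made up for illustration, and the CSV layout (one word per row, followed by its components) is an assumption about what is wanted. Multi-word entries such as 'police station' will probably be skipped, since phrases in this model are stored joined with underscores.

```python
import csv


def collect_vectors(model, words):
    """Return {word: vector} for the words the model contains.

    `model` can be a gensim KeyedVectors object or any mapping that
    supports `word in model` and `model[word]`.
    """
    found = {}
    for word in words:
        # skip empty strings and out-of-vocabulary words
        if word and word in model:
            found[word] = model[word]
    return found


def write_vectors_csv(vectors, path):
    """Write one row per word: the word, then its vector components."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for word, vec in vectors.items():
            writer.writerow([word] + [float(x) for x in vec])


def load_google_vectors(path, limit=None):
    """Load the binary word2vec file; gensim is imported only when called."""
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)
```

With the model file on disk, usage would look like `model = load_google_vectors('GoogleNews-vectors-negative300.bin')` followed by `write_vectors_csv(collect_vectors(model, words), 'word_vectors.csv')`.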
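The system freezes because the model is too large to load comfortably into memory. Try a machine with more RAM, or limit the size of the model being loaded.
Limiting the model size when loading: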
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)
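A rough back-of-the-envelope estimate shows why limit helps. The sketch below assumes roughly 3 million words, 300 dimensions and 4-byte floats, and counts only the raw vector matrix, not the vocabulary or other overhead; estimate_vector_bytes is a hypothetical helper, not part of gensim.

```python
def estimate_vector_bytes(num_words, dims=300, bytes_per_value=4):
    """Approximate size of the raw vector matrix in bytes."""
    return num_words * dims * bytes_per_value


full = estimate_vector_bytes(3000000)   # whole model (assumed word count)
limited = estimate_vector_bytes(20000)  # with limit=20000

print("full model:  {:.1f} GB".format(full / 1e9))      # 3.6 GB
print("limit=20000: {:.0f} MB".format(limited / 1e6))   # 24 MB
```

Keep in mind that limit reads only the first N vectors from the file, so less common words from your list may be missing from a truncated model.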