How to extract a word vector from the Google pre-trained model for word2vec?
The file GoogleNews-vectors-negative300.bin contains 3 million word vectors, each with 300 dimensions. I think (not sure) this file is loaded when the following line is written:
from gensim.models.keyedvectors import KeyedVectors
I want to extract the vectors for words that I supply in a list called words. This is my code to do this:
import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models.keyedvectors import KeyedVectors
words = ['access', 'aeroway', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'ele', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'landuse', 'layer', 'leisure', 'man', 'manmade', 'market', 'marketplace', 'maxheight', 'name', 'natural', 'noexit', 'oneway', 'park', 'parking', 'pgs', 'place', 'worship', 'playground', 'police', 'police station', '', 'post', 'post box or mail', 'power', 'powerstation', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'toilets', 'tourism', 'unknown', 'vehicle', 'vending', 'vending machine', 'village', 'wall', 'waste', 'water', 'waterway', 'worship'];
model = gensim.models.KeyedVectors.load_word2vec_format(words, binary=True)
M = len(words)
count = 0
for i in range(1,M):
    wi = id2word[words[i]]
    if wi in word2vec.vocab:
        vector[:,count] = model[:,i]
        count = count+1
f = open('word_vectors.csv', 'w')
print(vector, file=f)
f.close()
But when I run the code, it just freezes up my system. Is it because it loads the whole binary file before searching for the words in words? If so, how do I get around this issue? I suspect this because I get the following warning, which is why I use the warnings package to suppress it:
c:\Python35\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
And the error it gives is:
Traceback (most recent call last):
File "word2vec.py", line 18, in <module>
model = gensim.models.KeyedVectors.load_word2vec_format(topic, binary=True)
File "c:\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 196, in load_word2vec_format
with utils.smart_open(fname) as fin:
File "c:\Python35\lib\site-packages\smart_open\smart_open_lib.py", line 208, in smart_open
raise TypeError('don\'t know how to handle uri %s' % repr(uri))
TypeError: don't know how to handle uri [['access'], ['aeroway'], ['airport'], ['amenity'], ['area'], ['atm'], ['barrier'], ['bay'], ['bench'], ['boundary'], ['bridge'], ['building'], ['bus'], ['cafe'], ['car'], ['coast'], ['continue'], ['created'], ['defibrillator'], ['drinking'], ['ele'], ['embankment'], ['entrance'], ['ferry'], ['foot'], ['fountain'], ['fuel'], ['gate'], ['golf'], ['gps'], ['grave'], ['highway'], ['horse'], ['hospital'], ['house'], ['landuse'], ['layer'], ['leisure'], ['man'], ['manmade'], ['market'], ['marketplace'], ['maxheight'], ['name'], ['natural'], ['noexit'], ['oneway'], ['park'], ['parking'], ['pgs'], ['place'], ['worship'], ['playground'], ['police'], ['police station'], [''], ['post'], ['post box or mail'], ['power'], ['powerstation'], ['private'], ['public'], ['railway'], ['ref'], ['residential'], ['restaurant'], ['road'], ['route'], ['school'], ['shelter'], ['shop'], ['source'], ['sport'], ['toilet'], ['toilets'], ['tourism'], ['unknown'], ['vehicle'], ['vending'], ['vending machine'], ['village'], ['wall'], ['waste'], ['water'], ['waterway'], ['worship']]
I guess this means that the program is not able to search for the words in the binary file. So, how do I solve it?
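The first argument to load_word2vec_format must be the path to the model file, not your list of words; passing the list is what raises the TypeError. Use the following code to extract a word vector from the Google pre-trained word2vec model: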
import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
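# this import alone does not load the trained model
from gensim.models.keyedvectors import KeyedVectors
words = ['access', 'aeroway', 'airport']
# path to the downloaded binary file (adjust to wherever you saved it)
path_to_model = 'GoogleNews-vectors-negative300.bin'
# this is how you load the model
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
# look up the vector for a single word
print(model[words[0]])  # vector for 'access'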
Result vector:
[ -8.74023438e-02 -1.86523438e-01 .. ]
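To get vectors for the whole list, as the question asks, one option is to loop over the words, skip any that are missing from the vocabulary, and write one CSV row per word. This is only a minimal sketch: the helper names collect_vectors and write_vectors_csv are hypothetical, and a plain dict with made-up 3-dimensional vectors stands in for the loaded model so the example runs without the large file. It assumes the model object supports `word in model` and `model[word]`, which gensim's KeyedVectors does.

```python
import csv

def collect_vectors(kv, words):
    """Return {word: vector} for the words that kv contains.

    kv is anything supporting `word in kv` and `kv[word]`, such as a
    loaded gensim KeyedVectors object or, for testing, a plain dict.
    """
    found = {}
    for word in words:
        # Skip empty strings and any entry the model has no vector for.
        if word and word in kv:
            found[word] = kv[word]
    return found

def write_vectors_csv(path, vectors):
    """Write one row per word: the word followed by its vector components."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for word, vec in vectors.items():
            writer.writerow([word] + [float(x) for x in vec])

# Tiny stand-in for the real model (hypothetical values, not real embeddings).
fake_model = {
    'access': [0.1, 0.2, 0.3],
    'airport': [0.4, 0.5, 0.6],
}
words = ['access', 'aeroway', 'airport', '', 'police station']
vectors = collect_vectors(fake_model, words)
write_vectors_csv('word_vectors.csv', vectors)
print(sorted(vectors))
```

With the real model, the same call would be collect_vectors(model, words) after the load_word2vec_format line above.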
Your system is freezing because of the large size of the model. Try using a system with more memory, or limit the size of the model you are loading with the limit parameter, which reads only the first N vectors from the file.
Limit model size while loading
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)
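A rough back-of-the-envelope estimate shows why the limit helps. The sketch below assumes the file holds about 3 million vectors of 300 dimensions stored as 4-byte floats, and it counts only the raw vector matrix, not the vocabulary or any other overhead. The function name matrix_bytes is hypothetical.

```python
def matrix_bytes(n_vectors, dim, bytes_per_value=4):
    """Size in bytes of a dense n_vectors x dim matrix of fixed-width values."""
    return n_vectors * dim * bytes_per_value

# Assumed sizes: ~3 million vectors in the full file, 300 dimensions each.
full = matrix_bytes(3_000_000, 300)
limited = matrix_bytes(20_000, 300)

print(f"full model:  {full / 1e9:.1f} GB")
print(f"limit=20000: {limited / 1e6:.1f} MB")
```

Under these assumptions the full matrix needs about 3.6 GB, while the first 20,000 vectors need about 24 MB.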