How do I extract word vectors from Google's pre-trained word2vec model?

Question

The file GoogleNews-vectors-negative300.bin contains 3 million word vectors, each with 300 dimensions. I think (but am not sure) that this file is loaded when the following line is executed:

from gensim.models.keyedvectors import KeyedVectors

I want to extract the vectors for the words that I supply in a list called words. This is my code:

import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'ele', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'landuse', 'layer', 'leisure', 'man', 'manmade', 'market', 'marketplace', 'maxheight', 'name', 'natural', 'noexit', 'oneway', 'park', 'parking', 'pgs', 'place', 'worship', 'playground', 'police', 'police station', '', 'post', 'post box or mail', 'power', 'powerstation', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'toilets', 'tourism', 'unknown', 'vehicle', 'vending', 'vending machine', 'village', 'wall', 'waste', 'water', 'waterway', 'worship'];

model = gensim.models.KeyedVectors.load_word2vec_format(words, binary=True)

M = len(words)
count = 0
for i in range(1,M):
    wi = id2word[words[i]]
    if wi in word2vec.vocab:
        vector[:,count] = model[:,i]
        count = count+1

f = open('word_vectors.csv', 'w')
print(vector, file=f)
f.close()

But when I run the code, it just freezes my system. Is that because it loads the whole binary file before searching for the words in words? If so, how do I get around this? I suspect this because I get the following warning, which I suppress with the warnings package:

c:\Python35\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

And the error it gives is:

Traceback (most recent call last):
  File "word2vec.py", line 18, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format(topic, binary=True) 
  File "c:\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 196, in load_word2vec_format
    with utils.smart_open(fname) as fin:
  File "c:\Python35\lib\site-packages\smart_open\smart_open_lib.py", line 208, in smart_open
    raise TypeError('don\'t know how to handle uri %s' % repr(uri))
TypeError: don't know how to handle uri [['access'], ['aeroway'], ['airport'], ['amenity'], ['area'], ['atm'], ['barrier'], ['bay'], ['bench'], ['boundary'], ['bridge'], ['building'], ['bus'], ['cafe'], ['car'], ['coast'], ['continue'], ['created'], ['defibrillator'], ['drinking'], ['ele'], ['embankment'], ['entrance'], ['ferry'], ['foot'], ['fountain'], ['fuel'], ['gate'], ['golf'], ['gps'], ['grave'], ['highway'], ['horse'], ['hospital'], ['house'], ['landuse'], ['layer'], ['leisure'], ['man'], ['manmade'], ['market'], ['marketplace'], ['maxheight'], ['name'], ['natural'], ['noexit'], ['oneway'], ['park'], ['parking'], ['pgs'], ['place'], ['worship'], ['playground'], ['police'], ['police station'], [''], ['post'], ['post box or mail'], ['power'], ['powerstation'], ['private'], ['public'], ['railway'], ['ref'], ['residential'], ['restaurant'], ['road'], ['route'], ['school'], ['shelter'], ['shop'], ['source'], ['sport'], ['toilet'], ['toilets'], ['tourism'], ['unknown'], ['vehicle'], ['vending'], ['vending machine'], ['village'], ['wall'], ['waste'], ['water'], ['waterway'], ['worship']]

I guess this means that the program is not able to search for the words in the binary file. So how do I solve it?

Answer 1

Use the following code to extract a word vector from Google's pre-trained word2vec model:

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

# this import only makes the class available; it doesn't load the trained model
from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport']

# path to the downloaded binary file; adjust it to the file's location on your system
path_to_model = 'GoogleNews-vectors-negative300.bin'

# this is how you load the model: pass the file path, not the list of words
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)

# to extract a word vector
print(model[words[0]])  # vector for 'access'

Result vector:

[ -8.74023438e-02  -1.86523438e-01 .. ]

Your system is freezing because of the large size of the model, which is loaded into memory in full. Try using a system with more memory, or limit the size of the model you are loading.
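To get a feel for the memory involved, here is a rough back-of-the-envelope sketch. It assumes about 3 million vectors in the file, 300 dimensions per vector (the "negative300" in the file name), and 4 bytes per value (32-bit floats). These are assumptions for illustration; the real footprint is higher because the vocabulary itself also takes memory.

```python
# Rough estimate of the raw vector matrix size. All constants are assumptions.
BYTES_PER_FLOAT32 = 4   # assumption: vectors are stored as 32-bit floats
VECTOR_SIZE = 300       # dimensions per vector
FULL_VOCAB = 3000000    # assumption: about 3 million entries in the file


def vector_memory_gb(n_vectors, vector_size=VECTOR_SIZE):
    """Approximate size of the vector matrix in gigabytes (10**9 bytes)."""
    return n_vectors * vector_size * BYTES_PER_FLOAT32 / 1e9


full_gb = vector_memory_gb(FULL_VOCAB)
limited_gb = vector_memory_gb(20000)
print("full model:  {:.2f} GB".format(full_gb))
print("limit=20000: {:.3f} GB".format(limited_gb))
```

Under these assumptions the full matrix needs several gigabytes, while the first 20000 vectors need only a few tens of megabytes, which is why limiting the load helps on a machine with little RAM.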

Limit the model size while loading (limit sets the maximum number of word vectors read from the file):

model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)
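Once the model is loaded, the original goal (vectors for a whole list of words, saved to a CSV file) could be sketched roughly as below. The helper names collect_vectors and write_vectors_csv are hypothetical, and the toy_model dictionary with made-up 3-dimensional vectors is only a stand-in so the snippet runs without the multi-gigabyte file. The helpers rely only on the `word in model` and `model[word]` operations, which the loaded KeyedVectors object also supports.

```python
import csv
import io


def collect_vectors(model, words):
    """Return {word: vector} for the words that the model knows about."""
    found = {}
    for word in words:
        # Skip empty strings and words that are missing from the vocabulary.
        if word and word in model:
            found[word] = model[word]
    return found


def write_vectors_csv(vectors, fileobj):
    """Write one row per word: the word followed by its vector components."""
    writer = csv.writer(fileobj)
    for word, vector in vectors.items():
        writer.writerow([word] + [float(x) for x in vector])


# Stand-in for the real model: made-up vectors, NOT real word2vec data.
# With the real file you would instead use the loaded model from above:
#   model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)
toy_model = {
    'access': [0.1, 0.2, 0.3],
    'airport': [0.4, 0.5, 0.6],
}

words = ['access', 'aeroway', 'airport', '', 'police station']
vectors = collect_vectors(toy_model, words)

# In real use, replace the buffer with open('word_vectors.csv', 'w', newline='').
buffer = io.StringIO()
write_vectors_csv(vectors, buffer)
print(buffer.getvalue())
```

Words that are not in the vocabulary are skipped silently here. With a small limit, less frequent words from the list may be missing, and multi-word entries such as 'police station' are unlikely to match as written, so it is worth checking which words were actually found.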

Answer 1: score 6, accepted, 2017-06-22 09:01:44
