如何从word2vec的Google预训练模型中提取单词向量？

Question

GoogleNews-vectors-negative300.bin文件包含3亿个单词向量。 我认为（不确定）在编写以下行时已加载此文件：

from gensim.models.keyedvectors import KeyedVectors

我想下载我从外部提供的words列表中的words vectors。 这是我执行此操作的代码：

import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'ele', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'landuse', 'layer', 'leisure', 'man', 'manmade', 'market', 'marketplace', 'maxheight', 'name', 'natural', 'noexit', 'oneway', 'park', 'parking', 'pgs', 'place', 'worship', 'playground', 'police', 'police station', '', 'post', 'post box or mail', 'power', 'powerstation', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'toilets', 'tourism', 'unknown', 'vehicle', 'vending', 'vending machine', 'village', 'wall', 'waste', 'water', 'waterway', 'worship'];

model = gensim.models.KeyedVectors.load_word2vec_format(words, binary=True)

M = len(words)
count = 0
for i in range(1,M):
    wi = id2word[words[i]]
    if wi in word2vec.vocab:
        vector[:,count] = model[:,i]
        count = count+1

f = open('word_vectors.csv', 'w')
print(vector, file=f)
f.close()

但是，当我运行代码时，它只会冻结系统。 是否因为在搜索单词中的words之前加载了整个二进制文件？ 如果是，我该如何解决这个问题？ 我收到以下警告时就想到了这一点，这就是为什么我使用warning包来禁止它的原因：

c:\Python35\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

它给出的错误是：

Traceback (most recent call last):
  File "word2vec.py", line 18, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format(topic, binary=True) 
  File "c:\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 196, in load_word2vec_format
    with utils.smart_open(fname) as fin:
  File "c:\Python35\lib\site-packages\smart_open\smart_open_lib.py", line 208, in smart_open
    raise TypeError('don\'t know how to handle uri %s' % repr(uri))
TypeError: don't know how to handle uri [['access'], ['aeroway'], ['airport'], ['amenity'], ['area'], ['atm'], ['barrier'], ['bay'], ['bench'], ['boundary'], ['bridge'], ['building'], ['bus'], ['cafe'], ['car'], ['coast'], ['continue'], ['created'], ['defibrillator'], ['drinking'], ['ele'], ['embankment'], ['entrance'], ['ferry'], ['foot'], ['fountain'], ['fuel'], ['gate'], ['golf'], ['gps'], ['grave'], ['highway'], ['horse'], ['hospital'], ['house'], ['landuse'], ['layer'], ['leisure'], ['man'], ['manmade'], ['market'], ['marketplace'], ['maxheight'], ['name'], ['natural'], ['noexit'], ['oneway'], ['park'], ['parking'], ['pgs'], ['place'], ['worship'], ['playground'], ['police'], ['police station'], [''], ['post'], ['post box or mail'], ['power'], ['powerstation'], ['private'], ['public'], ['railway'], ['ref'], ['residential'], ['restaurant'], ['road'], ['route'], ['school'], ['shelter'], ['shop'], ['source'], ['sport'], ['toilet'], ['toilets'], ['tourism'], ['unknown'], ['vehicle'], ['vending'], ['vending machine'], ['village'], ['wall'], ['waste'], ['water'], ['waterway'], ['worship']]

我猜这意味着程序无法在二进制文件中搜索单词。 那么，如何解决呢？

Answer 1

使用以下代码从经过Google训练的word2vec模型中提取单词向量：

import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

# this line doesn't load the trained model 
from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport']

# this is how you load the model
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)

# to extract word vector
print(model[words[0]])  #access

结果向量：

[ -8.74023438e-02  -1.86523438e-01 .. ]

由于模型过大，系统处于冻结状态。 尝试使用内存更大的系统，或者可以限制正在加载的模型的大小。

加载时限制模型大小

model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)

如何从word2vec的Google预训练模型中提取单词向量？

问题描述

1 个解决方案

解决方案1
6 已采纳 2017-06-22 09:01:44

如何从word2vec的Google预训练模型中提取单词向量？

问题描述

1 个解决方案

解决方案1 6 已采纳 2017-06-22 09:01:44

解决方案1
6 已采纳 2017-06-22 09:01:44