繁体   English   中英

如何从word2vec的Google预训练模型中提取单词向量?

[英]How to extract a word vector from the Google pre-trained model for word2vec?

GoogleNews-vectors-negative300.bin文件包含3亿个单词向量。 我认为(不确定)在编写以下行时已加载此文件:

from gensim.models.keyedvectors import KeyedVectors

我想下载我从外部提供的words列表中的words vectors。 这是我执行此操作的代码:

import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'ele', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'landuse', 'layer', 'leisure', 'man', 'manmade', 'market', 'marketplace', 'maxheight', 'name', 'natural', 'noexit', 'oneway', 'park', 'parking', 'pgs', 'place', 'worship', 'playground', 'police', 'police station', '', 'post', 'post box or mail', 'power', 'powerstation', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'toilets', 'tourism', 'unknown', 'vehicle', 'vending', 'vending machine', 'village', 'wall', 'waste', 'water', 'waterway', 'worship'];

model = gensim.models.KeyedVectors.load_word2vec_format(words, binary=True)

M = len(words)
count = 0
for i in range(1,M):
    wi = id2word[words[i]]
    if wi in word2vec.vocab:
        vector[:,count] = model[:,i]
        count = count+1

f = open('word_vectors.csv', 'w')
print(vector, file=f)
f.close()

但是,当我运行代码时,它只会冻结系统。 是否因为在搜索单词中的words之前加载了整个二进制文件? 如果是,我该如何解决这个问题? 我收到以下警告时就想到了这一点,这就是为什么我使用warning包来禁止它的原因:

c:\Python35\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

它给出的错误是:

Traceback (most recent call last):
  File "word2vec.py", line 18, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format(topic, binary=True) 
  File "c:\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 196, in load_word2vec_format
    with utils.smart_open(fname) as fin:
  File "c:\Python35\lib\site-packages\smart_open\smart_open_lib.py", line 208, in smart_open
    raise TypeError('don\'t know how to handle uri %s' % repr(uri))
TypeError: don't know how to handle uri [['access'], ['aeroway'], ['airport'], ['amenity'], ['area'], ['atm'], ['barrier'], ['bay'], ['bench'], ['boundary'], ['bridge'], ['building'], ['bus'], ['cafe'], ['car'], ['coast'], ['continue'], ['created'], ['defibrillator'], ['drinking'], ['ele'], ['embankment'], ['entrance'], ['ferry'], ['foot'], ['fountain'], ['fuel'], ['gate'], ['golf'], ['gps'], ['grave'], ['highway'], ['horse'], ['hospital'], ['house'], ['landuse'], ['layer'], ['leisure'], ['man'], ['manmade'], ['market'], ['marketplace'], ['maxheight'], ['name'], ['natural'], ['noexit'], ['oneway'], ['park'], ['parking'], ['pgs'], ['place'], ['worship'], ['playground'], ['police'], ['police station'], [''], ['post'], ['post box or mail'], ['power'], ['powerstation'], ['private'], ['public'], ['railway'], ['ref'], ['residential'], ['restaurant'], ['road'], ['route'], ['school'], ['shelter'], ['shop'], ['source'], ['sport'], ['toilet'], ['toilets'], ['tourism'], ['unknown'], ['vehicle'], ['vending'], ['vending machine'], ['village'], ['wall'], ['waste'], ['water'], ['waterway'], ['worship']]

我猜这意味着程序无法在二进制文件中搜索单词。 那么,如何解决呢?

使用以下代码从经过Google训练的word2vec模型中提取单词向量:

import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

# this line doesn't load the trained model 
from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport']

# this is how you load the model
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)

# to extract word vector
print(model[words[0]])  #access

结果向量:

[ -8.74023438e-02  -1.86523438e-01 .. ]

由于模型过大,系统处于冻结状态。 尝试使用内存更大的系统,或者可以限制正在加载的模型的大小。

加载时限制模型大小

model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM