How to extract a word vector from the Google pre-trained model for word2vec?

The file GoogleNews-vectors-negative300.bin contains 3 million word vectors, each with 300 dimensions. I think (not sure) this file is loaded when the following line is written:

from gensim.models.keyedvectors import KeyedVectors

I want to extract the vectors for words that I supply in a list called words. This is my code to do this:

import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'ele', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'landuse', 'layer', 'leisure', 'man', 'manmade', 'market', 'marketplace', 'maxheight', 'name', 'natural', 'noexit', 'oneway', 'park', 'parking', 'pgs', 'place', 'worship', 'playground', 'police', 'police station', '', 'post', 'post box or mail', 'power', 'powerstation', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'toilets', 'tourism', 'unknown', 'vehicle', 'vending', 'vending machine', 'village', 'wall', 'waste', 'water', 'waterway', 'worship'];

model = gensim.models.KeyedVectors.load_word2vec_format(words, binary=True)

M = len(words)
count = 0
for i in range(1,M):
    wi = id2word[words[i]]
    if wi in word2vec.vocab:
        vector[:,count] = model[:,i]
        count = count+1

f = open('word_vectors.csv', 'w')
print(vector, file=f)
f.close()

But when I run the code, it just freezes up my system. Is it because it loads the whole binary file before searching for the words in words? If yes, how do I get around this issue? I suspect this because I get the following warning, which I suppress with the warnings module:

c:\Python35\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

And the error it gives is:

Traceback (most recent call last):
  File "word2vec.py", line 18, in <module>
    model = gensim.models.KeyedVectors.load_word2vec_format(topic, binary=True) 
  File "c:\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 196, in load_word2vec_format
    with utils.smart_open(fname) as fin:
  File "c:\Python35\lib\site-packages\smart_open\smart_open_lib.py", line 208, in smart_open
    raise TypeError('don\'t know how to handle uri %s' % repr(uri))
TypeError: don't know how to handle uri [['access'], ['aeroway'], ['airport'], ['amenity'], ['area'], ['atm'], ['barrier'], ['bay'], ['bench'], ['boundary'], ['bridge'], ['building'], ['bus'], ['cafe'], ['car'], ['coast'], ['continue'], ['created'], ['defibrillator'], ['drinking'], ['ele'], ['embankment'], ['entrance'], ['ferry'], ['foot'], ['fountain'], ['fuel'], ['gate'], ['golf'], ['gps'], ['grave'], ['highway'], ['horse'], ['hospital'], ['house'], ['landuse'], ['layer'], ['leisure'], ['man'], ['manmade'], ['market'], ['marketplace'], ['maxheight'], ['name'], ['natural'], ['noexit'], ['oneway'], ['park'], ['parking'], ['pgs'], ['place'], ['worship'], ['playground'], ['police'], ['police station'], [''], ['post'], ['post box or mail'], ['power'], ['powerstation'], ['private'], ['public'], ['railway'], ['ref'], ['residential'], ['restaurant'], ['road'], ['route'], ['school'], ['shelter'], ['shop'], ['source'], ['sport'], ['toilet'], ['toilets'], ['tourism'], ['unknown'], ['vehicle'], ['vending'], ['vending machine'], ['village'], ['wall'], ['waste'], ['water'], ['waterway'], ['worship']]

I guess this means that the program is not able to search for the words in the binary file. So, how do I solve it?

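The TypeError occurs because load_word2vec_format expects the path to the model file as its first argument, and the list of words was passed instead. Use the following code to extract a word vector from the Google pre-trained word2vec model: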

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

# this import only makes the class available; it doesn't load the trained model
from gensim.models.keyedvectors import KeyedVectors

words = ['access', 'aeroway', 'airport']

# path to the downloaded binary file (adjust to where you saved it)
path_to_model = 'GoogleNews-vectors-negative300.bin'

# this is how you load the model: pass the file path, not the word list
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)

# to extract a word vector
print(model[words[0]])  # vector for 'access'

Result vector:

[ -8.74023438e-02  -1.86523438e-01 .. ]
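
The question asks for the vectors of a whole list, written to a CSV file. A minimal sketch of that step follows, assuming the loaded model supports `word in model` and `model[word]` the way a gensim KeyedVectors object does. The helper names `collect_vectors` and `write_vectors_csv` are hypothetical, and a plain dict stands in for the real model so the example runs without the large file.

```python
import csv
import io


def collect_vectors(model, words):
    """Return {word: vector} for the words the model knows (hypothetical helper)."""
    found = {}
    for word in words:
        # Skip empty strings and out-of-vocabulary words instead of raising KeyError.
        if word and word in model:
            found[word] = model[word]
    return found


def write_vectors_csv(vectors, out_file):
    """Write one row per word: the word, then its vector components."""
    writer = csv.writer(out_file)
    for word, vec in vectors.items():
        writer.writerow([word] + [float(x) for x in vec])


# Stand-in for the loaded model (assumption: the real object is a KeyedVectors).
toy_model = {'access': [0.1, 0.2, 0.3], 'airport': [0.4, 0.5, 0.6]}
words = ['access', 'aeroway', 'airport', '', 'police station']

vectors = collect_vectors(toy_model, words)

buffer = io.StringIO()
write_vectors_csv(vectors, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With the real model, pass `model` instead of `toy_model` and a file opened with `open('word_vectors.csv', 'w', newline='')` instead of the in-memory buffer. Multi-word entries such as 'police station' will probably not be found as written, since the GoogleNews vocabulary appears to join phrases with underscores.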

Your system is freezing because the model is very large. Try a machine with more memory, or limit the number of vectors you load.

Limit model size while loading

model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)
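
To see why the limit helps, here is a rough back-of-the-envelope estimate of the memory taken by the raw vectors alone. It assumes 4-byte (32-bit) floats and 300 dimensions, and it ignores the vocabulary and any other overhead, so real usage will be somewhat higher. The function name `vectors_size_bytes` is hypothetical.

```python
def vectors_size_bytes(num_words, dims=300, bytes_per_float=4):
    """Approximate size of the raw vector matrix (assumes 32-bit floats)."""
    return num_words * dims * bytes_per_float


full_bytes = vectors_size_bytes(3000000)   # the whole file: about 3 million vectors
limited_bytes = vectors_size_bytes(20000)  # with limit=20000

print('full model:    {:.1f} GB'.format(full_bytes / 1e9))
print('limit=20000:   {:.0f} MB'.format(limited_bytes / 1e6))
```

Note that limit reads only the first vectors in the file. If the file is ordered by word frequency, as word2vec output usually is, less common words from your list (for example 'defibrillator') may be missing when the limit is small, so check membership with `word in model` before looking a word up.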
