Renaming gensim Word2Vec words using a mapping

Question

I want to rename the words in my gensim Word2Vec model according to a mapping.

Example

My current model has the word 'foo', which maps to a vector:

>>> model['foo']
[1.0 0.0]

I have the mapping: d = {'foo': 'bar', ...}

How can I rebuild the model with this new mapping so that the same vector is returned under the new name?

>>> model['bar']  # in place of 'foo'
[1.0 0.0]

Answer 1

One solution is to save the model in the C-based word2vec format (as text) and use awk to replace each original word with its new name from the mapping.

Assume we have a mapping file of the form:

$ cat map.txt
foo:bar
...
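
If the mapping needs to go back into Python, a file in this old:new layout can be read into a dict like the d from the question. This is only a sketch: load_mapping is a hypothetical helper name, and it assumes one pair per line with the first ':' acting as the separator.

```python
def load_mapping(path, sep=":"):
    """Read 'old<sep>new' lines into a dict, skipping blank lines."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            if sep not in line:
                # Fail loudly rather than silently dropping a malformed entry.
                raise ValueError("no %r separator in line: %r" % (sep, line))
            # Split on the first separator only, so the new name may contain it.
            old, new = line.split(sep, 1)
            mapping[old] = new
    return mapping
```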

We can recreate the model via:

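import shlex
import subprocess as sp

from gensim.models import Word2Vec

# Export the vectors as text: a "<vocab_size> <dimensions>" header line,
# then one "<word> <v1> <v2> ..." line per word.
model.save_word2vec_format('embeddings.txt', binary=False)

# First file (map.txt): build the lookup a[old] = new.
# Second file (embeddings.txt): copy the header, then swap the first field.
# A word missing from map.txt ends up with an empty name, so the map
# has to cover the whole vocabulary.
CMD = r"""
awk -F'[ ]|:' 'FNR==NR {a[$1]=$2; next} FNR==1{print $0} FNR!=1{$1=a[$1]; print $0}' map.txt embeddings.txt
"""

with open('new_embeddings.txt', 'w') as f:
    # check_call blocks until awk exits, so the file is complete before it is read back.
    sp.check_call(shlex.split(CMD), stdout=f)

new_model = Word2Vec.load_word2vec_format('new_embeddings.txt')

new_model.create_binary_tree()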
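
If awk is not available, the same rewrite of the text file can be done in plain Python. This is a minimal sketch rather than part of the original answer: rename_words is a hypothetical helper, it assumes the text layout written with binary=False (a header line, then one word followed by its vector per line), and unlike the awk command it leaves words that are missing from the mapping unchanged.

```python
def rename_words(src_path, dst_path, mapping):
    """Copy a text-format word2vec file, renaming words via `mapping`.

    Words absent from `mapping` keep their original name.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # The first line is the "<vocab_size> <dimensions>" header; copy it unchanged.
        dst.write(src.readline())
        for line in src:
            # Split off the word only; the vector text is passed through untouched.
            word, sep, vector = line.partition(" ")
            dst.write(mapping.get(word, word) + sep + vector)
```

Calling rename_words('embeddings.txt', 'new_embeddings.txt', d) would then stand in for the awk step, with the resulting file loaded the same way as before.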

As an aside, my mapping was actually an array: I was training on the index of each word in some array arr. I created the map file using numpy:

import numpy as np

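# '%s' for both columns: np.c_ stacks the indices and arr into a single array,
# which has a string dtype when arr holds words, and '%d' cannot format strings.
np.savetxt('map.txt', np.c_[np.arange(arr.size), arr], fmt='%s:%s')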

Answer 1: accepted, 2016-12-03 04:05:44
