Load word vectors from Gensim to SpaCy Vectors class
As the title says, I would like to load custom word vectors built with gensim into the SpaCy Vectors class.
I have found several other questions where folks have successfully loaded vectors into the nlp object itself, but in my current project I would like to have a separate Vectors object.
Specifically, I am using BioWordVec to generate my word vectors, which serializes the vectors using methods from gensim.models.FastText.
On the gensim end I am doing:

model.wv.save_word2vec_format(output/bin/path, binary=True)
model.save(path/to/model)
On the SpaCy side, I have tried:

- the from_disk or from_bytes methods to load the word vectors
- the from_glove method, which expects a vocab.txt file and a binary file (I already have a binary file)

Link to Vectors Documentation
Just for reference, here is my code to test the load process:
import spacy
from spacy.vectors import Vectors
vecs = Vectors()
path = '/home/medmison690/pyprojects/BioWordVec/pubmed_mesh_test.bin'
dir_path = '/home/medmison690/Desktop/tuned_vecs'
vecs.from_disk(dir_path)
print(vecs.shape)
I have tried various combinations of from_disk and from_bytes with no success. Any help or advice would be greatly appreciated!
It's unfortunate the Spacy docs don't clearly state what formats are used by its various reading functions, nor implement an import that's clearly based on the format written by the original Google word2vec.c code.
It seems from_disk expects things in Spacy's own multi-file format. The from_bytes might expect a raw version of the vectors. Neither would be useful for data saved from gensim's FastText model.
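For instance, the multi-file layout that from_disk does understand is the one written by spaCy's own Vectors.to_disk. A minimal round-trip sketch (the tiny vectors and the temporary directory here are just for illustration):

```python
import tempfile

import numpy as np
from spacy.vectors import Vectors

# Write a tiny Vectors object in spaCy's own multi-file format,
# then read it back with from_disk.
v = Vectors(shape=(2, 4))
v.add("test", vector=np.zeros((4,), dtype="f"))

with tempfile.TemporaryDirectory() as d:
    v.to_disk(d)        # writes spaCy's own on-disk layout
    v2 = Vectors()
    v2.from_disk(d)     # succeeds because the layout matches

print(v2.shape)  # (2, 4)
```

A gensim word2vec-format binary is not in this layout, which is why pointing from_disk at it fails.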
The from_glove might in fact be a compatible format. You could try using the save_word2vec_format() method with its optional fvocab argument (to specify a vocab.txt file with words), binary=True, and a filename according to Spacy's conventions. For example, if you have 300-dimensional vectors:
ft_model.wv.save_word2vec_format('vectors.300.f.bin', fvocab='vocab.txt', binary=True)
Then, see if that directory works for Spacy's from_glove. (I'm not sure it will.)
Alternatively, you could possibly use a gensim utility class (such as its KeyedVectors) to load the vectors into memory, then manually add each vector, one-by-one, into a pre-allocated Spacy Vectors object.
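A rough sketch of that one-by-one approach. Random arrays stand in here for the vectors you would actually pull out of gensim, and the words are hypothetical:

```python
import numpy as np
from spacy.vectors import Vectors

# With real data you would load the gensim vectors first, e.g.:
#   from gensim.models import KeyedVectors
#   kv = KeyedVectors.load_word2vec_format('vectors.300.f.bin', binary=True)
# and iterate over its vocabulary instead of this made-up one.
words = ["cell", "gene", "protein"]                  # hypothetical vocab
data = np.random.rand(len(words), 300).astype("f")   # one row per word

# Pre-allocate the Spacy Vectors object, then add each vector.
vecs = Vectors(shape=(len(words), 300))
for i, word in enumerate(words):
    vecs.add(word, vector=data[i])

print(vecs.shape)  # (3, 300)
```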
Note that by saving FastText vectors to the plain, vectors-only word2vec_format, you'll be losing anything the model learned about subwords (which is what FastText-capable models use to synthesize vectors for out-of-vocabulary words).