
Load word vectors from Gensim to SpaCy Vectors class

As the title says, I would like to load custom word vectors built with gensim into the SpaCy Vectors class.

I have found several other questions where folks have successfully loaded vectors into the nlp object itself, but in my current project I would like to have a separate Vectors object.

Specifically, I am using BioWordVec to generate my word vectors, which serializes the vectors using methods from gensim.models.FastText.

On the gensim end I am (see the sketch after this list):

  • calling model.wv.save_word2vec_format(output/bin/path, binary=True)
  • saving the model -> model.save(path/to/model)
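
A minimal sketch of those two gensim calls, assuming the BioWordVec binary is loaded with gensim's Facebook-format loader (available in gensim 3.8+); the input filename is hypothetical and the output paths are the placeholders from above:

from gensim.models.fasttext import load_facebook_model

# Hypothetical BioWordVec file name; load the released fastText binary into gensim
model = load_facebook_model('biowordvec_pubmed_mesh.bin')

model.wv.save_word2vec_format('output/bin/path', binary=True)  # vectors-only .bin
model.save('path/to/model')                                    # full gensim model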

On the SpaCy side:

  • I can use either the from_disk or from_bytes method to load the word vectors
  • there is also a from_glove method that expects a vocab.txt file and a binary file (I already have the binary file)

Link to Vectors Documentation

Just for reference, here is my code to test the load process:

import spacy
from spacy.vectors import Vectors

# An empty Vectors object to load into
vecs = Vectors()
path = '/home/medmison690/pyprojects/BioWordVec/pubmed_mesh_test.bin'
dir_path = '/home/medmison690/Desktop/tuned_vecs'

# Attempt to load the saved vectors from disk
vecs.from_disk(dir_path)

print(vecs.shape)

I have tried various combinations of from_disk and from_bytes with no success. Any help or advice would be greatly appreciated!

It's unfortunate the Spacy docs don't clearly state what formats are used by their various reading functions, nor implement an import that's clearly based on the format written by the original Google word2vec.c code.

It seems that from_disk expects things in Spacy's own multi-file format. The from_bytes method might expect a raw version of the vectors. Neither would be useful for data saved from gensim's FastText model.
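
To make that concrete, here is a minimal sketch (the toy data and the /tmp path are made up; it assumes the spaCy 2.x Vectors API): from_disk reads back the directory layout that Vectors.to_disk writes, not a word2vec-format .bin file.

import numpy
from spacy.vectors import Vectors

# Build a tiny table and write it in spaCy's own multi-file format
demo = Vectors(data=numpy.zeros((3, 300), dtype='float32'), keys=['a', 'b', 'c'])
demo.to_disk('/tmp/spacy_vecs')          # writes the 'vectors' array plus 'key2row'

# from_disk can only read back what to_disk produced
loaded = Vectors()
loaded.from_disk('/tmp/spacy_vecs')
print(loaded.shape)                      # (3, 300)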

The from_glove method might in fact expect a compatible format. You could try using the save_word2vec_format() method with its optional fvocab argument (to specify a vocab.txt file with the words), binary=True, and a filename following Spacy's conventions. For example, if you have 300-dimensional vectors:

ft_model.wv.save_word2vec_format('vectors.300.f.bin', fvocab='vocab.txt', binary=True)

Then, see if that directory works for Spacy's from_glove. (I'm not sure it will.)
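
If from_glove is available in your spaCy version (it existed in spaCy 2.x and was later removed), the call might look something like this sketch, assuming vocab.txt and vectors.300.f.bin sit together in one directory:

from spacy.vectors import Vectors

vecs = Vectors()
# Directory containing vocab.txt and vectors.300.f.bin (names per spaCy's convention)
vecs.from_glove('/home/medmison690/Desktop/tuned_vecs')
print(vecs.shape)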

Alternatively, you could possibly use a gensim utility class (such as its KeyedVectors) to load the vectors into memory, then manually add each vector, one by one, into a pre-allocated Spacy Vectors object.
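
A minimal sketch of that manual approach, assuming the vectors were saved with save_word2vec_format as above (note that gensim 3.x exposes the vocabulary as index2word, while gensim 4.x uses index_to_key; this sketch uses the 4.x names):

from gensim.models import KeyedVectors
from spacy.vectors import Vectors

kv = KeyedVectors.load_word2vec_format('vectors.300.f.bin', binary=True)

# Pre-allocate a table of the right size, then copy each vector over by key
vecs = Vectors(shape=(len(kv.index_to_key), kv.vector_size))
for word in kv.index_to_key:
    vecs.add(word, vector=kv[word])

print(vecs.shape)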

Note that by saving FastText vectors to the plain, vectors-only word2vec_format, you'll be losing anything the model learned about subwords (which is what FastText-capable models use to synthesize vectors for out-of-vocabulary words).
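
For illustration, a rough sketch of the difference (the lookup word is hypothetical, and the out-of-vocabulary behavior assumes gensim's FastText implementation with overlapping n-grams):

from gensim.models import FastText, KeyedVectors

full = FastText.load('path/to/model')   # full model saved with model.save(); keeps subword n-grams
wv_only = KeyedVectors.load_word2vec_format('vectors.300.f.bin', binary=True)

full.wv['unseenbiomedterm']    # can be synthesized from character n-grams, even for OOV words
wv_only['unseenbiomedterm']    # raises KeyError: the word is not in the saved table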
