As the title says, I would like to load custom word vectors built with gensim into spaCy's Vectors class. I have found several other questions where folks have successfully loaded vectors into the nlp object itself, but in my current project I would like to have a separate Vectors object. Specifically, I am using BioWordVec to generate my word vectors, which serializes them using methods from gensim.models.FastText.
On the gensim end I am calling:

model.wv.save_word2vec_format('output/bin/path', binary=True)
model.save('path/to/model')
On the spaCy side, the Vectors class provides from_disk and from_bytes methods to load word vectors, as well as a from_glove method that expects a vocab.txt file and a binary file (and I already have a binary file). Link to the Vectors documentation, for reference.
Just for reference, here is my code to test the load process:
import spacy
from spacy.vectors import Vectors
vecs = Vectors()
path = '/home/medmison690/pyprojects/BioWordVec/pubmed_mesh_test.bin'
dir_path = '/home/medmison690/Desktop/tuned_vecs'
vecs.from_disk(dir_path)
print(vecs.shape)
I have tried various combinations of from_disk and from_bytes with no success. Any help or advice would be greatly appreciated!
It's unfortunate that the spaCy docs don't clearly state what formats its various reading functions expect, nor implement an import that's clearly based on the format written by the original Google word2vec.c code.
It seems from_disk expects things in spaCy's own multi-file format, and from_bytes might expect a raw version of the vectors. Neither would be useful for data saved from gensim's FastText model.
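To see what from_disk does expect, a minimal round trip through spaCy's own to_disk shows the multi-file layout it reads back. This is a sketch with made-up words and dimensions, not the asker's actual data:

```python
import tempfile
import numpy as np
from spacy.vectors import Vectors

# Build a tiny Vectors table with invented placeholder keys.
src = Vectors(data=np.random.rand(2, 4).astype("float32"),
              keys=["protein", "gene"])

with tempfile.TemporaryDirectory() as d:
    # to_disk writes spaCy's own multi-file format into the directory;
    # from_disk expects exactly that layout.
    src.to_disk(d)
    loaded = Vectors()
    loaded.from_disk(d)
    print(loaded.shape)  # (2, 4)
```

So from_disk only works on a directory that spaCy itself (or something mimicking it) wrote, which is why pointing it at a gensim-saved file fails.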
The from_glove method might in fact expect a compatible format. You could try using the save_word2vec_format() method with its optional fvocab argument (to specify a vocab.txt file with words), binary=True, and a filename following spaCy's conventions. For example, if you have 300-dimensional vectors:
ft_model.wv.save_word2vec_format('vectors.300.f.bin', fvocab='vocab.txt', binary=True)
Then, see if that directory works with spaCy's from_glove. (I'm not sure it will.)
Alternatively, you could possibly use a gensim utility class (such as its KeyedVectors) to load the vectors into memory, then manually add each vector, one by one, into a pre-allocated spaCy Vectors object.
Note that by saving FastText vectors to the plain, vectors-only word2vec_format, you'll lose everything the model learned about subwords (which is what FastText-capable models use to synthesize vectors for out-of-vocabulary words).