
Load word vectors from Gensim to SpaCy Vectors class

As the title says, I would like to load custom word vectors built with gensim into the SpaCy Vectors class.

I have found several other questions where folks have successfully loaded vectors to the nlp object itself, but I have a current project where I would like to have a separate Vectors object.

Specifically, I am using BioWordVec to generate my word vectors, which serializes them using methods from gensim.models.FastText.

On the gensim end I am:

  • calling model.wv.save_word2vec_format(output/bin/path, binary=True)
  • saving the model -> model.save(path/to/model) (a minimal sketch of both calls is shown right after this list)
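
For context, here is a minimal sketch of those two calls on the gensim side. The paths are placeholders, and the model is assumed to be an already-trained gensim FastText model:

from gensim.models import FastText

# load the previously trained FastText model (placeholder path)
model = FastText.load('path/to/trained/model')

# export just the word vectors in the binary word2vec format
model.wv.save_word2vec_format('pubmed_mesh_test.bin', binary=True)

# save the full model as well (this keeps the subword n-gram weights)
model.save('path/to/model')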

On the SpaCy side:

  • I can either use the from_disk or from_bytes methods to load the word vectors
  • there is also a from_glove method that expects a vocab.txt file and a binary file (I already have a binary file)

Link to Vectors Documentation

Just for reference, here is my code to test the load process:

import spacy
from spacy.vectors import Vectors

# paths to the exported binary vectors and to the directory with the saved vectors
path = '/home/medmison690/pyprojects/BioWordVec/pubmed_mesh_test.bin'
dir_path = '/home/medmison690/Desktop/tuned_vecs'

# create an empty Vectors object and try to load from disk
vecs = Vectors()
vecs.from_disk(dir_path)

print(vecs.shape)

I have tried various combinations of from_disk and from_bytes with no success. Any help or advice would be greatly appreciated!

It's unfortunate that the Spacy docs don't clearly state what formats its various reading functions expect, and that Spacy doesn't implement an importer clearly based on the format written by the original Google word2vec.c code.

It seems that from_disk expects things in Spacy's own multi-file format, and from_bytes might expect a raw version of the vectors. Neither would be useful for data saved from gensim's FastText model.
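
For comparison, from_disk does round-trip a directory written by Spacy's own Vectors.to_disk. Here is a minimal sketch with made-up toy data, assuming the spaCy v2.x API:

import numpy
from spacy.vectors import Vectors

# toy data: 3 "words" with 5-dimensional vectors
data = numpy.random.uniform(-1, 1, (3, 5)).astype('float32')
keys = ['apple', 'banana', 'cherry']

vecs = Vectors(data=data, keys=keys)
vecs.to_disk('/tmp/spacy_vecs')      # writes Spacy's own multi-file format

loaded = Vectors()
loaded.from_disk('/tmp/spacy_vecs')  # works for this format, not for gensim's output
print(loaded.shape)                  # (3, 5)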

The from_glove method might in fact read a compatible format. You could try using the save_word2vec_format() method with its optional fvocab argument (to write a vocab.txt file listing the words), binary=True, and a filename that follows Spacy's naming convention. For example, if you have 300-dimensional vectors:

ft_model.wv.save_word2vec_format('vectors.300.f.bin', fvocab='vocab.txt', binary=True)

Then, see if the directory containing those files works with Spacy's from_glove. (I'm not sure it will.)
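
If you want to test that, the attempt might look like the sketch below. This assumes spaCy v2.x, and the directory path is a placeholder that must contain vocab.txt plus vectors.300.f.bin:

from spacy.vectors import Vectors

# directory holding vocab.txt and vectors.300.f.bin (placeholder path)
vecs = Vectors()
vecs.from_glove('/path/to/dir_with_vectors')
print(vecs.shape)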

Alternatively, you could possibly use a gensim utility class (such as its KeyedVectors ) to load the vectors into memory, then manually add each vector, one-by-one, into a pre-allocated Spacy Vectors object.
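
A sketch of that approach, assuming spaCy v2.x and gensim 3.x (where the vocabulary is exposed as kv.index2word; in gensim 4.x it is kv.index_to_key):

from gensim.models import KeyedVectors
from spacy.vectors import Vectors

# load the exported binary word2vec-format vectors with gensim
kv = KeyedVectors.load_word2vec_format(
    '/home/medmison690/pyprojects/BioWordVec/pubmed_mesh_test.bin', binary=True)

# pre-allocate a Spacy Vectors object with one row per word
vecs = Vectors(shape=(len(kv.index2word), kv.vector_size))

# copy each word's vector across, one by one
for word in kv.index2word:
    vecs.add(word, vector=kv[word])

print(vecs.shape)

(If I recall correctly, the Vectors class stores keys as hash values internally, so later lookups typically go through the string store, e.g. nlp.vocab.strings.)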

Note that by saving FastText vectors to the plain, vectors-only word2vec_format, you'll be losing anything the model learned about subwords (which is what FastText-capable models use to synthesize vectors for out-of-vocabulary words).
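
To illustrate the difference, here is a sketch where the paths and the example word are placeholders, assuming the full model was also kept via model.save():

from gensim.models import FastText, KeyedVectors

# full model, saved with model.save(): keeps the subword n-gram weights
full_model = FastText.load('path/to/model')

# plain word2vec-format export: word vectors only
plain_vecs = KeyedVectors.load_word2vec_format('pubmed_mesh_test.bin', binary=True)

oov_word = 'angiogenesis-related'   # hypothetical word missing from the vocabulary

vec = full_model.wv[oov_word]       # synthesized from subword n-grams (if any n-gram is known)
try:
    plain_vecs[oov_word]
except KeyError:
    print('the vectors-only export cannot handle out-of-vocabulary words')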
