简体   繁体   中英

How to load pre-trained glove model with gensim load_word2vec_format?

I am trying to load a pre-trained glove as a word2vec model in gensim. I have downloaded the glove file from here . I am using the following script:

from gensim import models
model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)

but get the following error

ValueError                                Traceback (most recent call last)
<ipython-input-38-e0b48b51f433> in <module>()
      1 from gensim import models
----> 2 model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)

2 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/utils_any2vec.py in <genexpr>(.0)
    171     with utils.smart_open(fname) as fin:
    172         header = utils.to_unicode(fin.readline(), encoding=encoding)
--> 173         vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
    174         if limit:
    175             vocab_size = min(vocab_size, limit)

ValueError: invalid literal for int() with base 10: 'the'

What is the underlying problem? Does gensim need a specific format to be able to load it?

The GLoVe format is slightly different – missing a 1st-line declaration of vector-count & dimensions – than the format that load_word2vec_format() supports.

There's a glove2word2vec utility script included you can run once to convert the file:

https://radimrehurek.com/gensim/scripts/glove2word2vec.html

Also, starting in Gensim 4.0.0 (currentlyu in prerelease testing), the load_word2vec_format() method gets a new optional no_header parameter:

https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=load_word2vec_format#gensim.models.keyedvectors.KeyedVectors.load_word2vec_format

If set as no_header=True , the method will deduce the count/dimensions from a preliminary scan of the file - so it can read a GLoVe file with that option – but at the cost of two full-file reads instead of one. (So, you may still want to re-save the object with .save_word2vec_format() , or use the glove2word2vec script, to make future loads faster.)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM