I've produced GloVe vectors from my own corpus using the code provided by https://github.com/stanfordnlp/GloVe/blob/master/demo.sh, so I have both the .bin and the .txt vector files. I'm trying to import these files into gensim so I can work with them the way I can with word2vec vectors.
I first tried loading both the binary file and the text file like this, but only ended up getting a pickling error:
models = gensim.models.Word2Vec.load(file)
I've also tried ignoring Unicode errors, which didn't work. I still got the UnicodeDecodeError.
model = gensim.models.KeyedVectors.load_word2vec_format(file, binary=True, unicode_errors='ignore')
This is what I have for my code right now:
from gensim.models import KeyedVectors
import gensim
from gensim.models import word2vec
file = 'vectors.bin'
model = KeyedVectors.load_word2vec_format(file, binary=True, unicode_errors='ignore')
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
This is the error message I keep getting:
Traceback (most recent call last):
File "glove_to_word2vec.py", line 6, in <module>
model = KeyedVectors.load_word2vec_format(file, binary=True) # C binary format
File "/home/users/epair/.local/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 1498, in load_word2vec_format
limit=limit, datatype=datatype)
File "/home/users/epair/.local/lib/python3.6/site-packages/gensim/models/utils_any2vec.py", line 343, in _load_word2vec_format
header = utils.to_unicode(fin.readline(), encoding=encoding)
File "/home/users/epair/.local/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte
The pickling error was something like this: Unpickling Error while using Word2Vec.load()
There's no expectation a plain .load() would work; that will only work with gensim's own models, saved with the matching .save() method.
However, .load_word2vec_format() should work with files in the right format.
Are you sure the file is in a compatible format? (Does it load into the original Google word2vec.c sibling tools, like the distance or word-analogy executables?)
You mentioned having the .txt format as well; have you tried loading that file (with binary=False)?
Looking at line 343 of utils_any2vec.py (in a version of gensim you're likely using), that appears to be reading the very first line of the file, which should only have 2 plain space-separated numbers on it: the number of words, and the number of dimensions. (That is, encoding issues with regard to your actual word-tokens shouldn't even be involved.)
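As a rough illustration of that first-line check, here is a minimal Python sketch using only the standard library. The helper names looks_like_word2vec_header and read_first_line are hypothetical (they are not part of gensim), and the path vectors.txt is assumed from the question.

```python
import re


def looks_like_word2vec_header(first_line):
    """Return True if the line is exactly two space-separated integers,
    e.g. "400000 50" (number of words, number of dimensions)."""
    return re.fullmatch(r"[0-9]+ [0-9]+", first_line.strip()) is not None


def read_first_line(path, encoding="utf-8"):
    """Read only the first line of a text file.

    errors="replace" is used so that this diagnostic itself does not fail
    on bytes that are not valid in the given encoding.
    """
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return f.readline()


# Hypothetical usage, assuming the file from the question exists:
#   print(looks_like_word2vec_header(read_first_line("vectors.txt")))
```

If the check comes back False for the text file, the first line is probably already a word followed by its vector values rather than a header.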
If you look at your file with head -1 vectors.txt, is that all you see? (If not, your GloVe code isn't writing the right compatible format.)
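If the text file turns out to have no header line (which, as far as I know, is how the GloVe demo writes vectors.txt by default, though that is an assumption about your setup), one possible workaround is to write a copy with the expected "word-count dimensions" line prepended. The sketch below uses only the standard library; the function names count_vectors and prepend_header and the output file name are hypothetical.

```python
def count_vectors(path, encoding="utf-8"):
    """Return (number of rows, vector dimensions) for a header-less text file
    where each line is: word value1 value2 ... valueN."""
    rows, dims = 0, 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            if rows == 0:
                # The first token is the word; the rest are vector values.
                dims = len(line.split()) - 1
            rows += 1
    return rows, dims


def prepend_header(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path with a "rows dims" header line on top."""
    rows, dims = count_vectors(src_path, encoding)
    with open(src_path, "r", encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        dst.write(f"{rows} {dims}\n")
        for line in src:
            if line.strip():
                dst.write(line)
    return rows, dims


# Hypothetical usage, assuming the file from the question exists:
#   prepend_header("vectors.txt", "vectors.w2v.txt")
```

The converted file could then be tried with KeyedVectors.load_word2vec_format('vectors.w2v.txt', binary=False). gensim also ships a gensim.scripts.glove2word2vec helper for this kind of conversion, and gensim 4.x accepts no_header=True in load_word2vec_format; check which of these your installed version supports.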