I trained a FastText model for French using the Gensim library. Suddenly, the trained model no longer loads into memory.
I am using the code below:
from gensim.models import FastText
fname = "filename"
model = FastText.load(fname)
and it throws the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1070, in load
model = super(FastText, cls).load(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 1244, in load
model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 603, in load
return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
File "/usr/local/lib/python3.7/site-packages/gensim/utils.py", line 426, in load
obj = unpickle(fname)
File "/usr/local/lib/python3.7/site-packages/gensim/utils.py", line 1384, in unpickle
return _pickle.load(f, encoding='latin1')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 14072054: invalid start byte
As this model was trained on a large dataset, is there any way to recover/load it?
This error means that some text stored in your model does not adhere to the UTF-8 encoding, as explained here.
A workaround for an already trained model is to set the unicode_errors flag when loading it:
from gensim.models import FastText
fname = "filename"
model = FastText.load(fname, unicode_errors='ignore')
This will, however, skip the offending words/characters, which may not be ideal.
A better fix would be to re-train the model with a UTF-8-compliant setup, at the cost of re-training on the full dataset.
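To see what that flag trades away, here is a minimal stdlib-only sketch of Python's decode error handlers, which is the mechanism the unicode_errors setting maps onto (the byte string below is an invented example, not data from your model):

```python
# Bytes that are not valid UTF-8 (e.g. latin-1 accented characters)
# raise UnicodeDecodeError under the default 'strict' handler.
raw = b"caf\xe9 \x86 fran\xe7ais"  # hypothetical latin-1 bytes, invalid as UTF-8

try:
    raw.decode("utf-8")  # 'strict' is the default -- raises
except UnicodeDecodeError as e:
    print("strict decoding failed:", e.reason)

# 'ignore' silently drops the undecodable bytes (data loss),
# 'replace' substitutes U+FFFD so you can at least see where they were.
print(raw.decode("utf-8", errors="ignore"))
print(raw.decode("utf-8", errors="replace"))
```

With errors='ignore' the accented characters simply disappear from the decoded string, which is exactly why some words may come out mangled or missing after loading the model this way.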