I trained a FastText model for French using the Gensim library. Suddenly, the trained model no longer loads into memory.
I am using the code below:
from gensim.models import FastText
fname = "filename"
model = FastText.load(fname)
and it throws the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1070, in load
model = super(FastText, cls).load(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 1244, in load
model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 603, in load
return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
File "/usr/local/lib/python3.7/site-packages/gensim/utils.py", line 426, in load
obj = unpickle(fname)
File "/usr/local/lib/python3.7/site-packages/gensim/utils.py", line 1384, in unpickle
return _pickle.load(f, encoding='latin1')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 14072054: invalid start byte
As this model was trained on a large dataset, is there any way to recover/load it?
This error means that some text stored in your model does not adhere to the UTF-8 encoding, as explained here.
A workaround for an already trained model is to set the unicode_errors flag when loading it:
from gensim.models import FastText
fname = "filename"
model = FastText.load(fname, unicode_errors='ignore')
This will, however, skip the offending words/characters, which may not be ideal.
A better fix would be to re-train the model with a UTF-8-compliant setup, at the cost of re-training on the full dataset.
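To see what that flag trades away, here is a minimal stdlib-only sketch of Python's decode error handlers, which is the mechanism the unicode_errors setting maps onto (the byte string below is an invented example, not data from your model):

```python
# Bytes that are not valid UTF-8 (e.g. latin-1 accented characters)
# raise UnicodeDecodeError under the default 'strict' handler.
raw = b"caf\xe9 \x86 fran\xe7ais"  # hypothetical latin-1 bytes, invalid as UTF-8

try:
    raw.decode("utf-8")  # 'strict' is the default -- raises
except UnicodeDecodeError as e:
    print("strict decoding failed:", e.reason)

# 'ignore' silently drops the undecodable bytes (data loss),
# 'replace' substitutes U+FFFD so you can at least see where they were.
print(raw.decode("utf-8", errors="ignore"))
print(raw.decode("utf-8", errors="replace"))
```

With errors='ignore' the accented characters simply disappear from the decoded string, which is exactly why some words may come out mangled or missing after loading the model this way.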