Doc2Vec model Python 3 compatibility

I trained a doc2vec model with Python 2 and I would like to use it in Python 3.

When I try to load it in Python 3, I get:

from gensim.models.doc2vec import Doc2Vec
Doc2Vec.load('my_doc2vec.pkl')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 0: ordinal not in range(128)

It seems to be a pickle compatibility issue, which I tried to solve by doing:

import pickle

# Unpickle the original model directly, then re-save it through gensim
with open('my_doc2vec.pkl', 'rb') as inf:
    data = pickle.load(inf)
data.save('my_doc2vec_python3.pkl')

Gensim saved additional files alongside the model, which I renamed as well so that they are found when calling:

de = Doc2Vec.load('my_doc2vec_python3.pkl')

load() no longer fails with UnicodeDecodeError, but inference with the loaded model then gives meaningless results.

I can't easily retrain the model with Gensim in Python 3: I used it to create derived data, so I would have to re-run a long and complex pipeline.

How can I make the doc2vec model compatible with Python 3?

Answering my own question, this answer worked for me.

Here are the steps in a bit more detail:

  1. Download the gensim source code, e.g. clone it from the repository.
  2. In gensim/utils.py, edit the unpickle function so that the pickle call passes the encoding parameter:

     return _pickle.loads(f.read(), encoding='latin1')
  3. Using Python 3 and the modified gensim, load the model:

     de = Doc2Vec.load('my_doc2vec.pkl')
  4. Save it:

     de.save('my_doc2vec_python3.pkl')

The re-saved model should now be loadable in Python 3 with an unmodified gensim.
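To illustrate why the encoding parameter matters, here is a minimal sketch using only the standard pickle module, with no gensim involved. PY2_PICKLE is a hand-written protocol 2 pickle of the Python 2 byte string '\xb0abc', and load_py2_pickle is a hypothetical helper name. Both are assumptions made for illustration and are not part of gensim. With the default encoding, Python 3 tries to decode such strings as ASCII and fails with the same error as in the question. With 'latin1' every byte maps to exactly one code point, so the raw byte values survive.

```python
import pickle

# Hand-written protocol 2 pickle of the Python 2 byte string '\xb0abc'
# (illustrative data, standing in for a real Python 2 model file).
PY2_PICKLE = b'\x80\x02U\x04\xb0abcq\x00.'


def load_py2_pickle(data, encoding='latin1'):
    """Unpickle Python 2 data in Python 3 (hypothetical helper).

    'latin1' maps each byte 0-255 to one code point, so no byte is rejected.
    """
    return pickle.loads(data, encoding=encoding)


# The default encoding is ASCII, which rejects the byte 0xb0.
try:
    pickle.loads(PY2_PICKLE)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# With latin1 the same bytes load, and the byte values are preserved.
value = load_py2_pickle(PY2_PICKLE)
round_trip = value.encode('latin1')
```

The one-line change to gensim/utils.py in step 2 applies the same argument to gensim's own unpickling call. Once the model has been re-saved from Python 3, the patch is no longer needed.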
