I want to save an sklearn Pipeline (a custom preprocessing step plus a RandomForestClassifier) to disk, with all of its dependencies bundled inside the saved file. Without this, I have to copy all the dependencies (custom modules) into the same folder everywhere I want to call this model (in my case, on a remote server).
The preprocessor is defined in a class that lives in another file (preprocessing.py) in the same folder of my project, so I access it through an import.
training.py
from preprocessing import Preprocessor
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import pickle
clf = Pipeline([
    ("preprocessing", Preprocessor()),
    ("model", RandomForestClassifier())
])

# some fitting of the classifier
# ...

# Export
savepath = "model.pkl"  # wherever the model should be written
with open(savepath, "wb") as handle:
    pickle.dump(clf, handle, protocol=pickle.HIGHEST_PROTOCOL)
I tried pickle (and some of its variants), dill, and joblib, but none of them worked: when I load the .pkl somewhere else (say, on my remote server), I must have an identical preprocessing.py in place... which is a pain.
What I would love is to have another file somewhere else:
remote.py
import pickle
with open(savepath, "rb") as handle:
    model = pickle.load(handle)
print(model.predict(some_matrix))
But this code currently gives me an error because it cannot find the Preprocessor class...
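For reference, cloudpickle (version 2.0+) offers register_pickle_by_value, which embeds a module's code in the payload instead of a reference, so the remote side no longer needs preprocessing.py. A sketch of the idea, using a throwaway in-memory module as a stand-in for preprocessing.py (the module and class names here are hypothetical):

```python
import sys
import types
import pickle
import cloudpickle

# hypothetical stand-in for preprocessing.py: a module created on the fly
pre = types.ModuleType("preprocessing_demo")

class Preprocessor:
    def transform(self, X):
        return X

Preprocessor.__module__ = "preprocessing_demo"
pre.Preprocessor = Preprocessor
sys.modules["preprocessing_demo"] = pre

# embed this module's classes by value in the payload, not by reference
cloudpickle.register_pickle_by_value(pre)
payload = cloudpickle.dumps(pre.Preprocessor())

# simulate the remote server: the module is gone, yet loading still works
del sys.modules["preprocessing_demo"]
obj = pickle.loads(payload)
print(obj.transform([1, 2, 3]))  # → [1, 2, 3]
```

In the asker's setup this would amount to `import preprocessing; cloudpickle.register_pickle_by_value(preprocessing)` before dumping the fitted Pipeline. Note that cloudpickle itself must still be installed on the loading side.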
I'm facing an identical issue right now. To address it, I am going to turn my pipeline/model, along with all of its dependencies (the preprocessing classes), into a Python package using setuptools, so that it is self-contained and can be run anywhere (remote server, Docker container, VM).
I'm currently going through this process; if this is something you are interested in, I can respond with the additional steps spelled out as I make progress.
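As a rough sketch of that approach (all names here are placeholders, not the asker's actual project layout): put preprocessing.py inside a package directory, write a minimal setup.py, build a wheel or sdist, and `pip install` it on the remote machine.

```python
# setup.py -- hypothetical sketch; package name and dependencies are placeholders
from setuptools import setup, find_packages

setup(
    name="my_model_pkg",        # placeholder package name
    version="0.1.0",
    packages=find_packages(),   # picks up my_model_pkg/, which contains preprocessing.py
    install_requires=[
        "scikit-learn",         # runtime dependencies of the pipeline
    ],
)
```

One caveat: the pickle stores the module path as it was at dump time, so training must also import the class from the installed package (e.g. `from my_model_pkg.preprocessing import Preprocessor`), not from a loose local file; otherwise the reference inside the pickle will still point at a module the server does not have.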