
mlflow: How to save a sklearn pipeline with a custom transformer?

I am trying to use mlflow to save a sklearn machine-learning model, a pipeline containing a custom transformer I have defined, and load it in another project. My custom transformer inherits from BaseEstimator and TransformerMixin.

Let's say I have 2 projects:

  • train_project: contains the custom transformers in src.ml.transformers.py
  • use_project: has other things in src, or has no src directory at all

So in my train_project I do:

mlflow.sklearn.log_model(preprocess_pipe, 'model/preprocess_pipe')

and then when I try to load it in use_project:

preprocess_pipe = mlflow.sklearn.load_model(f'{ref_model_path}/preprocess_pipe')

an error occurs:

[...]
File "/home/quentin/anaconda3/envs/api_env/lib/python3.7/site-packages/mlflow/sklearn.py", line 210, in _load_model_from_local_file
    return pickle.load(f)
ModuleNotFoundError: No module named 'train_project'

I tried using the cloudpickle serialization format, mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE:

mlflow.sklearn.log_model(preprocess_pipe, 'model/preprocess_pipe', serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE)

but I get the same error during loading.

I saw the code_path option of mlflow.pyfunc.log_model, but its use and purpose are not clear to me.

I thought mlflow provided an easy way to save and serialize models so they can be used anywhere. Is that true only if you have native sklearn models (or keras, ...)?

It seems that this issue is more related to how pickle works (mlflow uses it, and pickle needs all module dependencies to be importable at load time).

The only solution I have found so far is to turn my transformer into a package and import it in both projects, save the version of my transformer library with the conda_env argument of log_model, and check that the same version is installed when I load the model in use_project. But it's painful if I have to change my transformer or debug inside it...
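A minimal sketch of that workaround, for illustration only: the package name my_transformers and its version are hypothetical placeholders for wherever the custom transformer would live, and the mlflow call is commented out since it needs an active run.

```python
# Sketch of the packaging workaround: pin the custom-transformer package
# version in the logged environment so train_project and use_project
# deserialize with the same code. "my_transformers" is a made-up name.
conda_env = {
    "name": "train_env",
    "channels": ["defaults"],
    "dependencies": [
        "python=3.7.3",
        "pip",
        {"pip": ["mlflow==1.5.0", "scikit-learn", "my_transformers==0.1.0"]},
    ],
}

# In train_project (inside an active mlflow run):
# mlflow.sklearn.log_model(preprocess_pipe, "model/preprocess_pipe",
#                          conda_env=conda_env)
```

At load time in use_project, the version recorded in the logged conda.yaml could then be compared against the locally installed my_transformers version before calling load_model.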

Does anybody have a better solution? Something more elegant? Maybe there is some mlflow functionality I have missed?

Other information:
working on Linux (Ubuntu)
mlflow=1.5.0
python=3.7.3

I saw in the tests of the mlflow.sklearn API that they do test a custom transformer, but they load it in the same file, so it does not seem to solve my issue. Maybe it can help other people, though:

https://github.com/mlflow/mlflow/blob/master/tests/sklearn/test_sklearn_model_export.py

You can use the code_path parameter to save Python file dependencies (or directories containing file dependencies). These files are prepended to the system path when the model is loaded. The model folder will contain a code directory that includes all these files.
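A minimal sketch of this, assuming the transformer module lives at src/ml/transformers.py as in the question; the mlflow call is shown commented out because it needs an active run and a PythonModel wrapper (here a hypothetical variable named wrapper).

```python
# Sketch: ship the custom-transformer source alongside the model via
# code_path, so the pickled object can be re-imported in use_project.
# The path assumes train_project's layout from the question.
code_paths = ["src/ml/transformers.py"]  # or ["src"] to ship the directory

# In train_project (inside an active mlflow run):
# import mlflow.pyfunc
# mlflow.pyfunc.log_model(
#     "model/preprocess_pipe",
#     python_model=wrapper,     # a PythonModel wrapping the pipeline
#     code_path=code_paths,
# )
#
# At load time, mlflow copies these files into the model's "code"
# directory and prepends it to sys.path, so unpickling can find the
# transformer's module even though use_project has no such source tree.
```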

What you are trying to do is serialize something "customized" that you've trained in a module outside of train.py, correct?

What you probably need to do is log your model with mlflow.pyfunc.log_model using the code_path argument, which takes a list of strings containing the paths to the modules you will need to deserialize the model and make predictions, as documented here.

What needs to be clear is that every mlflow model is a PyFunc by nature. Even when you log a model with mlflow.sklearn, you can load it with mlflow.pyfunc.load_model. What a PyFunc does is standardize all models and frameworks in a uniform way, which guarantees you always declare how to:

  1. de-serialize your model, with the load_context() method
  2. make your predictions, with the predict() method

If you take care of both things in an object that inherits from mlflow's PythonModel class, you can then log your model as a PyFunc.
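A minimal sketch of such a wrapper, assuming the pickled pipeline is logged as an artifact named "preprocess_pipe" (that artifact name is an assumption, and the try/except fallback only lets the sketch run without mlflow installed):

```python
import pickle

try:
    from mlflow.pyfunc import PythonModel
except ImportError:  # fallback so this sketch runs even without mlflow
    PythonModel = object

class PreprocessWrapper(PythonModel):
    """PyFunc wrapper declaring how to de-serialize and predict."""

    def load_context(self, context):
        # De-serialization step: load the pickled pipeline artifact.
        with open(context.artifacts["preprocess_pipe"], "rb") as f:
            self.pipe = pickle.load(f)

    def predict(self, context, model_input):
        # Prediction step: here, applying the preprocessing pipeline.
        return self.pipe.transform(model_input)
```

It would then be logged with mlflow.pyfunc.log_model, passing python_model=PreprocessWrapper(), an artifacts dict mapping "preprocess_pipe" to the local pickle file, and code_path pointing at the transformer's source so it travels with the model.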

What mlflow.sklearn.log_model does is basically wrap up the way you declare serialization and de-serialization. If you stick to sklearn's basic modules, such as built-in transformers and pipelines, you'll always be fine with it. But when you need something custom, you turn to PyFuncs instead.

You can find a very useful example here. Notice it states exactly how to make the predictions, transforming the input into an XGBoost DMatrix.
