How to properly pickle an sklearn pipeline when using a custom transformer
I am trying to pickle an sklearn machine-learning model and load it in another project. The model is wrapped in a pipeline that does feature encoding, scaling, etc. The problem starts when I want to use self-written transformers in the pipeline for more advanced tasks.
Let's say I have 2 projects:
If in "train_project" I save the pipeline with joblib.dump(), and then in "use_project" I load it with joblib.load(), it will not find something such as "src.feature_extraction.transformers" and will throw an exception:
ModuleNotFoundError: No module named 'src.feature_extraction'
I should also add that my intention from the beginning was to simplify usage of the model, so a programmer can load it like any other model, pass very simple, human-readable features, and all the "magic" preprocessing of features for the actual model (e.g. gradient boosting) happens inside.
I thought of creating a /dependencies/xxx_model/ catalog in the root of both projects, and storing all needed classes and functions there (copying code from "train_project" to "use_project"), so the structure of the projects is equal and the transformers can be loaded. I find this solution extremely inelegant, because it would force the structure of any project where the model would be used.
I thought of just recreating the pipeline and all transformers inside "use_project" and somehow loading the fitted values of the transformers from "train_project".
The best possible solution would be if the dumped file contained all needed info and needed no dependencies, and I am honestly shocked that sklearn Pipelines seem not to have that possibility - what's the point of fitting a pipeline if I cannot load the fitted object later? Yes, it would work if I used only sklearn classes and did not create custom ones, but the non-custom ones do not have all the needed functionality.
Example code:
train_project
src.feature_extraction.transformers.py
from sklearn.base import TransformerMixin  # note: TransformerMixin lives in sklearn.base, not sklearn.pipeline

class FilterOutBigValuesTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]
train_project
main.py
from sklearn.externals import joblib  # in recent sklearn versions: import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from src.feature_extraction.transformers import FilterOutBigValuesTransformer

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])

X = load_some_pandas_dataframe()
pipeline.fit(X)
joblib.dump(pipeline, 'path.x')
test_project
main.py
from sklearn.externals import joblib
pipeline = joblib.load('path.x')
The expected result is the pipeline loaded correctly, with the transform method ready to use. The actual result is an exception when loading the file.
I found a pretty straightforward solution. Assuming you are using Jupyter notebooks for training:

Create a .py file where the custom transformer is defined and import it to the Jupyter notebook. This is the file custom_transformer.py:
from sklearn.base import TransformerMixin  # TransformerMixin lives in sklearn.base

class FilterOutBigValuesTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]
Import this class from the .py file and save the fitted pipeline using joblib:

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from custom_transformer import FilterOutBigValuesTransformer

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])

X = load_some_pandas_dataframe()
pipeline.fit(X)
joblib.dump(pipeline, 'pipeline.pkl')
When loading the .pkl file in a different Python script, you will have to import the .py file in order to make it work:

import joblib
from utils import custom_transformer  # decided to save it in a utils directory

pipeline = joblib.load('pipeline.pkl')
I have created a workaround solution. I do not consider it a complete answer to my question, but nonetheless it let me move on from my problem.
Conditions for the workaround to work:
I. The pipeline needs to have only 2 kinds of transformers: plain sklearn transformers, or custom transformers whose attributes are of simple types, or any combination of those, e.g. a list of dicts with strings and numbers. The generally important thing is that the attributes are JSON-serializable.
II. The names of pipeline steps need to be unique (even if there is pipeline nesting).
In short, the model would be stored as a catalog with joblib-dumped files, a JSON file for custom transformers, and a JSON file with other info about the model.
I have created a function that goes through the steps of a pipeline and checks the __module__ attribute of each transformer.
If it finds sklearn in it, then it runs joblib.dump under the name specified in steps (the first element of the step tuple), into some selected model catalog.
Otherwise (no sklearn in __module__) it adds the transformer's __dict__ to result_dict under a key equal to the name specified in steps. At the end, I json.dump the result_dict to the model catalog under the name result_dict.json.
If there is a need to go into some transformer (because, e.g., there is a Pipeline inside a pipeline), you can probably run this function recursively by adding some rules to the beginning of the function, but it becomes important to always have unique step/transformer names, even between the main pipeline and subpipelines.
If there is other information needed for creation of the model pipeline, then save it in model_info.json.
Then, if you want to load the model for usage: you need to create (without fitting) the same pipeline in the target project. If pipeline creation is somewhat dynamic and you need information from the source project, then load it from model_info.json.
You can copy the function used for serialization and adapt it to load instead of dump: load the joblib files back into the sklearn steps, and restore each custom transformer's __dict__ from result_dict.json.
After running this modified function, the previously unfitted pipeline should have all the transformer attributes that were the effect of fitting loaded, and the pipeline as a whole should be ready to predict.
The main things I do not like about this solution are that it needs the pipeline code inside the target project, and needs all attributes of the custom transformers to be JSON-serializable, but I leave it here for other people who stumble on a similar problem; maybe somebody comes up with something better.
Have you tried using cloudpickle? https://github.com/cloudpipe/cloudpickle
Based on my research, it seems that the best solution is to create a Python package that includes your trained pipeline and all files. Then you can pip install it in the project where you want to use it and import the pipeline with from <package name> import <pipeline name>.
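For illustration only, a minimal setup.py for such a package might look like the sketch below (the package name xxx_model and the bundled pipeline.pkl are assumed names for this example, not something prescribed by sklearn):

```python
# Hypothetical package layout:
#   xxx_model/
#       __init__.py        # loads and exposes the pipeline
#       transformers.py    # FilterOutBigValuesTransformer lives here
#       pipeline.pkl       # the fitted pipeline, dumped with joblib
from setuptools import setup, find_packages

setup(
    name="xxx_model",
    version="0.1.0",
    packages=find_packages(),
    package_data={"xxx_model": ["pipeline.pkl"]},  # ship the fitted artifact
    install_requires=["scikit-learn", "joblib"],
)
```

After pip installing the package, unpickling succeeds in the consuming project because the transformer classes are importable under the same module path in both projects.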
Apparently this problem arises when you split the definitions and the saving code into two different files. So I have found this workaround that has worked for me.
It consists of these steps:
Say we have your 2 projects/repositories: train_project and use_project.
train_project:
In your train_project, create a Jupyter notebook or .py file.
In that file, define every custom transformer in a class, and import all the other tools needed from sklearn to design the pipelines. Then write the saving code to pickle inside that same file. (Don't create an external .py file src.feature_extraction.transformers to define your custom transformers.)
Then fit and dump your pipeline by running that file.
In use_project:
I hope it works for everyone with the same problem.
I was similarly surprised when I came across the same problem some time ago. Yet there are multiple ways to address this.
As others have mentioned, the best-practice solution is to move all dependencies of your pipeline into a separate Python package and define that package as a dependency of your model environment.
The environment then has to be recreated whenever the model is deployed. In simple cases this can be done manually, e.g. via virtualenv or Poetry. But model stores and versioning frameworks (MLflow being one example) typically provide a way to define the required Python environment (e.g. via conda.yaml). They can often automatically recreate the environment at deployment time.
In fact, class and function declarations can be serialized, but only declarations in __main__ actually get serialized. __main__ is the entry point of the script, the file that is run. So if all the custom code and all of its dependencies are in that file, then custom objects can later be loaded in Python environments that do not include the code. This kind of solves the problem, but who wants to have all that code in __main__? (Note that this property also applies to cloudpickle.)
There is one other way, which is to "mainify" the classes or function objects before saving. I came across that same problem some time ago and have written a function that does that. It essentially redefines an existing object's code in __main__. Its application is simple: pass the object to the function, then serialize the object, voilà, it can be loaded anywhere. Like so:
# ------ In file1.py: ------
class Foo():
    pass

# ------ In file2.py: ------
import dill
from file1 import Foo

foo = Foo()
foo = mainify(foo)

with open('path/file.dill', 'wb') as f:
    dill.dump(foo, f)
I post the function code below. Note that I have tested this with dill, but I think it should work with pickle as well.
Also note that the original idea is not mine, but came from a blog post that I cannot find right now. I will add the reference/acknowledgement when I find it. Edit: Blog post by Oege Dijk by which my code was inspired.
import inspect
import types

def mainify(obj, warn_if_exist=True):
    '''If obj is not defined in __main__ then redefine it in main. Allows dill
    to serialize custom classes and functions such that they can later be loaded
    without them being declared in the load environment.

    Parameters
    ----------
    obj           : Object to mainify (function or class instance)
    warn_if_exist : Bool, default True. Throw exception if function (or class) of
                    same name as the mainified function (or same name as mainified
                    object's __class__) was already defined in __main__. If False
                    don't throw exception and instead use what was defined in
                    __main__. See Limitations.

    Limitations
    -----------
    Assumes `obj` is either a function or an instance of a class.
    '''
    if obj.__module__ != '__main__':
        import __main__
        is_func = True if isinstance(obj, types.FunctionType) else False

        # Check if obj with same name is already defined in __main__ (for funcs)
        # or if class with same name as obj's class is already defined in __main__.
        # If so, simply return the func with same name from __main__ (for funcs)
        # or assign the class of same name to obj and return the modified obj
        if is_func:
            on = obj.__name__
            if on in __main__.__dict__.keys():
                if warn_if_exist:
                    raise RuntimeError(f'Function with __name__ `{on}` already defined in __main__')
                return __main__.__dict__[on]
        else:
            ocn = obj.__class__.__name__
            if ocn in __main__.__dict__.keys():
                if warn_if_exist:
                    raise RuntimeError(f'Class with obj.__class__.__name__ `{ocn}` already defined in __main__')
                obj.__class__ = __main__.__dict__[ocn]
                return obj

        # Get source code and compile
        source = inspect.getsource(obj if is_func else obj.__class__)
        compiled = compile(source, '<string>', 'exec')

        # "declare" in __main__, keeping track which key of __main__ dict is the new one
        pre = list(__main__.__dict__.keys())
        exec(compiled, __main__.__dict__)
        post = list(__main__.__dict__.keys())
        new_in_main = list(set(post) - set(pre))[0]

        # for function return mainified version, else assign new class to obj and return object
        if is_func:
            obj = __main__.__dict__[new_in_main]
        else:
            obj.__class__ = __main__.__dict__[new_in_main]

    return obj
Credit to Ture Friese for mentioning cloudpickle >=2.0.0, but here's an example for your use case.
import cloudpickle
import src.feature_extraction.transformers as transformers

# register_pickle_by_value() expects a module, so register the module
# that defines FilterOutBigValuesTransformer
cloudpickle.register_pickle_by_value(transformers)

with open('./pipeline.cloudpkl', mode='wb') as file:
    cloudpickle.dump(pipeline, file)
register_pickle_by_value() is the key, as it ensures your custom module (src.feature_extraction.transformers) is also included when serializing your primary object (pipeline). However, this is not built for recursive module dependence, e.g. if FilterOutBigValuesTransformer also contains another import statement.
Appending the location of the transformer code to sys.path may resolve the issue. Note that for src.feature_extraction.transformers to be importable, the appended directory must be the one that contains the src package (the train project root), not the transformers directory itself:

import sys
sys.path.append("path/to/train_project")  # the directory containing src/