
How to properly pickle sklearn pipeline when using custom transformer

I am trying to pickle a sklearn machine-learning model and load it in another project. The model is wrapped in a pipeline that does feature encoding, scaling, etc. The problem starts when I want to use self-written transformers in the pipeline for more advanced tasks.

Let's say I have 2 projects:

  • train_project: it has the custom transformers in src.feature_extraction.transformers.py
  • use_project: it has other things in src, or has no src catalog at all

If in "train_project" I save the pipeline with joblib.dump(), and then in "use_project" i load it with joblib.load() it will not find something such as "src.feature_extraction.transformers" and throw exception:如果在“train_project”中我使用 joblib.dump() 保存管道,然后在“use_project”中使用 joblib.load() 加载它,它将找不到诸如“src.feature_extraction.transformers”之类的内容并抛出异常:

ModuleNotFoundError: No module named 'src.feature_extraction' ModuleNotFoundError:没有名为“src.feature_extraction”的模块

I should also add that my intention from the beginning was to simplify usage of the model, so a programmer can load the model like any other model, pass very simple, human-readable features, and all the "magic" preprocessing of features for the actual model (e.g. gradient boosting) happens inside.

I thought of creating a /dependencies/xxx_model/ catalog in the root of both projects, and storing all needed classes and functions there (copying the code from "train_project" to "use_project"), so the structure of the projects is equal and the transformers can be loaded. I find this solution extremely inelegant, because it would force the structure of any project where the model would be used.

I thought of just recreating the pipeline and all transformers inside "use_project" and somehow loading the fitted values of the transformers from "train_project".

The best possible solution would be if the dumped file contained all needed info and required no dependencies, and I am honestly shocked that sklearn pipelines seem not to have that possibility - what's the point of fitting a pipeline if I can not load the fitted object later? Yes, it would work if I used only sklearn classes and did not create custom ones, but the non-custom ones do not have all the needed functionality.

Example code:

train_project

src.feature_extraction.transformers.py

from sklearn.base import TransformerMixin  # TransformerMixin lives in sklearn.base, not sklearn.pipeline

class FilterOutBigValuesTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]

train_project

main.py

from sklearn.externals import joblib  # on sklearn >= 0.23, use plain `import joblib` instead
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from src.feature_extraction.transformers import FilterOutBigValuesTransformer

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])
X = load_some_pandas_dataframe()
pipeline.fit(X)
joblib.dump(pipeline, 'path.x')

test_project

main.py

from sklearn.externals import joblib  # on sklearn >= 0.23, use plain `import joblib` instead

pipeline = joblib.load('path.x')

The expected result is the pipeline loading correctly, with the transform method available to use.

The actual result is an exception when loading the file.

I found a pretty straightforward solution. Assuming you are using Jupyter notebooks for training:

  1. Create a .py file where the custom transformer is defined and import it into the Jupyter notebook.

This is the file custom_transformer.py:

from sklearn.base import TransformerMixin  # TransformerMixin lives in sklearn.base, not sklearn.pipeline

class FilterOutBigValuesTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]
  2. Train your model, importing this class from the .py file, and save the fitted pipeline using joblib.
from custom_transformer import FilterOutBigValuesTransformer
from sklearn.externals import joblib  # on sklearn >= 0.23, use plain `import joblib` instead
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])

X = load_some_pandas_dataframe()
pipeline.fit(X)

joblib.dump(pipeline, 'pipeline.pkl')
  3. When loading the .pkl file in a different python script, you will have to import the .py file in order to make it work:
import joblib
from utils import custom_transformer # decided to save it in a utils directory

pipeline = joblib.load('pipeline.pkl')

I have created a workaround solution. I do not consider it a complete answer to my question, but nonetheless it let me move on from my problem.

Conditions for the workaround to work:

I. The pipeline needs to have only 2 kinds of transformers:

  1. sklearn transformers
  2. custom transformers, but with only attributes of these types:
    • number
    • string
    • list
    • dict

or any combination of those, e.g. a list of dicts with strings and numbers. The generally important thing is that the attributes are json serializable.

II. The names of pipeline steps need to be unique (even if there is pipeline nesting)


In short, the model would be stored as a catalog with joblib-dumped files, a json file for the custom transformers, and a json file with other info about the model.

I have created a function that goes through the steps of a pipeline and checks the __module__ attribute of each transformer.

If it finds sklearn in it, it runs the joblib.dump function under the name specified in steps (the first element of the step tuple), to some selected model catalog.

Otherwise (no sklearn in __module__) it adds the transformer's __dict__ to result_dict under a key equal to the name specified in steps. At the end I json.dump the result_dict to the model catalog under the name result_dict.json.

If there is a need to go into some transformer, because e.g. there is a Pipeline inside a pipeline, you can probably run this function recursively by adding some rules to the beginning of the function, but then it becomes important to always have unique step/transformer names, even between the main pipeline and subpipelines.

If there is other information needed for the creation of the model pipeline, then save it in model_info.json. A sketch of such a serialization function is shown below.
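A minimal sketch of what such a function could look like; the function name dump_pipeline and the catalog layout are assumptions for illustration, since the original code was not posted:

import json
import os

import joblib


def dump_pipeline(pipeline, catalog_dir):
    """Dump sklearn steps with joblib; collect custom steps' __dict__ into one JSON file."""
    os.makedirs(catalog_dir, exist_ok=True)
    result_dict = {}
    for name, transformer in pipeline.steps:
        if 'sklearn' in transformer.__module__:
            # Built-in sklearn transformer: joblib can serialize it on its own
            joblib.dump(transformer, os.path.join(catalog_dir, f'{name}.joblib'))
        else:
            # Custom transformer: keep only its (json-serializable) fitted attributes
            result_dict[name] = transformer.__dict__
    with open(os.path.join(catalog_dir, 'result_dict.json'), 'w') as f:
        json.dump(result_dict, f)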


Then, if you want to load the model for usage: you need to create (without fitting) the same pipeline in the target project. If pipeline creation is somewhat dynamic, and you need information from the source project, then load it from model_info.json.

You can copy the function used for serialization and:

  • replace all joblib.dump calls with joblib.load statements, and assign the __dict__ from the loaded object to the __dict__ of the object already in the pipeline
  • replace all places where you added __dict__ to result_dict with assignment of the appropriate value from result_dict to the object's __dict__ (remember to load result_dict from file beforehand)

After running this modified function, the previously unfitted pipeline should have all the transformer attributes that were the effect of fitting loaded, and the pipeline as a whole should be ready to predict. A sketch of such a loading function follows.
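A matching load sketch, under the same assumptions as the dump sketch above (the same, unfitted pipeline has already been constructed in the target project):

import json
import os

import joblib


def load_pipeline_state(pipeline, catalog_dir):
    """Restore fitted state into an already-constructed, unfitted pipeline."""
    with open(os.path.join(catalog_dir, 'result_dict.json')) as f:
        result_dict = json.load(f)
    for name, transformer in pipeline.steps:
        if 'sklearn' in transformer.__module__:
            # Take the fitted attributes from the joblib-dumped sklearn transformer
            fitted = joblib.load(os.path.join(catalog_dir, f'{name}.joblib'))
            transformer.__dict__ = fitted.__dict__
        else:
            # Custom transformer: restore the json-serialized fitted attributes
            transformer.__dict__.update(result_dict[name])
    return pipeline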

The main things I do not like about this solution are that it needs the pipeline code inside the target project, and needs all attrs of the custom transformers to be json serializable, but I leave it here for other people who stumble on a similar problem; maybe somebody comes up with something better.

Have you tried using cloudpickle? https://github.com/cloudpipe/cloudpickle
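For what it's worth, a minimal sketch of how that could be used (cloudpickle serializes classes defined in the training script's __main__ by value, and the resulting file can be read back with plain pickle):

import cloudpickle

# In train_project, after fitting:
with open('pipeline.cloudpkl', 'wb') as f:
    cloudpickle.dump(pipeline, f)

# In use_project, no custom imports needed:
import pickle
with open('pipeline.cloudpkl', 'rb') as f:
    pipeline = pickle.load(f)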

Based on my research, it seems that the best solution is to create a Python package that includes your trained pipeline and all the files it needs.

Then you can pip install it in the project where you want to use it and import the pipeline with from <package name> import <pipeline name>.
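A minimal sketch of what such a package could look like; the package name model_pkg and the load_pipeline helper are assumptions for illustration. Note the pipeline has to be trained with the transformer imported from this same package path, so that pickle records an importable module name:

# Hypothetical package layout:
#
#   model_pkg/
#       __init__.py        # exposes load_pipeline()
#       transformers.py    # FilterOutBigValuesTransformer lives here
#       artifacts/
#           pipeline.pkl   # the fitted pipeline, dumped with joblib
#
# model_pkg/__init__.py:
import os

import joblib

# Importing the class makes it resolvable when the pipeline is unpickled
from .transformers import FilterOutBigValuesTransformer  # noqa: F401


def load_pipeline():
    path = os.path.join(os.path.dirname(__file__), 'artifacts', 'pipeline.pkl')
    return joblib.load(path)

After pip installing the package in use_project, loading is just from model_pkg import load_pipeline.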

Apparently this problem arises when you split the definitions and the saving code into two different files. So I have found this workaround that has worked for me.

It consists of these steps:

Say we have your 2 projects/repositories: train_project and use_project

On train_project:

  • On your train_project create a jupyter notebook or .py file

  • In that file, define every custom transformer in a class, and import all the other tools needed from sklearn to design the pipelines. Then write the saving code to pickle inside that same file. (Don't create an external .py file src.feature_extraction.transformers to define your custom transformers.)

  • Then fit and dump your pipeline by running that file.

On use_project:

  • Create a customthings.py file with all the functions and transformers defined inside.
  • Create another file_where_load.py where you wish to load the pickle. Inside, make sure you have imported all the definitions from customthings.py. Ensure that the functions and classes have the same names as the ones you used in train_project. A sketch of such a loader file follows.
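A minimal sketch of what file_where_load.py could look like under these assumptions; the wildcard import brings the definitions from customthings.py into this script's __main__ namespace, which is where pickle looks for classes that were defined in the training notebook's __main__:

# file_where_load.py
import joblib

# Mirror the "import all the definitions" step from the answer
from customthings import *  # noqa: F401,F403

pipeline = joblib.load('pipeline.pkl')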

I hope it works for everyone with the same problem.

I was similarly surprised when I came across the same problem some time ago. Yet there are multiple ways to address this.

Best practice solution:

As others have mentioned, the best practice solution is to move all dependencies of your pipeline into a separate Python package and define that package as a dependency of your model environment.

The environment then has to be recreated whenever the model is deployed. In simple cases this can be done manually, e.g. via virtualenv or Poetry. But model stores and versioning frameworks (MLflow being one example) typically provide a way to define the required Python environment (e.g. via conda.yaml), and they can often recreate the environment automatically at deployment time. A sketch of this with MLflow follows.
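For illustration, a sketch of how this could look with MLflow; treat the exact arguments as assumptions to verify against your MLflow version. code_paths copies the listed source directories into the model artifact, so the custom transformer code travels with the model:

import mlflow.sklearn

mlflow.sklearn.log_model(
    sk_model=pipeline,
    artifact_path="model",
    code_paths=["src/feature_extraction"],  # ship the custom transformer code alongside the model
)

# Later, in the deployment environment:
# loaded = mlflow.sklearn.load_model("runs:/<run_id>/model")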

Solution by putting code into __main__:

In fact, class and function declarations can be serialized, but only declarations in __main__ actually get serialized. __main__ is the entry point of the script, the file that is run. So if all the custom code and all of its dependencies are in that file, then custom objects can later be loaded in Python environments that do not include the code. This kind of solves the problem, but who wants to have all that code in __main__? (Note that this property also applies to cloudpickle.)
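A quick sketch of that property using dill (the same library used in the mainify example below); everything, including the custom class, lives in the script that is run directly:

# train_script.py - run directly, so these definitions live in __main__
import dill
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class FilterOutBigValuesTransformer(TransformerMixin):
    def fit(self, X, y=None):
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        return X.loc[X.c1 <= self.biggest_value]


pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])
# ... pipeline.fit(X) ...

with open('pipeline.dill', 'wb') as f:
    dill.dump(pipeline, f)  # the class definition is serialized by value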

Solution by "mainifying":通过“维护”解决方案:

There is one other way, which is to "mainify" the classes or function objects before saving. I came across that same problem some time ago and wrote a function that does that. It essentially redefines an existing object's code in __main__. Its application is simple: pass the object to the function, then serialize the object, voilà, it can be loaded anywhere. Like so:

# ------ In file1.py: ------    
    
class Foo():
    pass

# ------ In file2.py: ------
from file1 import Foo    

foo = Foo()
foo = mainify(foo)

import dill

with open('path/file.dill', 'wb') as f:
    dill.dump(foo, f)

I post the function code below. Note that I have tested this with dill, but I think it should work with pickle as well.

Also note that the original idea is not mine, but came from a blog post that I cannot find right now. I will add the reference/acknowledgement when I find it. Edit: the blog post by Oege Dijk by which my code was inspired.

def mainify(obj, warn_if_exist=True):
    ''' If obj is not defined in __main__ then redefine it in main. Allows dill 
    to serialize custom classes and functions such that they can later be loaded
    without them being declared in the load environment.

    Parameters
    ---------
    obj           : Object to mainify (function or class instance)
    warn_if_exist : Bool, default True. Throw exception if function (or class) of
                    same name as the mainified function (or same name as mainified
                    object's __class__) was already defined in __main__. If False
                    don't throw exception and instead use what was defined in
                    __main__. See Limitations.
    Limitations
    -----------
    Assumes `obj` is either a function or an instance of a class.                
    ''' 
    if obj.__module__ != '__main__':                                                
        
        import __main__
        import inspect
        import types

        is_func = isinstance(obj, types.FunctionType)
        
        # Check if obj with same name is already defined in __main__ (for funcs)
        # or if class with same name as obj's class is already defined in __main__.
        # If so, simply return the func with same name from __main__ (for funcs)
        # or assign the class of same name to obj and return the modified obj        
        if is_func:
            on = obj.__name__
            if on in __main__.__dict__.keys():
                if warn_if_exist:
                    raise RuntimeError(f'Function with __name__ `{on}` already defined in __main__')
                return __main__.__dict__[on]
        else:
            ocn = obj.__class__.__name__
            if ocn  in __main__.__dict__.keys():
                if warn_if_exist:
                    raise RuntimeError(f'Class with obj.__class__.__name__ `{ocn}` already defined in __main__')
                obj.__class__ = __main__.__dict__[ocn]                
                return obj
                                
        # Get source code and compile
        source = inspect.getsource(obj if is_func else obj.__class__)
        compiled = compile(source, '<string>', 'exec')                    
        # "declare" in __main__, keeping track which key of __main__ dict is the new one        
        pre = list(__main__.__dict__.keys()) 
        exec(compiled, __main__.__dict__)
        post = list(__main__.__dict__.keys())                        
        new_in_main = list(set(post) - set(pre))[0]
        
        # for function return mainified version, else assign new class to obj and return object
        if is_func:
            obj = __main__.__dict__[new_in_main]            
        else:            
            obj.__class__ = __main__.__dict__[new_in_main]
                
    return obj

Credit to Ture Friese for mentioning cloudpickle >=2.0.0, but here's an example for your use case.

import cloudpickle

import src.feature_extraction.transformers

# register_pickle_by_value() expects a module object, not a class
cloudpickle.register_pickle_by_value(src.feature_extraction.transformers)

with open('./pipeline.cloudpkl', mode='wb') as file:
    cloudpickle.dump(obj=pipeline, file=file)

register_pickle_by_value() is the key, as it ensures your custom module (src.feature_extraction.transformers) is also included when serializing your primary object (pipeline). However, this is not built for recursive module dependence, e.g. if FilterOutBigValuesTransformer itself imports from yet another custom module.
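A hypothetical sketch of one way around that caveat: register every custom module in the dependency chain by value, so their code is embedded in the pickle too (src.feature_extraction.helpers is an invented name for illustration):

import cloudpickle

import src.feature_extraction.helpers  # hypothetical module that transformers.py imports from
import src.feature_extraction.transformers

for module in (src.feature_extraction.transformers, src.feature_extraction.helpers):
    cloudpickle.register_pickle_by_value(module)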

Adding the location of the code to sys.path before loading may resolve the issue. Note that for pickle to import src.feature_extraction.transformers, the appended path has to be the project root that contains the src package, not the transformers directory itself:

import sys
sys.path.append("path/to/train_project")  # hypothetical path: the directory containing the `src` package
