
Custom sklearn pipeline transformer giving "pickle.PicklingError"

I am trying to create a custom transformer for a Python sklearn pipeline based on guidance from this tutorial: http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/

Right now my custom class/transformer looks like this:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor


class SelectBestPercFeats(BaseEstimator, TransformerMixin):
    def __init__(self, model=RandomForestRegressor(), percent=0.8,
                 random_state=52):
        self.model = model
        self.percent = percent
        self.random_state = random_state

    def fit(self, X, y, **fit_params):
        """
        Find the features with the best predictive power for the model,
        keeping those whose cumulative importance is less than self.percent.
        """
        # Check parameters
        if not isinstance(self.percent, float):
            print("SelectBestPercFeats.percent is not a float, it should be...")
        elif not isinstance(self.random_state, int):
            print("SelectBestPercFeats.random_state is not an int, it should be...")

        # If the checks pass, proceed with fitting...
        else:
            try:
                self.model.fit(X, y)
            except Exception:
                print("Error fitting model inside SelectBestPercFeats object")
                return self

            # Get feature importances, sorted descending, as a cumulative sum
            try:
                feat_imp = list(self.model.feature_importances_)
                feat_imp_cum = pd.Series(feat_imp, index=X.columns) \
                    .sort_values(ascending=False).cumsum()

                # Keep features whose cumulative importance is <= `percent`
                n_feats = len(feat_imp_cum[feat_imp_cum <= self.percent].index) + 1
                self.bestcolumns_ = list(feat_imp_cum.index)[:n_feats]
            except AttributeError:
                print("ERROR: SelectBestPercFeats can only be used with models"
                      " that expose a .feature_importances_ attribute")
        return self

    def transform(self, X, y=None, **fit_params):
        """
        Filter out only the important features (based on the percent
        threshold) for the model supplied.

        :param X: DataFrame with features to be down-selected
        """
        if getattr(self, "bestcolumns_", None) is None:
            print("Must call fit on a SelectBestPercFeats object before transforming")
        else:
            return X[self.bestcolumns_]

I am integrating this class into an sklearn pipeline like this:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define feature selection and model pipeline components
rf_simp = RandomForestRegressor(criterion='mse', n_jobs=-1,
                                n_estimators=600)
bestfeat = SelectBestPercFeats(rf_simp, feat_perc)
rf = RandomForestRegressor(n_jobs=-1,
                           criterion='mse',
                           n_estimators=200,
                           max_features=0.4,
                           )

# Build Pipeline
master_model = Pipeline([('feat_sel', bestfeat), ('rf', rf)])

# define GridSearchCV parameter space to search, 
#   only listing one parameter to simplify troubleshooting
param_grid = {
    'feat_sel__percent': [0.8],
}

# Set up grid search over the pipeline
grid = GridSearchCV(master_model, cv=3, n_jobs=-1,
                    param_grid=param_grid)

# Search grid using CV, and get the best estimator
grid.fit(X_train, y_train)

Whenever I run the last line of code (grid.fit(X_train, y_train)) I get the "PicklingError" below. Can anyone see what is causing this problem in my code?

EDIT:

Or, is there something wrong in my Python setup... Might I be missing a package or something similar? I just checked that I can import pickle successfully.

Traceback (most recent call last):
  File "", line 5, in
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\model_selection\_search.py", line 945, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\model_selection\_search.py", line 564, in _fit
    for parameters in parameter_iterable
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 768, in __call__
    self.retrieve()
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 719, in retrieve
    raise exception
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\parallel.py", line 682, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 608, in get
    raise self._value
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 385, in _handle_tasks
    put(task)
  File "C:\Users\jjaaae\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\externals\joblib\pool.py", line 371, in send
    CustomizablePickler(buffer, self._reducers).dump(obj)
_pickle.PicklingError: Can't pickle <class 'SelectBestPercFeats'>: attribute lookup SelectBestPercFeats on builtins failed

The pickle module needs the custom class(es) to be defined in another module and then imported. So, create another python file (e.g. transformation.py) and then import it like this: from transformation import SelectBestPercFeats. That will resolve the pickling error.
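A minimal sketch of that layout (the transformation.py file name is just an example). Pickle stores a class by reference (module.ClassName), so a class defined in __main__ or an interactive session can't be looked up by the joblib worker processes; moving it into an importable module fixes the lookup:

# transformation.py -- define the transformer in a real, importable module
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor

class SelectBestPercFeats(BaseEstimator, TransformerMixin):
    # ... class body exactly as shown in the question ...
    pass

# training script / notebook -- import it instead of defining it inline
from transformation import SelectBestPercFeats

bestfeat = SelectBestPercFeats(rf_simp, feat_perc)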

When you code your own transformer, and IF this transformer contains code that can't be serialized, then the whole pipeline won't be serializable if you try to serialize it.

Not only that, but you also need such serialization to be able to parallelize your work, as with the n_jobs=-1 you've set, which uses many threads.

An unfortunate thing with scikit-learn is that every object should have its saver. Hopefully, there is a solution. It's either to make your object serializable (and hence remove the things you import from external libs), or to use only 1 job (no threading), or else to give your object a saver that can save the object in order to serialize it. The saver-based solution is explored below.
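As a quick aside, the first option (making the object itself serializable) can often be done in plain Python by stripping the offending attribute during pickling. A minimal sketch, where self._session stands in for a hypothetical unpicklable handle:

import pickle

class WeirdStep:
    """Example step whose self._session attribute can't be pickled."""
    def __init__(self):
        # Hypothetical unpicklable handle (open file, GPU session, DB connection...)
        self._session = open("some_resource.bin", "ab")
        self.threshold = 0.8

    def __getstate__(self):
        # Copy the instance dict and drop the attribute pickle chokes on.
        state = self.__dict__.copy()
        state.pop("_session", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._session = None  # must be re-opened after loading

step = WeirdStep()
data = pickle.dumps(step)  # works: the handle was stripped before pickling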

First, here is the definition of the problem, and its solution, taken from this source:

Problem: You Can't Parallelize nor Save Pipelines Using Steps that Can't be Serialized "as-is" by Joblib

This problem will only surface past some point of using Scikit-Learn. This is the point of no return: you've coded your entire production pipeline, but once you've trained it and selected the best model, you realize that what you've just coded can't be serialized.

This means once trained, your pipeline can't be saved to disk because one of its steps imports things from a weird python library coded in another language and/or uses GPU resources. Your code smells weird and you start panicking over what was a full year of research development.

Hopefully, you're nice enough to start coding your own open-source framework on the side, because you'll live this same situation in your next 100 coding projects, you have other clients who will be in the same situation soon, and this sh** is critical.

Well, it's out of that shared need that Neuraxle was created.

Solution: Use a Chain of Savers in each Step

Each step is responsible for saving itself, and you should define one or more custom saver objects for your weird object. The saver should:

  1. Save what's important in the step using a Saver (see: Saver).
  2. Delete that from the step (to make it serializable). The step is now stripped by the Saver.
  3. Then the default JoblibStepSaver will execute (in chain) past that point by saving all that's left of the stripped object and deleting the object from your code's RAM. This means you can have many partial savers before the final default JoblibStepSaver.

For instance, a Pipeline will do the following upon having the save() method called, as it has its own TruncableJoblibStepSaver:

  1. Save all its substeps in relative subfolders of the pipeline's serialization subfolder.
  2. Delete them from the pipeline object, except for their names, so they can be found later when loading. The pipeline is now stripped.
  3. Let the default saver save the stripped pipeline.

You don't want to write dirty code. Don't break the Law of Demeter, they say. This is one of the most important (and most easily overlooked) laws of programming, in my opinion. Google it, I dare you. Breaking this law is the root of most evil in your codebase.

I've come to the conclusion that the neatest way not to break this law here is to have a chain of Savers. It makes each object responsible for having special savers if it isn't serializable with joblib. Neat. So just when things break, you have the option of creating your own serializer just for the object that breaks; this way you won't need to break encapsulation at save time to dig into your objects manually, which would break the Law of Demeter.

Note that the savers also need to be able to reload the object when loading the save. We already wrote a TensorFlow Neuraxle saver.

TL;DR: You can call the save() method on any pipeline in Neuraxle, and if some steps define a custom Saver, then the step will use that saver before using the default JoblibStepSaver.
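To make the chain-of-savers idea concrete, here is a rough skeleton modeled on Neuraxle's documented BaseSaver interface. Treat the exact imports and method signatures (save_step, load_step, can_load) as assumptions that may differ between Neuraxle releases, and weird_object/WeirdObject as hypothetical stand-ins for whatever can't be pickled:

from neuraxle.base import BaseSaver, BaseStep, ExecutionContext

class WeirdObjectSaver(BaseSaver):
    # Runs before the default JoblibStepSaver in the chain of savers.

    def save_step(self, step: BaseStep, context: ExecutionContext) -> BaseStep:
        # 1. Save what's important using our own serialization
        #    (WeirdObject and its save/load methods are hypothetical).
        step.weird_object.save(context.get_path())
        # 2. Strip it from the step so the remainder pickles cleanly.
        step.weird_object = None
        return step

    def can_load(self, step: BaseStep, context: ExecutionContext) -> bool:
        return True

    def load_step(self, step: BaseStep, context: ExecutionContext) -> BaseStep:
        # Rebuild the stripped attribute when the saved step is reloaded.
        step.weird_object = WeirdObject.load(context.get_path())
        return step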

Parallelization of your non-picklable pipeline

So you've done the things above using Neuraxle. Neat. Now use Neuraxle's classes for AutoML, random search, and the like. They should have the proper abstractions for parallelization, using the savers to serialize things. Things must be serialized to send your code to other python processes for parallelization.

I had the same problem, but in my case the issue was using function transformers, where pickle sometimes has difficulty serializing functions. The solution for me was to use dill instead, though it is a bit slower.
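dill exposes the same dump/load interface as pickle, so swapping it in is mechanical. A minimal sketch, where fitted_pipeline is a placeholder for your fitted Pipeline object:

import dill  # pip install dill

# Save: dill can serialize lambdas and locally defined functions
# that the stock pickle module rejects.
with open("pipeline.pkl", "wb") as f:
    dill.dump(fitted_pipeline, f)

# Load it back
with open("pipeline.pkl", "rb") as f:
    fitted_pipeline = dill.load(f)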

In my case, I just had to restart the IPython IDE where I was testing the transformer. After restarting the IDE and re-running the code, it either works well or starts giving you a more meaningful error.
