How to convert a sklearn pipeline into a pyspark pipeline?

We have a machine learning classifier model that we trained with a pandas dataframe and a standard sklearn pipeline (StandardScaler, RandomForestClassifier, GridSearchCV, etc.). We are working on Databricks and would like to scale this pipeline up to a large dataset using the parallel computation Spark offers.

What is the quickest way to convert our sklearn pipeline into something that computes in parallel? (We can easily switch between pandas and Spark DataFrames as required.)
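For example, moving data between the two is a one-liner in each direction (a minimal sketch, assuming the spark SparkSession that Databricks notebooks provide by default):

import pandas as pd

# pandas -> Spark: distribute the data across the cluster
pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "label": [0, 1, 0]})
sdf = spark.createDataFrame(pdf)

# Spark -> pandas: collect the (small enough) result back onto the driver
pdf_again = sdf.toPandas()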

For context, our options seem to be:

  1. Rewrite the pipeline using MLlib (time-consuming; see the sketch after this list)
  2. Use a sklearn-spark bridging library
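For reference, option 1 would look roughly like the following (a minimal sketch only, not our actual pipeline; the column names f1, f2 and label and the parameter grid are illustrative assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Spark ML expects all features packed into a single vector column
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, scaler, rf])

# CrossValidator + ParamGridBuilder play the role GridSearchCV plays in sklearn
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                    numFolds=5)

# model = cv.fit(train_sdf)   # train_sdf: a Spark DataFrame with f1, f2, label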

On option 2, Spark-Sklearn seems to be deprecated, but Databricks instead recommends that we use joblibspark. However, this raises an exception on Databricks:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from joblibspark import register_spark
from sklearn.utils import parallel_backend
register_spark() # register spark backend

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC(gamma='auto')

clf = GridSearchCV(svr, parameters, cv=5)
with parallel_backend('spark', n_jobs=3):
    clf.fit(iris.data, iris.target)

raises

py4j.security.Py4JSecurityException: Method public int org.apache.spark.SparkContext.maxNumConcurrentTasks() is not whitelisted on class class org.apache.spark.SparkContext

According to the Databricks instructions (here and here), the necessary requirements are:

  • Python 3.6+
  • pyspark>=2.4
  • scikit-learn>=0.21
  • joblib>=0.14
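Of these, only joblibspark is missing from a stock Databricks runtime (pyspark and scikit-learn ship with it); assuming a recent runtime, it can be installed from a notebook cell with the standard %pip magic:

%pip install joblibspark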

I cannot reproduce your issue in a community Databricks cluster running Python 3.7.5, Spark 3.0.0, scikit-learn 0.22.1, and joblib 0.14.1:

import sys
import sklearn
import joblib

spark.version
# '3.0.0'

sys.version
# '3.7.5 (default, Nov  7 2019, 10:50:52) \n[GCC 8.3.0]'

sklearn.__version__
# '0.22.1'

joblib.__version__
# '0.14.1'

With the above settings, your code snippet runs smoothly, and indeed produces a classifier clf as:

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
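From there, the fitted search can be inspected through the standard GridSearchCV attributes (shown for completeness; the values in the comments are illustrative, not recorded output):

clf.best_params_     # e.g. {'C': 1, 'kernel': 'linear'} -- whatever the search picked
clf.best_score_      # mean cross-validated score of that combination
clf.best_estimator_  # the refitted SVC, ready for .predict()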

as does the alternative example from here:

from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark() # register spark backend

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)

giving

[0.96666667 1.         0.96666667 0.96666667 1.        ]

Thanks to desertnaut for the response. This answer should be correct for a standard Spark / Databricks setup, so I have accepted it, given the wording of my question and its potential usefulness for other readers.

Contributing a separate "answer" having discovered what the issue was in our case: Databricks support advised that the issue was due to our using a special type of cluster (High Concurrency with credentials passthrough enabled, on AWS). grid.fit() was not whitelisted for this type of cluster, and Databricks advised that they would need to raise it with their engineering team to whitelist it.
