Replicating a logistic regression model from pyspark in scikit-learn

Question

Problem: The default implementations (no custom parameters set) of the logistic regression model in pyspark and scikit-learn seem to yield different results given their default parameter values.

I am trying to replicate a result from logistic regression (no custom parameters set) performed with pyspark (see: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression ) with the logistic regression model from scikit-learn (see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html ).

It appears to me that the two model implementations (in pyspark and scikit-learn) do not have the same parameters, so I can't simply match the parameters in scikit-learn to those in pyspark. Is there a way to match both models on their default configuration?

Scikit model parameters (defaults):

LogisticRegression(
C=1.0,
class_weight=None,
dual=False,
fit_intercept=True,
intercept_scaling=1,
max_iter=100,
multi_class='ovr',
n_jobs=1,
penalty='l2',
random_state=None,
solver='liblinear',
tol=0.0001,
verbose=0,
warm_start=False)

Pyspark model parameters (defaults):

LogisticRegression(self, 
featuresCol="features", 
labelCol="label", 
predictionCol="prediction", 
maxIter=100,
regParam=0.0, 
elasticNetParam=0.0, 
tol=1e-6, 
fitIntercept=True, 
threshold=0.5, 
thresholds=None, 
probabilityCol="probability", 
rawPredictionCol="rawPrediction", 
standardization=True, 
weightCol=None, 
aggregationDepth=2, 
family="auto")

Thank you very much!

Answer 1

pyspark's LR uses ElasticNet regularization, which is a weighted sum of L1 and L2 terms; the weight is elasticNetParam. So with elasticNetParam=0 you get L2 regularization, and regParam is the L2 regularization coefficient; with elasticNetParam=1 you get L1 regularization, and regParam is the L1 regularization coefficient. C in sklearn LogisticRegression is the inverse of regParam, i.e. regParam = 1/C.
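The regParam = 1/C relation can be sketched as a small conversion helper. One caveat, offered as an assumption to verify rather than a settled fact: going by the objectives documented for the two libraries, sklearn appears to sum the per-sample loss while Spark averages it, so a closer correspondence for the pure L2 case (elasticNetParam=0) may be regParam = 1/(n_samples * C). The function name sklearn_C_from_spark is hypothetical and belongs to neither library.

```python
import math


def sklearn_C_from_spark(reg_param, n_samples):
    """Hypothetical helper: sklearn C roughly equivalent to a pyspark regParam.

    Assumes pure L2 regularization (elasticNetParam=0) and assumes that
    sklearn sums the log-loss while Spark averages it over the samples.
    """
    if reg_param < 0 or n_samples <= 0:
        raise ValueError("reg_param must be >= 0 and n_samples must be > 0")
    if reg_param == 0.0:
        # pyspark's default regParam=0.0 means no regularization at all,
        # which corresponds to an infinitely large C.
        return math.inf
    return 1.0 / (reg_param * n_samples)


c_value = sklearn_C_from_spark(reg_param=0.01, n_samples=100)
c_default = sklearn_C_from_spark(reg_param=0.0, n_samples=100)
```

Note that under this reading the two defaults (C=1.0 versus regParam=0.0) do not describe the same amount of regularization, which may be one source of the differing results.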

Also, the default training methods are different; you may need to set solver='lbfgs' in sklearn LogisticRegression to make the training methods more similar. It only works with L2 though.

If you need ElasticNet regularization (i.e. 0 < elasticNetParam < 1), then sklearn implements it in SGDClassifier: set penalty='elasticnet' together with a logistic loss; alpha would be similar to regParam (and you don't have to invert it, as you do with C), and l1_ratio would be elasticNetParam.

sklearn doesn't provide a threshold parameter directly, but you can use predict_proba instead of predict, and then apply the threshold yourself.
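Applying the threshold by hand might look like the sketch below. It assumes a binary problem with labels 0 and 1, where column 1 of the predict_proba output holds the probability of the positive class. The helper name apply_threshold is hypothetical, and the probabilities are hand-written so the example runs without a fitted model.

```python
import numpy as np


def apply_threshold(proba, threshold=0.5):
    """Turn an (n_samples, 2) probability array into 0/1 labels.

    A sample is labelled 1 when its positive-class probability
    (column 1) is above the threshold, otherwise 0.
    """
    proba = np.asarray(proba)
    return (proba[:, 1] > threshold).astype(int)


# Hand-written probabilities; in practice this would be
# proba = clf.predict_proba(X) from a fitted sklearn classifier.
proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.25, 0.75]])
labels = apply_threshold(proba, threshold=0.7)
```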

Disclaimer: I have zero Spark experience; the answer is based on the sklearn and Spark docs.

Answer 2

By now I have figured out that, as indicated by the parameter standardization=True, pyspark standardizes the data within the model, whereas scikit doesn't. Applying preprocessing.scale before fitting the scikit model gave me closely matching results for both models.
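A rough sketch of that preprocessing step, using synthetic data made up for illustration. Only the call to preprocessing.scale comes from this answer; the solver, tol, and the large C (meant to approximate pyspark's default regParam=0.0) are assumptions added here, not settings the answer reports using.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# Synthetic features on very different scales (illustration only).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = (X[:, 0] + X[:, 1] / 10.0 + rng.normal(size=200) > 0).astype(int)

# Standardize each column to zero mean and unit variance,
# mimicking what pyspark does internally with standardization=True.
X_scaled = preprocessing.scale(X)

clf = LogisticRegression(
    solver="lbfgs",      # closer to pyspark's training method (assumption)
    C=1e6,               # very weak regularization, approximating regParam=0.0
    tol=1e-6,            # pyspark's default tolerance
    max_iter=100,
    fit_intercept=True,
)
clf.fit(X_scaled, y)
```

Coefficients fitted this way refer to the scaled features, so they may need converting back to the original feature scale before being compared with the coefficients pyspark reports.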
