Replicate logistic regression model from pyspark in scikit-learn

Problem: The default implementations (no custom parameters set) of the logistic regression model in pyspark and scikit-learn seem to yield different results given their default parameter values.

I am trying to replicate a result from logistic regression (no custom parameters set) performed with pyspark (see: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression ) with the logistic regression model from scikit-learn (see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html ).

It appears to me that the two model implementations (in pyspark and scikit-learn) do not share the same parameters, so I can't simply match the parameters in scikit-learn to those in pyspark. Is there a way to match both models on their default configuration?

Parameters of the scikit-learn model (default parameters):

`LogisticRegression(
C=1.0,
class_weight=None,
dual=False,
fit_intercept=True,
intercept_scaling=1,
max_iter=100,
multi_class='ovr',
n_jobs=1,
penalty='l2',
random_state=None,
solver='liblinear',
tol=0.0001,
verbose=0,
warm_start=False)`

Parameters of the pyspark model (default parameters):

LogisticRegression(self, 
featuresCol="features", 
labelCol="label", 
predictionCol="prediction", 
maxIter=100,
regParam=0.0, 
elasticNetParam=0.0, 
tol=1e-6, 
fitIntercept=True, 
threshold=0.5, 
thresholds=None, 
probabilityCol="probability", 
rawPredictionCol="rawPrediction", 
standardization=True, 
weightCol=None, 
aggregationDepth=2, 
family="auto")

Thank you very much!

pyspark's LR uses ElasticNet regularization, which is a weighted sum of L1 and L2 terms; the weight is elasticNetParam. So with elasticNetParam=0 you get L2 regularization, and regParam is the L2 regularization coefficient; with elasticNetParam=1 you get L1 regularization, and regParam is the L1 regularization coefficient. C in sklearn LogisticRegression is the inverse of regParam, i.e. regParam = 1/C. Note that sklearn's objective sums the loss over the samples while Spark's averages it, so for n training samples the exact correspondence is likely regParam = 1/(n*C). With the default regParam=0.0 pyspark applies no regularization at all, whereas sklearn's default C=1.0 does.
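The inverse relationship between regParam and C can be sketched as a small conversion helper. This is only a sketch: the function name `spark_reg_to_sklearn_C` is hypothetical, and the optional division by the number of samples rests on the assumption that Spark averages the loss over samples while sklearn sums it.

```python
def spark_reg_to_sklearn_C(reg_param, n_samples=None):
    """Hypothetical helper: translate pyspark's regParam into sklearn's C.

    Without n_samples it applies the plain rule regParam = 1/C.
    With n_samples it applies regParam = 1/(n_samples * C), which assumes
    Spark averages the loss over samples while sklearn sums it.
    """
    if reg_param < 0:
        raise ValueError("reg_param must be non-negative")
    if reg_param == 0:
        # No regularization: C grows without bound. In practice, pass a
        # very large C to sklearn (or disable the penalty).
        return float("inf")
    if n_samples is None:
        return 1.0 / reg_param
    return 1.0 / (reg_param * n_samples)


print(spark_reg_to_sklearn_C(0.1))                  # plain inverse
print(spark_reg_to_sklearn_C(0.01, n_samples=100))  # inverse, scaled by n
print(spark_reg_to_sklearn_C(0.0))                  # pyspark default regParam
```

The last call shows why the two defaults cannot agree: pyspark's regParam=0.0 corresponds to an unbounded C, not to sklearn's default C=1.0.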

Also, the default training methods are different; you may need to set solver='lbfgs' in sklearn LogisticRegression to make the training methods more similar. It only works with L2 though.

If you need ElasticNet regularization (i.e. 0 < elasticNetParam < 1), then sklearn implements it in SGDClassifier - set penalty='elasticnet' together with a logistic loss (loss='log', renamed to 'log_loss' in recent releases); alpha would be similar to regParam (and you don't have to invert it, like C), and l1_ratio would be elasticNetParam.

sklearn doesn't provide a threshold directly, but you can use predict_proba instead of predict, and then apply the threshold yourself.
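Applying a threshold by hand might look like the sketch below. The data is synthetic and the threshold of 0.7 is an arbitrary illustrative value; whether the comparison should be strict is an assumption you may want to check against pyspark's behaviour.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

threshold = 0.7  # illustrative; pyspark's default threshold is 0.5

# Column 1 holds the predicted probability of clf.classes_[1].
proba_pos = clf.predict_proba(X)[:, 1]

# Predict the second class only when its probability exceeds the threshold.
pred = np.where(proba_pos > threshold, clf.classes_[1], clf.classes_[0])

print(pred[:10])
```

Raising the threshold above 0.5 can only reduce the number of samples assigned to the second class compared with clf.predict.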

Disclaimer: I have zero Spark experience; the answer is based on the sklearn and Spark docs.

By now I have figured out that, as indicated by the parameter standardization=True, pyspark standardizes the data within the model whereas scikit-learn doesn't. Applying preprocessing.scale before fitting the scikit-learn model gave me closely matching results for both models.
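A rough sketch of that preprocessing step, combined with settings chosen to resemble pyspark's defaults. The data is synthetic, and everything beyond the preprocessing.scale call is an assumption rather than something taken from the original setup: the very large C approximates regParam=0.0, and solver, tol and max_iter are picked to mirror the pyspark parameter list.

```python
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Standardize each feature to zero mean and unit variance, since
# scikit-learn does not do this inside the model.
X_scaled = preprocessing.scale(X)

clf = LogisticRegression(
    C=1e9,              # assumption: very weak penalty, approximating regParam=0.0
    solver="lbfgs",     # assumption: closer to pyspark's training method
    tol=1e-6,           # pyspark default tol
    max_iter=100,       # pyspark default maxIter
    fit_intercept=True,
)
clf.fit(X_scaled, y)

print(clf.coef_)
print(clf.intercept_)
```

The printed coefficients refer to the scaled features, so they should be compared with care against coefficients reported for unscaled inputs.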
