Replicate logistic regression model from pyspark in scikit-learn
Problem: The default implementations (no custom parameters set) of the logistic regression model in pyspark and scikit-learn seem to yield different results.
I am trying to replicate a result from logistic regression (no custom parameters set) performed with pyspark (see: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression ) with the logistic regression model from scikit-learn (see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html ).
It appears to me that the two model implementations (in pyspark and scikit) do not have the same parameters, so I can't simply set the parameters in scikit to match those in pyspark. Is there a way to match both models in their default configuration?
Parameters Scikit model (default parameters):
`LogisticRegression(
C=1.0,
class_weight=None,
dual=False,
fit_intercept=True,
intercept_scaling=1,
max_iter=100,
multi_class='ovr',
n_jobs=1,
penalty='l2',
random_state=None,
solver='liblinear',
tol=0.0001,
verbose=0,
warm_start=False)`
Parameters Pyspark model (default parameters):
LogisticRegression(self,
featuresCol="features",
labelCol="label",
predictionCol="prediction",
maxIter=100,
regParam=0.0,
elasticNetParam=0.0,
tol=1e-6,
fitIntercept=True,
threshold=0.5,
thresholds=None,
probabilityCol="probability",
rawPredictionCol="rawPrediction",
standardization=True,
weightCol=None,
aggregationDepth=2,
family="auto")
Thank you very much!
pyspark's LR uses ElasticNet regularization, which is a weighted sum of L1 and L2 terms; the weight is elasticNetParam. So with elasticNetParam=0 you get L2 regularization, and regParam is the L2 regularization coefficient; with elasticNetParam=1 you get L1 regularization, and regParam is the L1 regularization coefficient. C in sklearn LogisticRegression is the inverse of regParam, i.e. regParam = 1/C.
Also, the default training methods are different; you may need to set solver='lbfgs' in sklearn LogisticRegression to make the training methods more similar. That solver only works with L2 regularization, though.
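As a rough illustration of the mapping described above, here is a minimal sketch of a scikit-learn model configured to resemble the pyspark defaults. The helper `sklearn_c_from_reg_param` and the synthetic data are hypothetical and only for illustration. Because pyspark's default is regParam=0.0 (no penalty) and 1/0 is undefined, the sketch assumes that a very large C is an acceptable stand-in for "no regularization". The two libraries may also normalize the loss differently (for example by the number of samples), so the exact scaling between C and regParam is worth verifying on your own data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sklearn_c_from_reg_param(reg_param, no_penalty_c=1e10):
    """Hypothetical helper: map a pyspark-style regParam to sklearn's C.

    Follows the regParam = 1/C relation from the answer above.
    regParam == 0 is approximated by a very large C (an assumption).
    """
    if reg_param == 0:
        return no_penalty_c
    return 1.0 / reg_param


# Synthetic, purely illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200) > 0).astype(int)

# Mirror the pyspark defaults listed in the question:
# regParam=0.0, elasticNetParam=0.0 (pure L2), tol=1e-6, maxIter=100
clf = LogisticRegression(
    C=sklearn_c_from_reg_param(0.0),
    penalty="l2",
    solver="lbfgs",
    tol=1e-6,
    max_iter=100,
    fit_intercept=True,
)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
```

The coefficients printed here can then be compared against the `coefficients` and `intercept` of the fitted pyspark model.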
If you need ElasticNet regularization (i.e. 0 < elasticNetParam < 1), then sklearn implements it in SGDClassifier: set penalty='elasticnet' (together with a logistic loss); alpha would be similar to regParam (you don't have to invert it, unlike C), and l1_ratio would be elasticNetParam.
sklearn doesn't provide a threshold parameter directly, but you can use predict_proba instead of predict, and then apply the threshold yourself.
Disclaimer: I have zero Spark experience; the answer is based on the sklearn and Spark docs.
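Applying a threshold by hand could look like the sketch below. The function `predict_with_threshold` is a hypothetical helper, not part of scikit-learn, and the data is synthetic. It assumes a binary problem with labels 0 and 1, and that a strict "greater than" comparison matches the behaviour you want; the exact tie-breaking rule in pyspark is something to check in its documentation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def predict_with_threshold(model, X, threshold=0.5):
    """Hypothetical helper: label as 1 when P(class 1) exceeds threshold."""
    # Column 1 of predict_proba holds the probability of model.classes_[1]
    proba_pos = model.predict_proba(X)[:, 1]
    return (proba_pos > threshold).astype(int)


# Synthetic, purely illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

pred_default = predict_with_threshold(clf, X, threshold=0.5)
pred_strict = predict_with_threshold(clf, X, threshold=0.8)

# A higher threshold can only reduce the number of positive predictions
print(pred_default.sum(), pred_strict.sum())
```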
By now I have figured out that, as indicated by the parameter standardization=True, pyspark standardizes the data within the model whereas scikit doesn't. Applying preprocessing.scale before fitting the scikit model gave me closely matching results for both models.
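A minimal sketch of that preprocessing step is shown below, using synthetic data with deliberately different feature scales. The model settings (large C, lbfgs) are assumptions carried over from the answer above, not something stated in this post. The details of pyspark's internal standardization may differ from `preprocessing.scale` (for example in how the standard deviation is computed), so small differences may remain.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# Synthetic data whose columns live on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y = (X[:, 0] + X[:, 1] / 50.0 + rng.normal(scale=0.3, size=200) > 0).astype(int)

# Centre each column to mean 0 and scale it to unit variance
X_scaled = preprocessing.scale(X)

# Assumed settings meant to resemble the pyspark defaults (see above)
clf = LogisticRegression(C=1e10, solver="lbfgs", tol=1e-6, max_iter=100)
clf.fit(X_scaled, y)
print(clf.coef_, clf.intercept_)
```

Note that the coefficients fitted this way refer to the scaled features, so they need to be compared with that in mind.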