Spark ML Pipeline Logistic Regression Produces Much Worse Predictions Than R GLM

I used the ML Pipeline to run logistic regression models, but for some reason I got worse results than R. I have done some research, and the only post I found related to this issue is this. It seems that Spark logistic regression returns models that minimize the loss function, while R's glm function uses maximum likelihood. The Spark model only got 71.3% of the records right, while R predicts 95.55% of the cases correctly. I was wondering if I did something wrong in the setup and if there's a way to improve the predictions. Below are my Spark code and R code:

Spark code

partial model_input  
label,AGE,GENDER,Q1,Q2,Q3,Q4,Q5,DET_AGE_SQ  
1.0,39,0,0,1,0,0,1,31.55709342560551  
1.0,54,0,0,0,0,0,0,83.38062283737028  
0.0,51,0,1,1,1,0,0,35.61591695501733



import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Assemble the raw input columns into the single "features" vector
// column that LogisticRegression expects.
val assembler = new VectorAssembler().
  setInputCols(Array("AGE", "GENDER", "DET_AGE_SQ",
    "Q1", "Q2", "Q3", "Q4", "Q5")).
  setOutputCol("features")

def trainModel(df: DataFrame): PipelineModel = {
  // Very high iteration cap and very tight tolerance (see the edit below).
  val lr = new LogisticRegression().setMaxIter(100000).setTol(1e-16)
  // Run the assembler as the first stage so fit() works on the raw columns.
  val pipeline = new Pipeline().setStages(Array(assembler, lr))
  pipeline.fit(df)
}

// Nominal (categorical) metadata for the label column.
val meta = NominalAttribute.defaultAttr.withName("label").withValues(Array("a", "b")).toMetadata

val model = trainModel(model_input)
val pred = model.transform(model_input)
pred.filter("label != prediction").count  // number of misclassified records
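
For a side-by-side comparison with the R confusion table below, the same counts can be pulled from pred; a minimal sketch using only standard DataFrame operations:

// Confusion table, analogous to R's table(pred$y, pred$p > 0.5).
pred.groupBy("label", "prediction").count.show()

// Overall accuracy from the same predictions.
val accuracy = 1.0 - pred.filter("label != prediction").count.toDouble / pred.count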

R code

library(dplyr)  # provides the %>% pipe

lr <- model_input %>% glm(data = ., formula = label ~ AGE + GENDER + Q1 + Q2 + Q3 + Q4 + Q5 + DET_AGE_SQ,
                          family = binomial)
pred <- data.frame(y = model_input$label, p = fitted(lr))
table(pred$y, pred$p > 0.5)

Feel free to let me know if you need any other information. Thank you!

Edit 9/18/2015: I have tried increasing the maximum number of iterations and decreasing the tolerance dramatically. Unfortunately, it didn't improve the predictions. It seems the model converged to a local minimum instead of the global minimum.

It seems that Spark Logistic Regression returns models that minimize loss function while R glm function uses maximum likelihood.

Minimization of a loss function is pretty much the definition of linear models, and glm and ml.classification.LogisticRegression are no different here. The fundamental difference between the two is how that minimization is achieved.
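
To make that concrete: for labels $y_i \in \{0, 1\}$, the negative log-likelihood of the logistic model is exactly the log loss, so minimizing the loss and maximizing the likelihood are the same objective; only the optimizer differs:

$$\min_{w}\; -\sum_{i=1}^{n}\left[y_i \log \sigma(w^\top x_i) + (1 - y_i)\log\left(1 - \sigma(w^\top x_i)\right)\right],\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$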

All linear models in ML/MLlib are based on some variant of gradient descent. The quality of a model generated using this approach varies on a case-by-case basis and depends on the gradient descent and regularization parameters.
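
As an illustration of those knobs (a sketch, not the question's settings), the optimizer and regularization of ml's LogisticRegression are controlled through its setters; the values below are placeholders, not recommendations:

import org.apache.spark.ml.classification.LogisticRegression

val tunedLr = new LogisticRegression()
  .setMaxIter(200)         // cap on optimizer iterations
  .setTol(1e-8)            // convergence tolerance for the objective
  .setRegParam(0.01)       // regularization strength (lambda)
  .setElasticNetParam(0.0) // 0.0 = pure L2, 1.0 = pure L1
  .setFitIntercept(true)   // fit an intercept term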

R, on the other hand, computes an exact solution (glm fits by iteratively reweighted least squares) which, given its time complexity, is not well suited for large datasets.

As I've mentioned above, the quality of a model generated using gradient descent depends on the input parameters, so the typical way to improve it is to perform hyperparameter optimization. Unfortunately, the ML version is rather limited here compared to MLlib, but for starters you can increase the number of iterations.
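
For example, a grid search with cross-validation over the regularization parameters can be wired up with ml.tuning; this is a minimal sketch that reuses the assembler and model_input from the question, and the grid values are illustrative only:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(assembler, lr))

// Candidate hyperparameter combinations to try.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .addGrid(lr.maxIter, Array(100, 1000))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator()) // areaUnderROC by default
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(model_input) // best model by cross-validated AUC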
