
Predicting Probabilities in a Logistic Regression Model in Apache Spark MLlib

I am working with Apache Spark to build a logistic regression model using the LogisticRegressionWithLBFGS() class provided by MLlib. Once the model is built, we can use the predict function it provides, but it gives only the binary labels as output. I also want the probabilities to be calculated.
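For context, a minimal training sketch (the `training` RDD name is my own, not from the question; `setNumClasses(2)` is the default for a binary problem):

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// `training` is assumed to be an RDD[LabeledPoint] prepared elsewhere.
def trainBinaryLR(training: RDD[LabeledPoint]): LogisticRegressionModel = {
  new LogisticRegressionWithLBFGS()
    .setNumClasses(2)  // binary classification
    .run(training)     // returns a LogisticRegressionModel
}
```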

There is an implementation of this in the Spark source:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala

override protected def predictPoint(
    dataMatrix: Vector,
    weightMatrix: Vector,
    intercept: Double) = {
  require(dataMatrix.size == numFeatures)

  // If dataMatrix and weightMatrix have the same dimension, it's binary logistic regression.
  if (numClasses == 2) {
    val margin = dot(weightMatrix, dataMatrix) + intercept
    val score = 1.0 / (1.0 + math.exp(-margin))
    threshold match {
      case Some(t) => if (score > t) 1.0 else 0.0
      case None => score
    }
  } // ... (multinomial branch omitted in this excerpt)

This method is not exposed (it is protected), and the probabilities are not available when a threshold is set. How can I use this function to get probabilities? The dot method used in the function above is also not exposed; it lives in the BLAS package, but it is not public.
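Since `dot` in the BLAS package is package-private, for the binary case the margin and sigmoid can be computed by hand from `model.weights` and `model.intercept`. A minimal sketch with plain arrays (the object and method names here are mine, not Spark's):

```scala
object BinaryLR {
  // Plain dot product, standing in for the package-private BLAS `dot`.
  def dot(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length)
    w.zip(x).map { case (a, b) => a * b }.sum
  }

  // P(y = 1 | x) for a binary logistic regression model.
  def probability(weights: Array[Double], intercept: Double,
                  features: Array[Double]): Double = {
    val margin = dot(weights, features) + intercept
    1.0 / (1.0 + math.exp(-margin))
  }
}
```

With a trained MLlib model you would pass `model.weights.toArray` and `model.intercept`.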

Call myModel.clearThreshold to get the raw prediction instead of the 0/1 labels.

Note that this only works for binary logistic regression (numClasses == 2).
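A sketch of this answer (the names `model` and `test` are assumptions: a trained binary LogisticRegressionModel and an RDD[LabeledPoint] of test data):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def scoresAndLabels(model: LogisticRegressionModel,
                    test: RDD[LabeledPoint]): RDD[(Double, Double)] = {
  model.clearThreshold()  // predict() now returns the raw probability in [0, 1]
  test.map(p => (model.predict(p.features), p.label))
}
```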

I encountered a similar problem when trying to obtain the raw predictions for a multiclass problem. For me, the best solution was to create a method by borrowing and customizing from the Spark MLlib logistic regression source. You can create one like so:

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.{DenseVector, Vector}

object ClassificationUtility {
  def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel):
      (Double, Array[Double]) = {
    require(dataMatrix.size == model.numFeatures)
    val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
    val weightsArray: Array[Double] = model.weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(
          s"weights only supports dense vector but got type ${model.weights.getClass}.")
    }
    var bestClass = 0
    var maxMargin = 0.0
    val withBias = dataMatrix.size + 1 == dataWithBiasSize
    // Index 0 (the reference class) is left at 0.0; classes 1..k-1 get a score.
    val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
    (0 until model.numClasses - 1).foreach { i =>
      var margin = 0.0
      dataMatrix.foreachActive { (index, value) =>
        if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
      }
      // The intercept is folded into the weights vector, so add it to the margin.
      if (withBias) {
        margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
      }
      if (margin > maxMargin) {
        maxMargin = margin
        bestClass = i + 1
      }
      classProbabilities(i + 1) = 1.0 / (1.0 + math.exp(-(margin - maxMargin)))
    }
    (bestClass.toDouble, classProbabilities)
  }
}

Note that it is only slightly different from the original method; it just calculates the logistic function of the input features. It also defines some vals and vars that are private in the original and declared outside of that method. Ultimately, it collects the scores in an array and returns it along with the best answer. I call my method like so:

// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test.map { case LabeledPoint(label, features) =>
  val (prediction, probabilities) = ClassificationUtility.predictPoint(features, model)
  (prediction, label, probabilities)
}

However:

It seems the Spark contributors are discouraging the use of MLlib in favor of ML. The ML logistic regression API currently does not support multiclass classification. I am now using OneVsRest, which acts as a wrapper for one-vs-all classification. I am working on a similar customization to get the raw scores.
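For reference, a minimal OneVsRest sketch with the ML API (the DataFrame names `trainingDF`/`testDF` are assumptions; note this yields predicted labels, not raw scores, which is why a further customization is still needed):

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.sql.DataFrame

def oneVsRestPredictions(trainingDF: DataFrame, testDF: DataFrame): DataFrame = {
  val classifier = new LogisticRegression().setMaxIter(100)
  val ovr = new OneVsRest().setClassifier(classifier)
  val model = ovr.fit(trainingDF)  // trains one binary model per class
  model.transform(testDF)          // adds a "prediction" column
}
```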

I believe the call is myModel.clearThreshold(); i.e. myModel.clearThreshold without the parentheses fails. See the linear SVM example here.
