
Spark logistic regression for binary classification: apply new threshold for predicting 2 classes

I am new to both Spark and Python. I used Spark to train a logistic regression model with just two classes (0 and 1). I trained it using my training DataFrame.

This is how my pipeline model was defined:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# Model definition:
lr = LogisticRegression(featuresCol="lr_features", labelCol="targetvar")
# Pipeline definition (indexStages, encodeStages and lr_assembler
# are sketched below):
lr_pipeline = Pipeline(stages=indexStages + encodeStages + [lr_assembler, lr])
# Fit the logistic regression model:
lrModel = lr_pipeline.fit(train)
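For context, indexStages, encodeStages and lr_assembler are defined earlier in my script. A minimal sketch of what they look like (the column names below are placeholders, not my real ones):

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

indexStages = [StringIndexer(inputCol="cat", outputCol="cat_idx")]
encodeStages = [OneHotEncoder(inputCol="cat_idx", outputCol="cat_vec")]
lr_assembler = VectorAssembler(inputCols=["cat_vec", "num_feature"],
                               outputCol="lr_features")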

Then I ran predictions using my test DataFrame:

lr_predictions = lrModel.transform(test)
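The transform adds rawPrediction, probability and prediction columns (with their default names); a quick way to inspect them:

lr_predictions.select("probability", "prediction").show(5, truncate=False)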

Now, my lr_predictions has a 'probability' column that looks like a nested list to me. For example, its first cell contains: [1,2,[],[0.88,0.11]].
I assume this means: the probability for class 1 (which is = 0) is 0.88, and the probability for class 2 (which is = 1) is 0.11.
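As far as I can tell, that cell is just how a probability vector (a pyspark.ml.linalg.DenseVector) is displayed, and it can be indexed by position. A small sketch of inspecting the first row:

row = lr_predictions.select("probability").first()
p0 = row["probability"][0]   # probability of class 0 (0.88 here)
p1 = row["probability"][1]   # probability of class 1 (0.11 here)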

By default (threshold = 0.5) this observation is predicted as 0. However, I found a value (bestThreshold) that maximizes the F-measure (in my case it's 0.21):

fMeasure = lr_summary.fMeasureByThreshold
bestThreshold = fMeasure.orderBy(fMeasure['F-Measure'].desc()).first().threshold
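For reference, lr_summary here is the fitted model's training summary. A sketch of how I pull it out of the PipelineModel, assuming the logistic regression is the last stage:

lr_stage = lrModel.stages[-1]   # the fitted LogisticRegressionModel
lr_summary = lr_stage.summary   # its BinaryLogisticRegressionTrainingSummary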

I would like to apply bestThreshold to the 'probability' column and get a new column ('pred_new', for example) that contains the class assignments (0 or 1) based on bestThreshold as opposed to 0.5.

I cannot use the code below, because the 'probability' column is too complex:

from pyspark.sql.functions import when
# This fails: 'probability' is a vector column, not a scalar,
# so it cannot be compared to a number directly.
lr_predictions = lr_predictions.withColumn("prob_best", \
    when(lr_predictions["probability"] >= bestThreshold, 1).otherwise(0))

I feel I need to map the 'probability' column to a new column based on the new threshold. But I am not sure how to do it, given this (for me) complex structure of the 'probability' column.

Thank you so much for your advice!

If lrModel is a LogisticRegressionModel:

type(lrModel)
## pyspark.ml.classification.LogisticRegressionModel

You can use the internal Java object to set the threshold:

lrModel._java_obj.setThreshold(bestThreshold)

and transform:

lrModel.transform(data)

You can do the same to modify rawPredictionCol, predictionCol and probabilityCol, as in the sketch below.
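A hedged sketch of the same trick for the output columns (these setter names come from the underlying Scala model API, so treat them as an assumption rather than tested code):

lrModel._java_obj.setThreshold(bestThreshold)
lrModel._java_obj.setPredictionCol("pred_new")    # rename the prediction column
lrModel._java_obj.setProbabilityCol("prob_new")   # rename the probability column
lrModel.transform(data)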

This should become part of the public API in the future (2.3):

lrModel.transform(data, {lrModel.threshold: bestThreshold})

You can also use a UDF:

from pyspark.sql.functions import udf, lit

@udf("integer")
def predict(v, threshold):
    # Predict 1 when the probability of class 1 exceeds the threshold,
    # which matches what setThreshold does:
    return 1 if v[1] > threshold else 0

lr_predictions.withColumn(
    "prob_best",
    predict(lr_predictions["probability"], lit(bestThreshold)))
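With bestThreshold = 0.21 this predicts 1 whenever the probability of class 1 exceeds 0.21, the same decision rule setThreshold(bestThreshold) applies on the model itself; for the example row above (0.11 for class 1) the prediction stays 0.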

Edit:

With a PipelineModel you can try to access the LogisticRegressionModel stage (as in your previous question) and do the same thing, along the lines of the sketch below.
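A minimal sketch, assuming the LogisticRegressionModel sits in the last stage of the fitted PipelineModel:

lr_stage = lrModel.stages[-1]                  # the LogisticRegressionModel stage
lr_stage._java_obj.setThreshold(bestThreshold)
lr_predictions = lrModel.transform(test)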
