
How to output the count of all pairwise combinations of two binary columns from a Spark dataframe, even when the count is zero?

How can I output the count of every pairwise combination of two binary (0/1) columns from a Spark dataframe, even when the count is zero?

final_sdf.groupBy('actual', 'prediction').count().show()

The current output is:

[image: Current]

But my desired output includes the zero-count groups, as below.

[image: Desired]

Okay, the idea is to first create the missing binary rows, set their count to 0, filter out the combinations that already exist, and then append the result to the dataset.

Let's assume our main dataset is called df1 and looks as below:

+------+----------+-----+
|actual|prediction|count|
+------+----------+-----+
|1     |1.0       |944  |
|0     |1.0       |208  |
+------+----------+-----+

First, let's create a column called array that holds both actual and abs(actual - 1); this way, each row carries both binary values, including the missing one. Then we explode that column back into prediction and drop the array column.

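import org.apache.spark.sql.functions.{abs, array, col, explode, lit}

// For each row, emit one row per binary prediction value (actual and its flip).
val df2 = df1
  .withColumn("array", array(col("actual"), abs(col("actual") - 1)))
  .withColumn("prediction", explode(col("array")))
  .drop("array")
df2.show(false)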
+------+----------+-----+
|actual|prediction|count|
+------+----------+-----+
|1     |1         |944  |
|1     |0         |944  |
|0     |0         |208  |
|0     |1         |208  |
+------+----------+-----+
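
To see why the array column covers both binary values, here is a minimal plain-Python sketch of the same explode step. It does not use Spark; the rows list and the explode_predictions name are hypothetical and exist only for illustration.

```python
# Plain-Python illustration of the explode step; `rows` mirrors df1 above.
rows = [(1, 1.0, 944), (0, 1.0, 208)]  # (actual, prediction, count)

def explode_predictions(rows):
    """For each row, emit one row per binary prediction value."""
    out = []
    for actual, _prediction, count in rows:
        # abs(actual - 1) flips 0 <-> 1, so the pair covers both values.
        for prediction in (actual, abs(actual - 1)):
            out.append((actual, prediction, count))
    return out

for row in explode_predictions(rows):
    print(row)
```

Each input row yields two output rows, and the original count is copied onto both, which is why the next step has to reset it for the combinations that did not exist.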

Then we do an anti join of df2 against df1, which keeps only the combinations that are missing from df1, and overwrite their count value with 0.

// Keep only the rows of df2 that have no match in df1, then zero their count.
val df3 = df2.join(df1, Seq("actual", "prediction", "count"), "anti")
  .withColumn("count", lit(0))
df3.show(false)
+------+----------+-----+
|actual|prediction|count|
+------+----------+-----+
|1     |0         |0    |
|0     |0         |0    |
+------+----------+-----+

Finally, we union these two dataframes:

df1.union(df3).show(10)
+------+----------+-----+
|actual|prediction|count|
+------+----------+-----+
|     1|       1.0|  944|
|     0|       1.0|  208|
|     1|       0.0|    0|
|     0|       0.0|    0|
+------+----------+-----+
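
Note that the explode trick relies on every actual value appearing at least once in df1. A more general variant, sketched here in plain Python rather than Spark, enumerates the full set of (actual, prediction) pairs and fills in 0 wherever a pair has no count. The fill_zero_counts name and the observed dict are hypothetical. Porting this to Spark would probably mean a crossJoin of the distinct values, a left join onto the counts, and filling nulls with 0, but that port is an assumption and is not shown or tested here.

```python
from itertools import product

def fill_zero_counts(counts, values=(0, 1)):
    """Return a count for every (actual, prediction) pair, defaulting to 0."""
    return {pair: counts.get(pair, 0) for pair in product(values, repeat=2)}

# Counts from the example above, keyed by (actual, prediction).
observed = {(1, 1): 944, (0, 1): 208}

for (actual, prediction), count in fill_zero_counts(observed).items():
    print(actual, prediction, count)
```

Because the pairs come from the fixed value set rather than from the data, a combination still appears with count 0 even if one of the classes is absent from the input entirely.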

I hope this is what you need!
