
Spark: Row filter based on Column value

I have millions of rows in a DataFrame like this:

val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")

scala> df.show(false)
+---+--------+
|id |status  |
+---+--------+
|id1|ACTIVE  |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE  |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+

Now I want to divide this data into three separate DataFrames like this:

  1. Only ACTIVE ids (like id2), say activeDF
  2. Only INACTIVE ids (like id3), say inactiveDF
  3. Ids having both ACTIVE and INACTIVE statuses, say bothDF

How can I calculate activeDF and inactiveDF?

I know that bothDF can be calculated like

df.select("id").distinct.except(activeDF).except(inactiveDF)

but this will involve shuffling (as the distinct operation requires one). Is there any better way to calculate bothDF?

Versions:

Spark : 2.2.1
Scala : 2.11

The most elegant solution is to pivot on status:

val counts = df
  .groupBy("id")
  .pivot("status", Seq("ACTIVE", "INACTIVE"))
  .count
  .na.fill(0)  // pivot leaves missing id/status combinations as null; the CASE below expects 0

or the equivalent direct agg:

import org.apache.spark.sql.functions.{count, when}  // needed when running outside spark-shell

val counts = df
  .groupBy("id")
  .agg(
    count(when($"status" === "ACTIVE", true)) as "ACTIVE",
    count(when($"status" === "INACTIVE", true)) as "INACTIVE"
  )

followed by a simple CASE ... WHEN:

val result = counts.withColumn(
  "status",
  when($"ACTIVE" === 0, "INACTIVE")
    .when($"inactive" === 0, "ACTIVE")
    .otherwise("BOTH")
)

result.show
+---+------+--------+--------+                                                  
| id|ACTIVE|INACTIVE|  status|
+---+------+--------+--------+
|id3|     0|       2|INACTIVE|
|id1|     1|       2|    BOTH|
|id2|     1|       0|  ACTIVE|
+---+------+--------+--------+

Later you can separate the result with filters, or dump it to disk with a source that supports partitionBy (see How to split a dataframe into dataframes with same column values?), for example:
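A minimal sketch of that split, assuming the result DataFrame computed above (the parquet path is just an illustrative placeholder):

// Carve the aggregated result into the three requested DataFrames
val activeDF   = result.filter($"status" === "ACTIVE").select("id")
val inactiveDF = result.filter($"status" === "INACTIVE").select("id")
val bothDF     = result.filter($"status" === "BOTH").select("id")

// Or write everything once, partitioned by status, so each status value
// ends up in its own directory ("/tmp/ids_by_status" is an example path)
result
  .select("id", "status")
  .write
  .partitionBy("status")
  .parquet("/tmp/ids_by_status")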

Just another way: groupBy, collect the statuses as a set, and then, if the size of the set is 1, the id is ACTIVE-only or INACTIVE-only; otherwise it has both:

scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
df: org.apache.spark.sql.DataFrame = [id: string, status: string]

scala> df.show(false)
+---+--------+
|id |status  |
+---+--------+
|id1|ACTIVE  |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE  |
|id3|INACTIVE|
|id3|INACTIVE|
|id4|ACTIVE  |
|id5|ACTIVE  |
|id6|INACTIVE|
|id7|ACTIVE  |
|id7|INACTIVE|
+---+--------+


scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array<string>]

scala> allstatusDF.show(false)
+---+------------------+
|id |allstatus         |
+---+------------------+
|id7|[ACTIVE, INACTIVE]|
|id3|[INACTIVE]        |
|id5|[ACTIVE]          |
|id6|[INACTIVE]        |
|id1|[ACTIVE, INACTIVE]|
|id2|[ACTIVE]          |
|id4|[ACTIVE]          |
+---+------------------+


scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
+---+------------------+--------+
|id |allstatus         |status  |
+---+------------------+--------+
|id7|[ACTIVE, INACTIVE]|BOTH    |
|id3|[INACTIVE]        |INACTIVE|
|id5|[ACTIVE]          |ACTIVE  |
|id6|[INACTIVE]        |INACTIVE|
|id1|[ACTIVE, INACTIVE]|BOTH    |
|id2|[ACTIVE]          |ACTIVE  |
|id4|[ACTIVE]          |ACTIVE  |
+---+------------------+--------+
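
As a rough usage sketch on top of allstatusDF (not part of the original transcript), the three DataFrames from the question can then be carved out with size-based filters, avoiding the distinct/except pass:

// Single-status ids: the lone set element tells us which group they belong to
val activeDF   = allstatusDF.filter(size($"allstatus") === 1 && $"allstatus".getItem(0) === "ACTIVE").select("id")
val inactiveDF = allstatusDF.filter(size($"allstatus") === 1 && $"allstatus".getItem(0) === "INACTIVE").select("id")
// Ids whose status set has more than one element appeared with both statuses
val bothDF     = allstatusDF.filter(size($"allstatus") > 1).select("id")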
