
Spark: Row filter based on Column value

I have millions of rows in a DataFrame like this:

val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")

scala> df.show(false)
+---+--------+
|id |status  |
+---+--------+
|id1|ACTIVE  |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE  |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+

Now I want to divide this data into three separate DataFrames like this:

  1. Only ACTIVE ids (like id2), say activeDF
  2. Only INACTIVE ids (like id3), say inactiveDF
  3. Ids having both ACTIVE and INACTIVE statuses, say bothDF

How can I calculate activeDF and inactiveDF?

I know that bothDF can be calculated like

df.select("id").distinct.except(activeDF).except(inactiveDF)

but this will involve shuffling (as the distinct operation requires one). Is there any better way to calculate bothDF?

Versions:

Spark : 2.2.1
Scala : 2.11

The most elegant solution is to pivot on status:

val counts = df
  .groupBy("id")
  .pivot("status", Seq("ACTIVE", "INACTIVE"))
  .count
  .na.fill(0)  // pivot leaves missing id/status combinations as null; the CASE below expects 0

or the equivalent direct agg:

import org.apache.spark.sql.functions.{count, when}  // needed when running outside spark-shell

val counts = df
  .groupBy("id")
  .agg(
    count(when($"status" === "ACTIVE", true)) as "ACTIVE",
    count(when($"status" === "INACTIVE", true)) as "INACTIVE"
  )

followed by a simple CASE ... WHEN:

val result = counts.withColumn(
  "status",
  when($"ACTIVE" === 0, "INACTIVE")
    .when($"inactive" === 0, "ACTIVE")
    .otherwise("BOTH")
)

result.show
+---+------+--------+--------+                                                  
| id|ACTIVE|INACTIVE|  status|
+---+------+--------+--------+
|id3|     0|       2|INACTIVE|
|id1|     1|       2|    BOTH|
|id2|     1|       0|  ACTIVE|
+---+------+--------+--------+

Later you can separate the result with filters, or dump it to disk with a source that supports partitionBy (see How to split a dataframe into dataframes with same column values?), for example:
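A minimal sketch of that split, assuming the result DataFrame computed above (the parquet path is just an illustrative placeholder):

// Carve the aggregated result into the three requested DataFrames
val activeDF   = result.filter($"status" === "ACTIVE").select("id")
val inactiveDF = result.filter($"status" === "INACTIVE").select("id")
val bothDF     = result.filter($"status" === "BOTH").select("id")

// Or write everything once, partitioned by status, so each status value
// ends up in its own directory ("/tmp/ids_by_status" is an example path)
result
  .select("id", "status")
  .write
  .partitionBy("status")
  .parquet("/tmp/ids_by_status")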

Just another way: groupBy, collect the statuses as a set, and then, if the size of the set is 1, the id is ACTIVE-only or INACTIVE-only; otherwise it has both:

scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
df: org.apache.spark.sql.DataFrame = [id: string, status: string]

scala> df.show(false)
+---+--------+
|id |status  |
+---+--------+
|id1|ACTIVE  |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE  |
|id3|INACTIVE|
|id3|INACTIVE|
|id4|ACTIVE  |
|id5|ACTIVE  |
|id6|INACTIVE|
|id7|ACTIVE  |
|id7|INACTIVE|
+---+--------+


scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array<string>]

scala> allstatusDF.show(false)
+---+------------------+
|id |allstatus         |
+---+------------------+
|id7|[ACTIVE, INACTIVE]|
|id3|[INACTIVE]        |
|id5|[ACTIVE]          |
|id6|[INACTIVE]        |
|id1|[ACTIVE, INACTIVE]|
|id2|[ACTIVE]          |
|id4|[ACTIVE]          |
+---+------------------+


scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
+---+------------------+--------+
|id |allstatus         |status  |
+---+------------------+--------+
|id7|[ACTIVE, INACTIVE]|BOTH    |
|id3|[INACTIVE]        |INACTIVE|
|id5|[ACTIVE]          |ACTIVE  |
|id6|[INACTIVE]        |INACTIVE|
|id1|[ACTIVE, INACTIVE]|BOTH    |
|id2|[ACTIVE]          |ACTIVE  |
|id4|[ACTIVE]          |ACTIVE  |
+---+------------------+--------+
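
As a rough usage sketch on top of allstatusDF (not part of the original transcript), the three DataFrames from the question can then be carved out with size-based filters, avoiding the distinct/except pass:

// Single-status ids: the lone set element tells us which group they belong to
val activeDF   = allstatusDF.filter(size($"allstatus") === 1 && $"allstatus".getItem(0) === "ACTIVE").select("id")
val inactiveDF = allstatusDF.filter(size($"allstatus") === 1 && $"allstatus".getItem(0) === "INACTIVE").select("id")
// Ids whose status set has more than one element appeared with both statuses
val bothDF     = allstatusDF.filter(size($"allstatus") > 1).select("id")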
