How to iterate grouped rows to produce multiple rows in spark structured streaming?

I have an input data set like this:

id     operation          value
1      null                1
1      discard             0
2      null                1
2      null                2
2      max                 0
3      null                1
3      null                1
3      list                0

I want to group the input and produce rows according to the "operation" column.

For group 1, operation="discard", so the output is null (nothing),

for group 2, operation="max", so the output is:

2      null                2

for group 3, operation="list", so the output is:

3      null                1
3      null                1

So the final output is:

id     operation          value
2      null                2
3      null                1
3      null                1

Is there a solution for this?

I know there is a similar question, how-to-iterate-grouped-data-in-spark, but the differences compared to that one are:

    1. I want to produce more than one row for each grouped data set. Is that possible, and how?
    2. I want my logic to be easily extended so that more operations can be added in the future. So are user-defined aggregate functions (aka UDAFs) the only possible solution?

Update 1:

Thanks stack0114106; here are more details following his answer. For example, for id=1, operation="max", I want to iterate over all the items with id=1 and find the max value, rather than assign a hard-coded value. That's why I want to iterate the rows in each group. Below is an updated example:

The input:

scala> val df = Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]

scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|0  |null     |1    |
|0  |discard  |0    |
|1  |null     |1    |
|1  |null     |2    |
|1  |max      |0    |
|2  |null     |1    |
|2  |null     |3    |
|2  |max      |0    |
|3  |null     |1    |
|3  |null     |1    |
|3  |list     |0    |
+---+---------+-----+

The expected output:

+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1  |null     |2    |
|2  |null     |3    |
|3  |null     |1    |
|3  |null     |1    |
+---+---------+-----+

Group everything, collecting the values, then write the logic for each operation:

import org.apache.spark.sql.functions._

// collect the operation name and all values for each id
val grouped = df.groupBy($"id").agg(max($"operation").as("op"), collect_list($"value").as("vals"))
// "max": explode the collected values and keep the maximum per id
val maxs = grouped.filter($"op" === "max").withColumn("val", explode($"vals")).groupBy($"id").agg(max("val").as("value"))
// "list": explode the collected values and keep the non-zero ones (0 marks the operation row)
val lists = grouped.filter($"op" === "list").withColumn("value", explode($"vals")).filter($"value" =!= 0).select($"id", $"value")
// we don't collect the "discard" groups,
// and we can add additional subsets for new "operations"
val result = maxs.union(lists)
// if you need the null "operation" column back, add it with withColumn
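
Following the last comment above, a minimal sketch of what that could look like (my assumption, not part of the original answer; the name finalResult is only for illustration):

// assumed follow-up: add "operation" back as a typed null string
// and restore the (id, operation, value) column order from the question;
// lit comes from the functions._ import above
val finalResult = result
  .withColumn("operation", lit(null).cast("string"))
  .select("id", "operation", "value")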

You can use the flatMap operation on the dataframe and generate the required rows based on the conditions you mentioned. Check this out:

scala> val df = Seq((1,null,1),(1,"discard",0),(2,null,1),(2,null,2),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]

scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1  |null     |1    |
|1  |discard  |0    |
|2  |null     |1    |
|2  |null     |2    |
|2  |max      |0    |
|3  |null     |1    |
|3  |null     |1    |
|3  |list     |0    |
+---+---------+-----+


scala> df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0 until s._1).map( i => (r.getInt(0),null,s._2) ) }).show(false)
+---+----+---+
|_1 |_2  |_3 |
+---+----+---+
|2  |null|2  |
|3  |null|1  |
|3  |null|1  |
+---+----+---+

Spark assigns _1, _2, etc., so you can map them to the actual column names by assigning them as below:

scala> val df2 = df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0 until s._1).map( i => (r.getInt(0),null,s._2) ) }).toDF("id","operation","value")
df2: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]

scala> df2.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|2  |null     |2    |
|3  |null     |1    |
|3  |null     |1    |
+---+---------+-----+


scala>
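
Note that the operation column is typed as NullType above (operation: null in df2's schema). If a proper nullable string column is wanted, one possible variation (an assumption on my part, not from the original answer; df2b is only an illustrative name) is to emit Option.empty[String] in the tuples:

// variation of the flatMap above (assumed): Option.empty[String] should make
// "operation" a nullable string column instead of a NullType column
val df2b = df.filter("operation is not null").flatMap { r =>
  val counts = r.getString(1) match {
    case "discard" => (0, 0)   // emit no rows
    case "max"     => (1, 2)   // one row with the hard-coded value 2
    case "list"    => (2, 1)   // two rows with the value 1
  }
  (0 until counts._1).map(_ => (r.getInt(0), Option.empty[String], counts._2))
}.toDF("id", "operation", "value")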

EDIT1:

Since you need the max(value) for each id, you can use a window function to get the max value in a new column, then use the same technique to get the results. Check this out:

scala> val df =   Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]

scala> df.createOrReplaceTempView("michael")

scala> val df2 = spark.sql(""" select *, max(value) over(partition by id) mx from michael """)
df2: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 2 more fields]

scala> df2.show(false)
+---+---------+-----+---+
|id |operation|value|mx |
+---+---------+-----+---+
|1  |null     |1    |2  |
|1  |null     |2    |2  |
|1  |max      |0    |2  |
|3  |null     |1    |1  |
|3  |null     |1    |1  |
|3  |list     |0    |1  |
|2  |null     |1    |3  |
|2  |null     |3    |3  |
|2  |max      |0    |3  |
|0  |null     |1    |1  |
|0  |discard  |0    |1  |
+---+---------+-----+---+


scala> val df3 = df2.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => 0 case "max" => 1 case "list" => 2 } ; (0 until s).map( i => (r.getInt(0),null,r.getInt(3) )) }).toDF("id","operation","value")
df3: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]


scala> df3.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1  |null     |2    |
|3  |null     |1    |
|3  |null     |1    |
|2  |null     |3    |
+---+---------+-----+


scala>
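
On the second point in the question (keeping the logic easy to extend without a UDAF), one possible sketch, assuming the groupBy/collect_list approach from the first answer: keep a plain Map from operation name to a function over each group's collected values, so adding a new operation only means adding one entry. The handlers map and the names here are hypothetical, not from either answer:

import org.apache.spark.sql.functions.{collect_list, max}
import spark.implicits._   // already in scope inside spark-shell

// hypothetical registry: operation name -> what to do with the group's collected values
val handlers: Map[String, Seq[Int] => Seq[Int]] = Map(
  "discard" -> (_ => Seq.empty[Int]),       // drop the whole group
  "max"     -> (vs => Seq(vs.max)),         // keep only the maximum collected value
  "list"    -> (vs => vs.filter(_ != 0))    // keep every value; 0 marks the operation row itself
)

val extensible = df.groupBy($"id")
  .agg(max($"operation").as("op"), collect_list($"value").as("vals"))
  .flatMap { r =>
    val id   = r.getInt(0)
    val op   = r.getString(1)
    val vals = r.getSeq[Int](2)
    handlers.getOrElse(op, (_: Seq[Int]) => Seq.empty[Int])(vals)
      .map(v => (id, Option.empty[String], v))
  }
  .toDF("id", "operation", "value")

Since the max is taken over each group's actual values rather than hard-coded, this should reproduce the expected output from Update 1 (e.g. 2 for id=1 and 3 for id=2).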
