How to iterate grouped rows to produce multiple rows in spark structured streaming?

I have an input data set like this:

id     operation          value
1      null                1
1      discard             0
2      null                1
2      null                2
2      max                 0
3      null                1
3      null                1
3      list                0

I want to group the input and produce rows according to the "operation" column.

For group 1, operation="discard", so the output is null (nothing),

for group 2, operation="max", so the output is:

2      null                2

for group 3, operation="list", so the output is:

3      null                1
3      null                1

So the final output is:

id     operation          value
2      null                2
3      null                1
3      null                1

Is there a solution for this?

I know there is a similar question, how-to-iterate-grouped-data-in-spark, but the differences compared to that one are:

    1. I want to produce more than one row for each grouped data set. Is that possible, and how?
    2. I want my logic to be easily extended so that more operations can be added in the future. So are user-defined aggregate functions (aka UDAFs) the only possible solution?

Update 1:

Thanks stack0114106; here are more details following his answer. For example, for id=1, operation="max", I want to iterate over all the items with id=1 and find the max value, rather than assign a hard-coded value. That's why I want to iterate the rows in each group. Below is an updated example:

The input:

scala> val df = Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]

scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|0  |null     |1    |
|0  |discard  |0    |
|1  |null     |1    |
|1  |null     |2    |
|1  |max      |0    |
|2  |null     |1    |
|2  |null     |3    |
|2  |max      |0    |
|3  |null     |1    |
|3  |null     |1    |
|3  |list     |0    |
+---+---------+-----+

The expected output:

+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1  |null     |2    |
|2  |null     |3    |
|3  |null     |1    |
|3  |null     |1    |
+---+---------+-----+

Group everything, collecting the values, then write the logic for each operation:

import org.apache.spark.sql.functions._

// collect the operation name and all values for each id
val grouped = df.groupBy($"id").agg(max($"operation").as("op"), collect_list($"value").as("vals"))
// "max": explode the collected values and keep the maximum per id
val maxs = grouped.filter($"op" === "max").withColumn("val", explode($"vals")).groupBy($"id").agg(max("val").as("value"))
// "list": explode the collected values and keep the non-zero ones (0 marks the operation row)
val lists = grouped.filter($"op" === "list").withColumn("value", explode($"vals")).filter($"value" =!= 0).select($"id", $"value")
// we don't collect the "discard" groups,
// and we can add additional subsets for new "operations"
val result = maxs.union(lists)
// if you need the null "operation" column back, add it with withColumn
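
Following the last comment above, a minimal sketch of what that could look like (my assumption, not part of the original answer; the name finalResult is only for illustration):

// assumed follow-up: add "operation" back as a typed null string
// and restore the (id, operation, value) column order from the question;
// lit comes from the functions._ import above
val finalResult = result
  .withColumn("operation", lit(null).cast("string"))
  .select("id", "operation", "value")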

You can use the flatMap operation on the dataframe and generate the required rows based on the conditions you mentioned. Check this out:

scala> val df = Seq((1,null,1),(1,"discard",0),(2,null,1),(2,null,2),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]

scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1  |null     |1    |
|1  |discard  |0    |
|2  |null     |1    |
|2  |null     |2    |
|2  |max      |0    |
|3  |null     |1    |
|3  |null     |1    |
|3  |list     |0    |
+---+---------+-----+


scala> df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0 until s._1).map( i => (r.getInt(0),null,s._2) ) }).show(false)
+---+----+---+
|_1 |_2  |_3 |
+---+----+---+
|2  |null|2  |
|3  |null|1  |
|3  |null|1  |
+---+----+---+

Spark assigns _1, _2, etc., so you can map them to the actual column names by assigning them as below:

scala> val df2 = df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0 until s._1).map( i => (r.getInt(0),null,s._2) ) }).toDF("id","operation","value")
df2: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]

scala> df2.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|2  |null     |2    |
|3  |null     |1    |
|3  |null     |1    |
+---+---------+-----+


scala>
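
Note that the operation column is typed as NullType above (operation: null in df2's schema). If a proper nullable string column is wanted, one possible variation (an assumption on my part, not from the original answer; df2b is only an illustrative name) is to emit Option.empty[String] in the tuples:

// variation of the flatMap above (assumed): Option.empty[String] should make
// "operation" a nullable string column instead of a NullType column
val df2b = df.filter("operation is not null").flatMap { r =>
  val counts = r.getString(1) match {
    case "discard" => (0, 0)   // emit no rows
    case "max"     => (1, 2)   // one row with the hard-coded value 2
    case "list"    => (2, 1)   // two rows with the value 1
  }
  (0 until counts._1).map(_ => (r.getInt(0), Option.empty[String], counts._2))
}.toDF("id", "operation", "value")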

EDIT1:

Since you need the max(value) for each id, you can use a window function to get the max value in a new column, then use the same technique to get the results. Check this out:

scala> val df =   Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]

scala> df.createOrReplaceTempView("michael")

scala> val df2 = spark.sql(""" select *, max(value) over(partition by id) mx from michael """)
df2: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 2 more fields]

scala> df2.show(false)
+---+---------+-----+---+
|id |operation|value|mx |
+---+---------+-----+---+
|1  |null     |1    |2  |
|1  |null     |2    |2  |
|1  |max      |0    |2  |
|3  |null     |1    |1  |
|3  |null     |1    |1  |
|3  |list     |0    |1  |
|2  |null     |1    |3  |
|2  |null     |3    |3  |
|2  |max      |0    |3  |
|0  |null     |1    |1  |
|0  |discard  |0    |1  |
+---+---------+-----+---+


scala> val df3 = df2.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => 0 case "max" => 1 case "list" => 2 } ; (0 until s).map( i => (r.getInt(0),null,r.getInt(3) )) }).toDF("id","operation","value")
df3: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]


scala> df3.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1  |null     |2    |
|3  |null     |1    |
|3  |null     |1    |
|2  |null     |3    |
+---+---------+-----+


scala>
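
On the second point in the question (keeping the logic easy to extend without a UDAF), one possible sketch, assuming the groupBy/collect_list approach from the first answer: keep a plain Map from operation name to a function over each group's collected values, so adding a new operation only means adding one entry. The handlers map and the names here are hypothetical, not from either answer:

import org.apache.spark.sql.functions.{collect_list, max}
import spark.implicits._   // already in scope inside spark-shell

// hypothetical registry: operation name -> what to do with the group's collected values
val handlers: Map[String, Seq[Int] => Seq[Int]] = Map(
  "discard" -> (_ => Seq.empty[Int]),       // drop the whole group
  "max"     -> (vs => Seq(vs.max)),         // keep only the maximum collected value
  "list"    -> (vs => vs.filter(_ != 0))    // keep every value; 0 marks the operation row itself
)

val extensible = df.groupBy($"id")
  .agg(max($"operation").as("op"), collect_list($"value").as("vals"))
  .flatMap { r =>
    val id   = r.getInt(0)
    val op   = r.getString(1)
    val vals = r.getSeq[Int](2)
    handlers.getOrElse(op, (_: Seq[Int]) => Seq.empty[Int])(vals)
      .map(v => (id, Option.empty[String], v))
  }
  .toDF("id", "operation", "value")

Since the max is taken over each group's actual values rather than hard-coded, this should reproduce the expected output from Update 1 (e.g. 2 for id=1 and 3 for id=2).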
