Scala: collect_list() over Window while keeping null values
I have a data frame like the one below:
+----+----+----+
|colA|colB|colC|
+----+----+----+
|1   |1   |23  |
|1   |2   |63  |
|1   |3   |null|
|1   |4   |32  |
|2   |2   |56  |
+----+----+----+
I apply the instructions below so that a sequence of the values in column C is created:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD",
collect_list("colC").over(Window.partitionBy("colA").orderBy("colB")))
The result looks like this: column D is created and contains the values of column C as a sequence, but the null value has been removed:
+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[23, 63]    |
|1   |3   |null|[23, 63]    |
|1   |4   |32  |[23, 63, 32]|
|2   |2   |56  |[56]        |
+----+----+----+------------+
However, I would like to keep the null values in the new column and get the result below:
+----+----+----+------------------+
|colA|colB|colC|colD              |
+----+----+----+------------------+
|1   |1   |23  |[23]              |
|1   |2   |63  |[23, 63]          |
|1   |3   |null|[23, 63, null]    |
|1   |4   |32  |[23, 63, null, 32]|
|2   |2   |56  |[56]              |
+----+----+----+------------------+
As you can see, I still have the null values in the result. Do you know how I can do this?
Since collect_list automatically removes all nulls, one approach would be to temporarily replace null with a designated number, say Int.MinValue, before applying the method, and use a UDF to restore those numbers back to null afterward:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._  // for toDF and $ (pre-imported in spark-shell)

val df = Seq(
  (Some(1), Some(1), Some(23)),
  (Some(1), Some(2), Some(63)),
  (Some(1), Some(3), None),
  (Some(1), Some(4), Some(32)),
  (Some(2), Some(2), Some(56))
).toDF("colA", "colB", "colC")

// UDF that maps the placeholder number back to null
def replaceWithNull(n: Int) = udf( (arr: Seq[Int]) =>
  arr.map( i => if (i != n) Some(i) else None )
)

df.withColumn( "colD", replaceWithNull(Int.MinValue)(
    collect_list(when($"colC".isNull, Int.MinValue).otherwise($"colC")).
      over(Window.partitionBy("colA").orderBy("colB"))
  )
).show
// +----+----+----+------------------+
// |colA|colB|colC|              colD|
// +----+----+----+------------------+
// |   1|   1|  23|              [23]|
// |   1|   2|  63|          [23, 63]|
// |   1|   3|null|    [23, 63, null]|
// |   1|   4|  32|[23, 63, null, 32]|
// |   2|   2|  56|              [56]|
// +----+----+----+------------------+
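One caveat: this assumes Int.MinValue never occurs as a genuine value in colC. On Spark 2.4+, the restore step can also be done without a UDF via the built-in transform higher-order function; a minimal sketch of that variant, reusing the df and window from above:

// Same sentinel trick, but mapping Int.MinValue back to null with the
// SQL transform higher-order function (Spark 2.4+) instead of a UDF.
df.withColumn("colD",
    collect_list(when($"colC".isNull, Int.MinValue).otherwise($"colC")).
      over(Window.partitionBy("colA").orderBy("colB")))
  .withColumn("colD", expr(s"transform(colD, x -> IF(x = ${Int.MinValue}, NULL, x))"))
  .show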
As LeoC mentioned, collect_list will drop null values. There seems to be a workaround for this behavior: wrapping each scalar into an array before collect_list will produce [[23], [63], [], [32]], and when you then flatten that, you get [23, 63,, 32]. Those missing values in the arrays are nulls.
The flatten built-in SQL function was, I believe, introduced in Spark 2.4 (collect_list itself has been available much longer). I didn't look into the implementation to verify this is the expected behavior, so I don't know how reliable this solution is.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._  // for toDF (pre-imported in spark-shell)

val df = Seq(
  (Some(1), Some(1), Some(23)),
  (Some(1), Some(2), Some(63)),
  (Some(1), Some(3), None),
  (Some(1), Some(4), Some(32)),
  (Some(2), Some(2), Some(56))
).toDF("colA", "colB", "colC")

// Wrap each value in a one-element array, collect those arrays, then flatten;
// nulls survive because array("colC") yields [null] rather than dropping the value.
val newDf = df.withColumn("colD", flatten(collect_list(array("colC"))
  .over(Window.partitionBy("colA").orderBy("colB"))))

newDf.show()
+----+----+----+-------------+
|colA|colB|colC|         colD|
+----+----+----+-------------+
|   1|   1|  23|         [23]|
|   1|   2|  63|     [23, 63]|
|   1|   3|null|    [23, 63,]|
|   1|   4|  32|[23, 63,, 32]|
|   2|   2|  56|         [56]|
+----+----+----+-------------+
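Since show prints nulls inside arrays as empty slots, it may be worth confirming that they are genuine nulls rather than a formatting quirk. A quick check using the built-in element_at (also Spark 2.4+), against the row that should hold [23, 63, null, 32]:

// element_at is 1-based, so the third element of colD for the row
// colA=1, colB=4 should come back as null.
newDf
  .where($"colA" === 1 && $"colB" === 4)
  .select(element_at($"colD", 3).as("third"))
  .show()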