
Spark DataFrame column group and collect_list not according to orderBy

I have a Spark 2.2.0 DataFrame dtfBase1, as shown below. BAQ is an ID, AAA is a date, and AAG is a numeric value of type double.

[image: contents of dtfBase1]

I would like to convert it into the following. The values of AAG should be ordered according to the order of AAA.

[image: desired result]

I used the following code:

val dtfBase2=dtfBase1.orderBy($"BAQ",$"AAA").groupBy("BAQ").agg(collect_list("AAG") as "arrAAG")

But in dtfBase2 the values of AAG appear in a seemingly random order instead of following AAA's order in the original DataFrame. How can I order the elements of arrAAG according to AAA?

[image]

Assuming you're on Spark 2.4+, you can use array_sort and array_join:

val dtfBase2 = dtfBase1.groupBy("BAQ")
  .agg(array_sort(collect_list(struct('aaa, 'aag))) as "arrAAG")
  .select('baq, array_join($"arrAAG.aag", ",")  as "arrAAG")

It creates a struct of AAA and AAG, collects those structs in the aggregate, and then sorts the resulting array. Because AAA is the first field of the struct, the sort follows AAA. We then concatenate using array_join, but only on the AAG field of the struct.
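To make the sort-then-project step concrete, here is a rough model of it using plain Scala collections rather than Spark. The `Row` case class, the `collectOrdered` helper, and the sample rows are all made up for illustration, and AAA is assumed to be an ISO-formatted date string so that string order matches date order. Unlike the Spark struct sort, this sketch sorts on AAA only and does not use AAG to break ties.

```scala
object CollectOrderedModel {
  // Hypothetical row type mirroring the question's columns.
  // Assumption: AAA is an ISO date string, so lexicographic order is date order.
  final case class Row(baq: String, aaa: String, aag: Double)

  // Group by BAQ, sort each group by AAA, keep only AAG, and join with commas.
  def collectOrdered(rows: Seq[Row]): Map[String, String] =
    rows
      .groupBy(_.baq)
      .map { case (baq, group) =>
        baq -> group.sortBy(_.aaa).map(_.aag).mkString(",")
      }

  def main(args: Array[String]): Unit = {
    // Sample rows deliberately out of date order.
    val rows = Seq(
      Row("id1", "2017-01-03", 3.0),
      Row("id1", "2017-01-01", 1.0),
      Row("id2", "2017-01-02", 20.0),
      Row("id1", "2017-01-02", 2.0),
      Row("id2", "2017-01-01", 10.0)
    )
    collectOrdered(rows).toSeq.sortBy(_._1).foreach { case (baq, joined) =>
      println(s"$baq -> $joined")
    }
    // prints:
    // id1 -> 1.0,2.0,3.0
    // id2 -> 10.0,20.0
  }
}
```

The point of the model is that the ordering is established inside each group after grouping, so it does not depend on the order in which rows arrive.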

Since you're on Spark 2.2, this version should work:

val dtfBase2 = dtfBase1.groupBy("BAQ")
  .agg(sort_array(collect_list(struct('aaa, 'aag))) as "arrAAG")
  .select('baq, concat_ws(",", $"arrAAG.aag") as "arrAAG")
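For trying the approach locally, a self-contained sketch along the following lines may help. It is not from the original post: the `OrderedCollect` object, the `orderedArrays` helper, the local SparkSession settings, and the sample rows are assumptions for illustration, and AAA is assumed to be stored as ISO-formatted strings. It also keeps arrAAG as an array (by selecting `sorted.AAG`) instead of joining it into a string, in case an array is what is wanted.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, sort_array, struct}

object OrderedCollect {
  // Collect (AAA, AAG) structs per BAQ, sort them (AAA is compared first),
  // then project the AAG field, which yields an array of doubles.
  def orderedArrays(df: DataFrame): DataFrame =
    df.groupBy("BAQ")
      .agg(sort_array(collect_list(struct(col("AAA"), col("AAG")))).as("sorted"))
      .select(col("BAQ"), col("sorted.AAG").as("arrAAG"))

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ordered-collect-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows, deliberately out of date order.
    val dtfBase1 = Seq(
      ("id1", "2017-01-03", 3.0),
      ("id1", "2017-01-01", 1.0),
      ("id2", "2017-01-02", 20.0),
      ("id1", "2017-01-02", 2.0),
      ("id2", "2017-01-01", 10.0)
    ).toDF("BAQ", "AAA", "AAG")

    orderedArrays(dtfBase1).orderBy("BAQ").show(false)

    spark.stop()
  }
}
```

With these sample rows, arrAAG for id1 should come out in AAA order (1.0, 2.0, 3.0) regardless of how the input rows were ordered or partitioned.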

I did this and it worked out. Somehow, by caching dtfBase1 with the desired orderBy, the order was remembered somewhere and got passed on to the next step. Feel free to suggest a way of doing it in one line.

val dtfBase1=....orderBy($"BAQ",$"AAA").cache()
val dtfBase2=dtfBase1.orderBy($"BAQ",$"AAA").groupBy("BAQ").agg(collect_list("AAG") as "arrAAG")
