Spark SQL equivalent of LISTAGG() or GROUP_CONCAT within a group

Question

I need to implement a function similar to Redshift's listagg() within a group (ordered by x_column), but it has to be in Spark SQL, as documented here: https://spark.apache.org/docs/2.4.0/api/sql/

There is a similar question, but its answer is not in SQL.

My query in Redshift SQL would be:

select KEY,
listagg(CODE, '-') within group (order by DATE) as CODE
from demo_table
group by KEY

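At this point the ORDER BY clause is not important; simply aggregating all the values with GROUP BY would be enough. I have already tried concat_ws, but it did not work as expected.

Running it in PySpark did not work for me.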

KEY	CODE	DATE
66	PL	11/1/2016
66	PL	12/1/2016
67	JL	12/1/2016
67	JL	10/1/2016
67	PL	9/1/2016
67	PO	8/1/2016
67	JL	12/1/2016
68	PL	11/1/2016
68	JO	11/1/2016

Desired output

KEY	CODE
68	JO-PL
67	JL - JL - PL - PO - JL
68	PL-JO

Answer 1

Use array_join together with collect_list:

select 
 key, 
 array_join( -- concat the array
  collect_list(code), -- aggregate that collects the array of [code]
  ' - ' -- delimiter 
 )
from demo_table
group by KEY

Answer 2

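The query below works and also applies the ordering. collect_list runs as a window function ordered by the parsed date, so each row holds a running concatenation, and max(code) then keeps the complete one for each key. Please check.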

spark.sql("""select key, max(code) as code
             from ( select key,
                           -- pattern 'M/d/yyyy': uppercase M is month (lowercase m means minutes)
                           array_join(collect_list(code) over (partition by key order by to_date(date, 'M/d/yyyy')), '-') as code
                    from view )
             group by key""").show(100)
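Why max(code) returns the full list may not be obvious, so here is a plain-Python sketch of the idea using the question's rows for KEY 67. It is only an illustration under a simplifying assumption: it builds one running value per row, whereas Spark's default window frame gives rows with the same date an identical value. The names `rows_67`, `parse_date` and `running` are made up for this example.

```python
from datetime import datetime

# Rows for KEY 67 from the question's sample data: (CODE, DATE).
rows_67 = [
    ("JL", "12/1/2016"),
    ("JL", "10/1/2016"),
    ("PL", "9/1/2016"),
    ("PO", "8/1/2016"),
    ("JL", "12/1/2016"),
]


def parse_date(date_text):
    # Plays the role of to_date(date, 'M/d/yyyy') in the query.
    return datetime.strptime(date_text, "%m/%d/%Y")


# Order the codes by date, as the window's ORDER BY does.
ordered_codes = [code for code, _ in sorted(rows_67, key=lambda r: parse_date(r[1]))]

# Each row sees the concatenation of everything up to and including itself.
running = ["-".join(ordered_codes[: i + 1]) for i in range(len(ordered_codes))]

# Every running value is a prefix of the last one, and a proper prefix always
# compares smaller than the longer string, so max() picks the complete list.
full = max(running)
print(running)
print(full)
```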

Answer 3

Use ORDER BY together with COLLECT_LIST() in Spark SQL:


df = spark.createDataFrame([
(1, "a3"),
(1, "a2"),
(4, "c"),
(1, "b2"),
(2, "b10"),
(4, "a"),
(2, "a1"),
(1, "a0"),
(3, "c"),
(4, "d")], ("k", "v"))

df.createOrReplaceTempView('source_df')

spark.sql("""
    SELECT t.k
        ,array_join(collect_list(t.v),'-') AS result_list            
    FROM (
        SELECT k
            ,v
        FROM source_df
        ORDER BY k,v
        ) t
    GROUP BY t.k
""").show()

The output is:

+---+-----------+
|  k|result_list|
+---+-----------+
|  1|a0-a2-a3-b2|
|  3|          c|
|  2|     a1-b10|
|  4|      a-c-d|
+---+-----------+
