Combining multiple rows with different values in PySpark/Python

Question

I have a table like the following:

ID Date         Class     Level
1  2021/01/01    math      1
1  2021/01/01    english   1
1  2021/01/01    history   1

My current code is:

    grouped_df = df\
    .groupby('ID','Date')\
    .agg(collect_list('class').alias("class"),collect_list('level').alias("level"))\
    .withColumn("class", concat_ws(", ", "class"))\
    .withColumn("level", concat_ws(", ", "level"))

The code gives me output that looks like this:

    ID Date         Class                       Level
    1  2021/01/01    math, english, history     1, 1, 1

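I use concat_ws to combine the rows, but the class values do not come out in the order I want. Is there a way to sort them when calling concat_ws()? I want the combined class values sorted alphabetically, for example english, history, math. When I run concat_ws, however, the output can be math, english, history or history, math, english.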

Is there a way to make the output look like this:

  ID Date        Class                       Level
  1  2021/01/01  english,history,math        1

Answer 1

You can use collect_set to remove the duplicates:

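from pyspark.sql.functions import collect_list, collect_set, concat_ws

# collect_set keeps only the distinct level values
grouped_df = df\
    .groupby('ID','Date')\
    .agg(collect_list('class').alias("class"),collect_set('level').alias("level"))\
    .withColumn("class", concat_ws(", ", "class"))\
    .withColumn("level", concat_ws(", ", "level"))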

If there is always only one level, you could also consider grouping by level, e.g.

grouped_df = df\
    .groupby('ID','Date', 'level')\
    .agg(collect_list('class').alias("class"))\
    .withColumn("class", concat_ws(", ", "class"))

Edit: if you want to sort the array, you can use sort_array:

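from pyspark.sql.functions import collect_list, collect_set, concat_ws, sort_array

# sort_array puts the collected classes in ascending order before they are joined
grouped_df = df\
    .groupby('ID','Date')\
    .agg(sort_array(collect_list('class')).alias("class"),collect_set('level').alias("level"))\
    .withColumn("class", concat_ws(", ", "class"))\
    .withColumn("level", concat_ws(", ", "level"))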
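
To check what this aggregation should return without starting a Spark session, the same logic can be modelled in plain Python. This is only a sketch of the semantics, not PySpark code: the helper name `combine_rows` and the tuple layout of `rows` are assumptions made for this illustration.

```python
from collections import defaultdict


def combine_rows(rows):
    """Group (id, date, class, level) tuples by (id, date).

    Classes are sorted and joined, mirroring sort_array + concat_ws.
    Levels are de-duplicated, mirroring collect_set. They are sorted here
    only to make the result deterministic; collect_set itself does not
    guarantee any order.
    """
    classes = defaultdict(list)
    levels = defaultdict(set)
    for id_, date, class_, level in rows:
        classes[(id_, date)].append(class_)
        levels[(id_, date)].add(level)
    return {
        key: (
            ", ".join(sorted(classes[key])),
            ", ".join(str(v) for v in sorted(levels[key])),
        )
        for key in classes
    }


# Hypothetical sample data matching the table in the question.
rows = [
    (1, "2021/01/01", "math", 1),
    (1, "2021/01/01", "english", 1),
    (1, "2021/01/01", "history", 1),
]
result = combine_rows(rows)
print(result)
```

For the sample rows, the single group holds the classes in alphabetical order and one level value, which is the shape the question asks for.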

Answer 2

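To get the unique values of level, use collect_set. To order the class values, you cannot use array_sort with Spark 2.3, but you can apply collect_list over an ordered window to get a sorted list. This avoids a UDF, which usually performs poorly: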

from pyspark.sql import Window
from pyspark.sql import functions as F


w = Window.partitionBy("ID", "Date").orderBy("Class")

grouped_df = df.withColumn("Class", F.collect_list("Class").over(w)) \
    .withColumn("Level", F.collect_set("Level").over(w)) \
    .groupBy("ID", "Date") \
    .agg(
    F.concat_ws(",", F.max("Class")).alias("Class"),
    F.concat_ws(",", F.max("Level")).alias("Level")
)

grouped_df.show(truncate=False)

# +---+----------+--------------------+-----+
# |ID |Date      |Class               |Level|
# +---+----------+--------------------+-----+
# |1  |2021/01/01|english,history,math|1    |
# +---+----------+--------------------+-----+
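
The reason F.max returns the complete list is that an ordered window with no explicit frame runs from the start of the partition up to the current row, so collect_list yields a growing list on each row, and the largest of those lists is the full one. Below is a plain-Python sketch of that reasoning. It relies on Python comparing lists lexicographically, which is assumed here to match how Spark orders arrays, and the variable names are illustrative only.

```python
classes = ["math", "english", "history"]

# Rows of one (ID, Date) partition in window order, i.e. orderBy("Class").
ordered = sorted(classes)

# Running lists: what collect_list("Class").over(w) holds on each successive row.
running = [ordered[: i + 1] for i in range(len(ordered))]

# Lexicographic max: a list is greater than any of its own proper prefixes,
# so the maximum is the list from the last row of the partition.
full = max(running)
joined = ",".join(full)

print(running)
print(joined)
```

Because every running list is a prefix of the next one, the maximum is always the last and longest list, so the groupBy step only has to pick it and join it.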

Answer 1
1 accepted 2021-02-11 15:56:21

Answer 2
0 2021-02-11 17:12:06