
Spark DataFrame aggregate and groupby multiple columns while retaining order

I have the following data:

id | value1 | value2
---|--------|-------
 1 | A      | red
 1 | B      | red
 1 | C      | blue
 2 | A      | blue
 2 | B      | blue
 2 | C      | green
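
For reference, the sample data can be built like this (a sketch assuming an active SparkSession bound to the name spark):

df = spark.createDataFrame(
    [(1, "A", "red"), (1, "B", "red"), (1, "C", "blue"),
     (2, "A", "blue"), (2, "B", "blue"), (2, "C", "green")],
    ["id", "value1", "value2"])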

The result I need is:

id | values
---|-------------------------------
 1 | [[A,red], [B,red], [C,blue]]
 2 | [[A,blue], [B,blue], [C,green]]

My approach so far is to group by id, aggregate value1 and value2 into separate arrays, and then merge them together as described in "Combine PySpark DataFrame ArrayType fields into single ArrayType field":

import pyspark.sql.functions as F

df.groupBy("id").agg(F.collect_list("value1"), F.collect_list("value2"))
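
For illustration, a minimal sketch of that merge step (hypothetical, and assuming Spark 2.4+ for F.arrays_zip, which pairs the two arrays element-wise):

import pyspark.sql.functions as F

# Collect each column into its own list, then zip the lists
# element-wise into an array of (value1, value2) structs
merged = (df.groupBy("id")
            .agg(F.collect_list("value1").alias("v1"),
                 F.collect_list("value2").alias("v2"))
            .withColumn("values", F.arrays_zip("v1", "v2"))
            .drop("v1", "v2"))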

However, since order is not guaranteed by collect_list() (see here), how can I make sure value1 and value2 both end up matched to the correct values?

This could produce two lists in different orders, and the subsequent merge would then pair up the wrong values.

As @Raphael commented, you can first combine the value1 and value2 columns into a single struct-type column, and then collect_list:

import pyspark.sql.functions as F

# Pack value1 and value2 into one struct per row so they stay paired,
# then collect the structs for each id
(df.withColumn('values', F.struct(df.value1, df.value2))
   .groupBy('id')
   .agg(F.collect_list('values').alias('values'))).show()

+---+--------------------+
| id|              values|
+---+--------------------+
|  1|[[A,red], [B,red]...|
|  2|[[A,blue], [B,blu...|
+---+--------------------+
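
Note that the struct only guarantees that each value1 stays paired with its value2; the order of the structs within each collected list is still not deterministic. If a stable order is needed, one option (a sketch, not part of the original answer) is to sort the collected array; sort_array orders structs by their fields in declaration order, so here it sorts by value1 first:

import pyspark.sql.functions as F

(df.withColumn('values', F.struct(df.value1, df.value2))
   .groupBy('id')
   .agg(F.sort_array(F.collect_list('values')).alias('values'))
   .show(truncate=False))  # truncate=False prints the full arrays instead of eliding them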
