Aggregating a DataFrame in PySpark

Question

I am using Spark 1.6.2 with DataFrames.

I want to transform this DataFrame

+---------+-------------+-----+-------+-------+-------+-------+--------+
|ID       |           P |index|xinf   |xup    |yinf   |ysup   |     M  |
+---------+-------------+-----+-------+-------+-------+-------+--------+
|        0|10279.9003906|   13|    0.3|    0.5|    2.5|    3.0|540928.0|
|        2|12024.2998047|   13|    0.3|    0.5|    2.5|    3.0|541278.0|
|        0|10748.7001953|   13|    0.3|    0.5|    2.5|    3.0|541243.0|
|        1|      10988.5|   13|    0.3|    0.5|    2.5|    3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+

into

+---------+-------------+-----+-------+-------+-------+-------+-----------------+
|ID       |           P |index|xinf   |xup    |yinf   |ysup   |               M |
+---------+-------------+-----+-------+-------+-------+-------+-----------------+
|        0|10514.3002929|   13|    0.3|    0.5|    2.5|    3.0|540928.0,541243.0|
|        2|12024.2998047|   13|    0.3|    0.5|    2.5|    3.0|         541278.0|
|        1|      10988.5|   13|    0.3|    0.5|    2.5|    3.0|         540917.0|
+---------+-------------+-----+-------+-------+-------+-------+-----------------+

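So I want to group the rows by ID, compute the mean of the P values within each group, and concatenate the M values. However, I have not been able to do this with Spark's agg function.

Can you help me?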

Answer 1

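You can groupBy the ID column and then aggregate each column according to what you need: mean averages P, first keeps the value of the columns that are constant within a group, and collect_list gathers the M values of each group into a list.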

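from pyspark.sql.functions import first, collect_list, mean

# One output row per ID: average P, keep the first value of the columns
# that are constant within a group, and collect every M into a list.
# The aliases keep the original column names in the result.
df.groupBy("ID").agg(mean("P").alias("P"), first("index").alias("index"),
                     first("xinf").alias("xinf"), first("xup").alias("xup"),
                     first("yinf").alias("yinf"), first("ysup").alias("ysup"),
                     collect_list("M").alias("M"))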
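
To check the Spark result against the expected table, it can help to compute the same aggregation by hand on the four sample rows. The following is a minimal plain-Python sketch (no Spark involved) of what the groupBy/agg call is meant to produce. The names rows and aggregate_by_id are hypothetical and only used for this illustration, and the M values are joined with a comma to match the output shown in the question.

```python
from collections import OrderedDict

# The four sample rows from the question:
# (ID, P, index, xinf, xup, yinf, ysup, M)
rows = [
    (0, 10279.9003906, 13, 0.3, 0.5, 2.5, 3.0, 540928.0),
    (2, 12024.2998047, 13, 0.3, 0.5, 2.5, 3.0, 541278.0),
    (0, 10748.7001953, 13, 0.3, 0.5, 2.5, 3.0, 541243.0),
    (1, 10988.5, 13, 0.3, 0.5, 2.5, 3.0, 540917.0),
]


def aggregate_by_id(rows):
    """Group rows by ID: average P, keep the first constant columns, join M."""
    groups = OrderedDict()  # keeps IDs in the order they first appear
    for row in rows:
        groups.setdefault(row[0], []).append(row)

    result = []
    for key, members in groups.items():
        mean_p = sum(r[1] for r in members) / len(members)   # like mean("P")
        constants = members[0][2:7]                          # like first(...) per column
        joined_m = ",".join(str(r[7]) for r in members)      # like collect_list("M"), joined
        result.append((key, mean_p) + constants + (joined_m,))
    return result


for row in aggregate_by_id(rows):
    print(row)
```

For ID 0 this averages the two P values and joins the two M values, which is the row the question expects. In Spark itself, a comma-separated string would probably come from casting M to a string and wrapping collect_list in concat_ws(",", ...). I have not tested that on 1.6.2, and as far as I know collect_list needs a HiveContext in the 1.6 line.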

Solution 1
4, accepted, 2016-10-20 20:17:23
