
Having trouble retrieving max values in a PySpark dataframe

For each row in a PySpark dataframe, I calculate the average quantity over a 5-row window, partitioned over a group of columns:

from pyspark.sql import functions as F
from pyspark.sql import Window  # needed for the Window spec below

prep_df = ...
# Rolling window: the current row plus the next 4 rows within each group
window = Window.partitionBy([F.col(x) for x in group_list]).rowsBetween(Window.currentRow, Window.currentRow + 4)
consecutive_df = prep_df.withColumn('aveg', F.avg(prep_df['quantity']).over(window))

I am then trying to group by the same columns and select the maximum of the average values, like this:

grouped_consecutive_df = consecutive_df.groupBy(group_column_list).agg(F.max(consecutive_df['aveg']).alias('aveg'))

However, when I debug, I see that the calculated maximum values are wrong. For specific instances, the retrieved max numbers do not even appear in the 'aveg' column.

I'd like to ask whether I am taking a wrong approach or missing something trivial. Any comments are appreciated.

I could solve this with a workaround: before the aggregation, I mapped the max of the quantity averages onto another new column, then selected one of the rows in each group, as sketched below.
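A minimal sketch of that workaround (my reconstruction, not the asker's exact code): it assumes group_list holds the same grouping columns as above, uses F.max as a window function to broadcast each group's maximum onto its rows, and uses dropDuplicates to keep one row per group.

from pyspark.sql import functions as F
from pyspark.sql import Window

# Partition by the grouping columns only; max() needs no ordering
group_window = Window.partitionBy([F.col(x) for x in group_list])

workaround_df = (
    consecutive_df
    # Map each group's maximum rolling average onto every row of the group
    .withColumn('max_aveg', F.max(F.col('aveg')).over(group_window))
    # Keep one (arbitrary) row per group, as described above
    .dropDuplicates(group_list)
)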
