
Having trouble retrieving max values in a PySpark dataframe

For each row in a PySpark dataframe, I calculate the average quantity over a 5-row window, partitioned over a group of columns:

from pyspark.sql import functions as F
from pyspark.sql import Window  # needed for the Window spec below

prep_df = ...
# Rolling window: the current row plus the next 4 rows within each group
window = Window.partitionBy([F.col(x) for x in group_list]).rowsBetween(Window.currentRow, Window.currentRow + 4)
consecutive_df = prep_df.withColumn('aveg', F.avg(prep_df['quantity']).over(window))

I am then trying to group by the same columns and select the maximum of the average values, like this:

grouped_consecutive_df = consecutive_df.groupBy(group_column_list).agg(F.max(consecutive_df['aveg']).alias('aveg'))

However, when I debug, I see that the calculated maximum values are wrong. For specific instances, the retrieved max numbers do not even appear in the 'aveg' column.

I'd like to ask whether I am taking a wrong approach or missing something trivial. Any comments are appreciated.

I could solve this with a workaround: before the aggregation, I mapped the max of the quantity averages onto another new column, then selected one of the rows in each group, as sketched below.
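A minimal sketch of that workaround (my reconstruction, not the asker's exact code): it assumes group_list holds the same grouping columns as above, uses F.max as a window function to broadcast each group's maximum onto its rows, and uses dropDuplicates to keep one row per group.

from pyspark.sql import functions as F
from pyspark.sql import Window

# Partition by the grouping columns only; max() needs no ordering
group_window = Window.partitionBy([F.col(x) for x in group_list])

workaround_df = (
    consecutive_df
    # Map each group's maximum rolling average onto every row of the group
    .withColumn('max_aveg', F.max(F.col('aveg')).over(group_window))
    # Keep one (arbitrary) row per group, as described above
    .dropDuplicates(group_list)
)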
