
Is there a PySpark equivalent of Pandas TimeGrouper?

I have this code in Python Pandas, with a dataframe 'df' which contains the columns 'Connectivity_Tmstp', 'sensor_id' and 'duration_seconds':

import pandas as pd

# Index by timestamp, group into hourly buckets per sensor, and sum the durations
df.set_index('Connectivity_Tmstp', inplace=True)
grouper_hour = df.groupby([pd.Grouper(freq='H'), 'sensor_id'])
result_stop_hour = grouper_hour['duration_seconds'].sum()
kpi_df = result_stop_hour.to_frame()

This code lets me set the column 'Connectivity_Tmstp' as the index, then group by hour and by sensor_id. Finally I can sum the duration_seconds in each group and put the result in a new dataframe like this:

Connectivity_Tmstp       |   sensor_id                |   duration_seconds
2018-10-14 07:00:00      | 70b3d5e75e003fb7           |          60
                         | 70b3d5e75e004348           |          40
                         | 70b3d5e75e00435e           |          20
2018-11-02 07:00:00      | 70b3d5e75e0043b3           |          80
                         | 70b3d5e75e0043d7           |          10
                         | 70b3d5e75e0043da           |          60
2019-07-18 12:00:00      | 70b3d5e75e003fb8           |          40
                         | 70b3d5e75e00431c           |          10
                         | 70b3d5e75e0043c1           |          20
                         | 70b3d5e75e0043da           |          30 

Do you know how to do the same thing in PySpark?

Thanks for your answer.

Regards, Fab

Yes. You can use Window functions:

A good resource: Databricks - Introducing Window functions in Spark-SQL

If you have a granular time series and you want to resample it at an hourly frequency: PySpark: how to resample frequencies

from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Running sum of duration_seconds per sensor, ordered by timestamp
w = Window.partitionBy("sensor_id").orderBy("Connectivity_Tmstp")

df = df.withColumn('sum', F.sum(F.col('duration_seconds')).over(w))
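
If the goal is a closer analogue of pd.Grouper(freq='H') — one summed row per hour and per sensor rather than a running total — a minimal sketch using F.window, assuming df holds the same three columns as in the question:

import pyspark.sql.functions as F

# Bucket rows into 1-hour windows per sensor and sum the durations
kpi_df = (
    df.groupBy(F.window("Connectivity_Tmstp", "1 hour").alias("hour"), "sensor_id")
      .agg(F.sum("duration_seconds").alias("duration_seconds"))
      .withColumn("Connectivity_Tmstp", F.col("hour.start"))
      .drop("hour")
)

Unlike the window-function version above, this collapses each hour/sensor pair into a single row, which matches the shape of the pandas output shown in the question.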
