
Is there a PySpark equivalent of Pandas TimeGrouper?

I have this code in Python Pandas, with a dataframe 'df' which contains the columns 'Connectivity_Tmstp', 'sensor_id' and 'duration_seconds':

import pandas as pd

# Index by timestamp, group into hourly buckets per sensor, and sum the durations
df.set_index('Connectivity_Tmstp', inplace=True)
grouper_hour = df.groupby([pd.Grouper(freq='H'), 'sensor_id'])
result_stop_hour = grouper_hour['duration_seconds'].sum()
kpi_df = result_stop_hour.to_frame()

This code lets me set the column 'Connectivity_Tmstp' as the index, then group by hour and by sensor_id. Finally I can sum the duration_seconds in each group and put the result in a new dataframe like this:

Connectivity_Tmstp       |   sensor_id                |   duration_seconds
2018-10-14 07:00:00      | 70b3d5e75e003fb7           |          60
                         | 70b3d5e75e004348           |          40
                         | 70b3d5e75e00435e           |          20
2018-11-02 07:00:00      | 70b3d5e75e0043b3           |          80
                         | 70b3d5e75e0043d7           |          10
                         | 70b3d5e75e0043da           |          60
2019-07-18 12:00:00      | 70b3d5e75e003fb8           |          40
                         | 70b3d5e75e00431c           |          10
                         | 70b3d5e75e0043c1           |          20
                         | 70b3d5e75e0043da           |          30 

Do you know how to do the same thing in PySpark?

Thanks for your answer.

Regards, Fab

Yes. You can use Window functions:

A good resource: Databricks - Introducing Window functions in Spark-SQL

If you have a granular time series and you want to resample it at an hourly frequency: PySpark: how to resample frequencies

from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Running sum of duration_seconds per sensor, ordered by timestamp
w = Window.partitionBy("sensor_id").orderBy("Connectivity_Tmstp")

df = df.withColumn('sum', F.sum(F.col('duration_seconds')).over(w))
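
If the goal is a closer analogue of pd.Grouper(freq='H') — one summed row per hour and per sensor rather than a running total — a minimal sketch using F.window, assuming df holds the same three columns as in the question:

import pyspark.sql.functions as F

# Bucket rows into 1-hour windows per sensor and sum the durations
kpi_df = (
    df.groupBy(F.window("Connectivity_Tmstp", "1 hour").alias("hour"), "sensor_id")
      .agg(F.sum("duration_seconds").alias("duration_seconds"))
      .withColumn("Connectivity_Tmstp", F.col("hour.start"))
      .drop("hour")
)

Unlike the window-function version above, this collapses each hour/sensor pair into a single row, which matches the shape of the pandas output shown in the question.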
