Is there a PySpark equivalent of Pandas TimeGrouper?
I have this code in Python Pandas, with a dataframe 'df' which contains the columns 'Connectivity_Tmstp', 'sensor_id' and 'duration_seconds':
import pandas as pd  # df has columns 'Connectivity_Tmstp', 'sensor_id', 'duration_seconds'

df.set_index('Connectivity_Tmstp', inplace=True)
grouper_hour = df.groupby([pd.Grouper(freq='H'), 'sensor_id'])
result_stop_hour = grouper_hour['duration_seconds'].sum()
kpi_df = result_stop_hour.to_frame()
This code allows me to put the column 'Connectivity_Tmstp' in the index, then group by hour and by sensor_id. Finally, I can sum the durations in each group and put the result in a new dataframe like this:
Connectivity_Tmstp | sensor_id | duration_seconds
2018-10-14 07:00:00 | 70b3d5e75e003fb7 | 60
| 70b3d5e75e004348 | 40
| 70b3d5e75e00435e | 20
2018-11-02 07:00:00 | 70b3d5e75e0043b3 | 80
| 70b3d5e75e0043d7 | 10
| 70b3d5e75e0043da | 60
2019-07-18 12:00:00 | 70b3d5e75e003fb8 | 40
| 70b3d5e75e00431c | 10
| 70b3d5e75e0043c1 | 20
| 70b3d5e75e0043da | 30
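For reference, the pipeline above can be reproduced end-to-end with a tiny synthetic dataframe (the timestamps and values below are made up for illustration, reusing the column names from the question):

```python
import pandas as pd

# Synthetic data: two readings for one sensor in the same hour, one for another
df = pd.DataFrame({
    "Connectivity_Tmstp": pd.to_datetime([
        "2018-10-14 07:10:00", "2018-10-14 07:50:00", "2018-10-14 07:30:00",
    ]),
    "sensor_id": ["70b3d5e75e003fb7", "70b3d5e75e003fb7", "70b3d5e75e004348"],
    "duration_seconds": [40, 20, 40],
})

# Same steps as in the question: index the timestamp, group by hour + sensor, sum
df.set_index("Connectivity_Tmstp", inplace=True)
grouper_hour = df.groupby([pd.Grouper(freq="H"), "sensor_id"])
result_stop_hour = grouper_hour["duration_seconds"].sum()
kpi_df = result_stop_hour.to_frame()
```

The result is a dataframe with a (hour, sensor_id) MultiIndex, matching the table above: the two readings for sensor 70b3d5e75e003fb7 collapse into a single 07:00:00 row summing to 60.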
Do you know how to do the same thing in PySpark?
Thanks for your answer.
Regards, Fab
Yes. You can use Window functions:
A good resource: Databricks - Introducing Window functions in Spark-SQL
If you have a granular timeseries and you want to resample it to an hourly frequency: PySpark: how to resample frequencies
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Running total of duration_seconds per sensor, ordered by timestamp
w = Window().partitionBy("sensor_id").orderBy("Connectivity_Tmstp")
df = df.withColumn('sum', F.sum(F.col('duration_seconds')).over(w))