Problem grouping by day with the `window` function in PySpark

Question

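I have a dataset that I need to resample. To do that, I need to group it by day and compute the median value for each sensor. I am using the `window` function, but it returns only a single row.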

This is the dataset:

+--------+-------------+-------------------+------+------------------+
|Variable|  Sensor Name|          Timestamp| Units|             Value|
+--------+-------------+-------------------+------+------------------+
|     NO2|aq_monitor914|2018-10-07 23:15:00|ugm -3|0.9945200000000001|
|     NO2|aq_monitor914|2018-10-07 23:30:00|ugm -3|1.1449200000000002|
|     NO2|aq_monitor914|2018-10-07 23:45:00|ugm -3|           1.13176|
|     NO2|aq_monitor914|2018-10-08 00:00:00|ugm -3|            0.9212|
|     NO2|aq_monitor914|2018-10-08 00:15:00|ugm -3|           1.39872|
|     NO2|aq_monitor914|2018-10-08 00:30:00|ugm -3|           1.51528|
|     NO2|aq_monitor914|2018-10-08 00:45:00|ugm -3|           1.61116|
|     NO2|aq_monitor914|2018-10-08 01:00:00|ugm -3|           1.59612|
|     NO2|aq_monitor914|2018-10-08 01:15:00|ugm -3|           1.12612|
|     NO2|aq_monitor914|2018-10-08 01:30:00|ugm -3|           1.04528|
+--------+-------------+-------------------+------+------------------+

I need to resample it by day and compute the median of the `Value` column for each day. I am using the following code to do that:

import pyspark.sql.functions as psf

magic_percentile = psf.expr('percentile_approx(Value, 0.5)')  # approximate median of the 'Value' column

data = data.groupby('Variable', 'Sensor Name', psf.window('Timestamp', '1 day')).agg(magic_percentile.alias('Value'))

The problem is that this returns only the following DataFrame:

+--------+-------------+--------------------+-------+
|Variable|  Sensor Name|              window|  Value|
+--------+-------------+--------------------+-------+
|     NO2|aq_monitor914|[2018-10-07 21:00...|1.13176|
+--------+-------------+--------------------+-------+

The `window` column in detail:

window=Row(start=datetime.datetime(2018, 10, 7, 21, 0), end=datetime.datetime(2018, 10, 8, 21, 0))

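As I understand `window`, it should create a one-day window for each timestamp, so that, for example, 2018-10-07 23:15:00 becomes 2018-10-07, then group the rows by variable and sensor name and compute the median for that day. I am really confused about how to do this.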

Answer 1

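I believe you do not need `window` to achieve what you want. You would need it if, for example, you wanted to aggregate over the days preceding each given date. In your case, just parse the datetime column to a date and use that date in the `groupBy` statement. A working example is given below; hope this helps!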

import pyspark.sql.functions as psf

df = sqlContext.createDataFrame(
    [
     ('NO2','aq_monitor914','2018-10-07 23:15:00',0.9945200000000001),
     ('NO2','aq_monitor914','2018-10-07 23:30:00',1.1449200000000002),
     ('NO2','aq_monitor914','2018-10-07 23:45:00',1.13176),
     ('NO2','aq_monitor914','2018-10-08 00:00:00',0.9212),
     ('NO2','aq_monitor914','2018-10-08 00:15:00',1.39872),
     ('NO2','aq_monitor914','2018-10-08 00:30:00',1.51528)
    ],
    ("Variable","Sensor Name","Timestamp","Value")
)
# Parse the string column into a proper timestamp
df = df.withColumn('Timestamp', psf.to_timestamp("Timestamp", "yyyy-MM-dd HH:mm:ss"))
df.show()

# Approximate median of 'Value', grouped by variable, sensor, and calendar day
magic_percentile = psf.expr('percentile_approx(Value, 0.5)')
df_agg = df.groupBy('Variable', 'Sensor Name', psf.to_date('Timestamp').alias('Day')).agg(magic_percentile.alias('Value'))
df_agg.show()

Input:

+--------+-------------+-------------------+------------------+
|Variable|  Sensor Name|          Timestamp|             Value|
+--------+-------------+-------------------+------------------+
|     NO2|aq_monitor914|2018-10-07 23:15:00|0.9945200000000001|
|     NO2|aq_monitor914|2018-10-07 23:30:00|1.1449200000000002|
|     NO2|aq_monitor914|2018-10-07 23:45:00|           1.13176|
|     NO2|aq_monitor914|2018-10-08 00:00:00|            0.9212|
|     NO2|aq_monitor914|2018-10-08 00:15:00|           1.39872|
|     NO2|aq_monitor914|2018-10-08 00:30:00|           1.51528|
+--------+-------------+-------------------+------------------+

Output:

+--------+-------------+----------+-------+
|Variable|  Sensor Name|       Day|  Value|
+--------+-------------+----------+-------+
|     NO2|aq_monitor914|2018-10-07|1.13176|
|     NO2|aq_monitor914|2018-10-08|1.39872|
+--------+-------------+----------+-------+
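As a side note on why the original `window` call produced a single row: Spark's one-day tumbling windows are aligned to 1970-01-01 00:00:00 UTC rather than to local midnight. The plain-Python sketch below reproduces that arithmetic without Spark. The helper name `day_window` is hypothetical, and the UTC-3 session time zone is an assumption inferred from the 21:00 window start shown in the question, not something the question states.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
DAY = timedelta(days=1)


def day_window(local_ts, utc_offset_hours):
    """Return (start, end) of the epoch-aligned one-day window, as local wall-clock times.

    Hypothetical helper that mimics how a '1 day' tumbling window is bucketed;
    utc_offset_hours is the assumed session time zone offset.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    ts = datetime.strptime(local_ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    # Whole days elapsed since the epoch, counted in UTC.
    whole_days = (ts - EPOCH) // DAY
    start = (EPOCH + whole_days * DAY).astimezone(tz)
    # Drop the tzinfo so the result reads like the naive datetimes Spark displays.
    return start.replace(tzinfo=None), (start + DAY).replace(tzinfo=None)


# First and last timestamps from the question's dataset.
samples = ["2018-10-07 23:15:00", "2018-10-08 01:30:00"]

# Assumption: session time zone is UTC-3.
windows = {day_window(s, -3) for s in samples}
calendar_days = {s[:10] for s in samples}

print(windows)        # both timestamps land in the same window
print(calendar_days)  # but they belong to two different calendar days
```

Under that assumption, both timestamps fall into the window from 2018-10-07 21:00 to 2018-10-08 21:00, which matches the `window` value in the question, while grouping by calendar date separates them. If `window` is still preferred, its `startTime` parameter shifts the window boundaries; under the same assumption, an offset such as '3 hours' would probably align them with local midnight, but that is worth verifying against your own session time zone.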

Answer 1 (accepted, 2019-02-08 07:29:23)
