How to apply groupBy and an aggregate function to a specific window in a PySpark DataFrame?

Question

I would like to apply a groupBy and a subsequent agg function to a PySpark DataFrame, but only to a specific window. This is best illustrated by an example. Suppose that I have a dataset named df:

df.show()

    +-----+----------+----------+-------+
    |   ID| Timestamp| Condition|  Value|
    +-----+----------+----------+-------+
    |   z1|         1|         0|     50|
|-------------------------------------------|
|   |   z1|         2|         0|     51|   |
|   |   z1|         3|         0|     52|   |
|   |   z1|         4|         0|     51|   |
|   |   z1|         5|         1|     51|   |
|   |   z1|         6|         0|     49|   |
|   |   z1|         7|         0|     44|   |
|   |   z1|         8|         0|     46|   |
|-------------------------------------------|
    |   z1|         9|         0|     48|
    |   z1|        10|         0|     42|
    +-----+----------+----------+-------+

In particular, I would like to apply a window of ±3 rows around the row where column Condition == 1 (i.e., in this case, row 5). Within that window, as depicted in the above DataFrame, I would like to find the minimum value of column Value and the corresponding value of column Timestamp, thus obtaining:

+----------+----------+
| Min_value| Timestamp|
+----------+----------+
|        44|         7|
+----------+----------+

Does anyone know how this can be tackled?

Many thanks in advance

Marioanzas

Answer 1

You can use a window that spans the 3 preceding and 3 following rows, take the minimum of a (Value, Timestamp) struct over it, and then filter on the condition:

from pyspark.sql import functions as F, Window

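df2 = df.withColumn(
    'min',
    # Structs compare field by field, so the minimum struct holds the
    # smallest Value together with its Timestamp.
    F.min(
        F.struct('Value', 'Timestamp')
    ).over(Window.partitionBy('ID').orderBy('Timestamp').rowsBetween(-3, 3))
# Keep only the rows where Condition is 1 and expand the struct into columns.
).filter('Condition = 1').select('min.*')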

df2.show()
+-----+---------+
|Value|Timestamp|
+-----+---------+
|   44|        7|
+-----+---------+
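
To see why taking the minimum of a struct returns the matching Timestamp, here is a minimal plain-Python sketch of the same logic (an illustration only, not Spark code). It assumes a single ID, so no partitioning is needed; the helper name min_in_window and its parameters are made up for this example. Python tuples are ordered field by field, much like the struct in the answer.

```python
# Plain-Python illustration of the window logic; this is not Spark code.
# Each row is (ID, Timestamp, Condition, Value), copied from the question.
rows = [
    ("z1", 1, 0, 50),
    ("z1", 2, 0, 51),
    ("z1", 3, 0, 52),
    ("z1", 4, 0, 51),
    ("z1", 5, 1, 51),
    ("z1", 6, 0, 49),
    ("z1", 7, 0, 44),
    ("z1", 8, 0, 46),
    ("z1", 9, 0, 48),
    ("z1", 10, 0, 42),
]


def min_in_window(rows, before=3, after=3):
    """Hypothetical helper: for every row with Condition == 1, return the
    (min Value, its Timestamp) pair over the surrounding rows.

    Assumes all rows share one ID, so there is no partitioning step.
    """
    ordered = sorted(rows, key=lambda r: r[1])  # mirrors orderBy('Timestamp')
    results = []
    for i, (_id, _ts, condition, _value) in enumerate(ordered):
        if condition != 1:
            continue  # mirrors filter('Condition = 1')
        # Mirrors rowsBetween(-before, after); the slice is clipped at the edges.
        window = ordered[max(0, i - before): i + after + 1]
        # Tuples compare field by field: smallest Value first,
        # ties broken by the smaller Timestamp.
        results.append(min((value, ts) for _, ts, _, value in window))
    return results


result = min_in_window(rows)
print(result)  # [(44, 7)]
```

Putting Value first in the pair is what makes the minimum follow Value; swapping the order would instead return the earliest Timestamp in the window.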

Answer 1: accepted, 2021-02-10 08:27:59
