PySpark: get the first value of a column within each group

Question

I have a dataframe in pyspark that looks like this:

|Id1| id2  |row  |grp    |
|12 | 1234 |1    | 1     |
|23 | 1123 |2    | 1     |
|45 | 2343 |3    | 2     |
|65 | 2345 |1    | 2     |
|67 | 3456 |2    | 2     |

I need to retrieve the value of id2 corresponding to row = 1 and set every id2 within a grp to that value.
This should be the final result:

|Id1 | id2  |row |grp|
|12  |1234  |1   |1  |
|23  |1234  |2   |1  |
|45  |2345  |3   |2  |
|65  |2345  |1   |2  |
|67  |2345  |2   |2  |

I tried something like df.groupby('grp').sort('row').first('id2'), but apparently sort and orderby don't work with groupby in pyspark.

Any idea how to solve this?

Answer 1

Very similar to the answer that uses .rowsBetween, but without .rowsBetween.

You basically create a Window for each grp, then sort the rows by row and pick the first id2 for each grp.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('grp').orderBy('row')

df = df.withColumn('id2', F.first('id2').over(w))

df.show()

+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+
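
To make the window semantics concrete without a Spark session, here is a plain-Python sketch of what F.first('id2').over(w) computes on the sample data from the question: partition by grp, order each partition by row, and copy the first id2 onto every row. The helper name first_per_group is hypothetical, and this is only a model of the result, not of how Spark executes the query.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (Id1, id2, row, grp)
rows = [
    (12, 1234, 1, 1),
    (23, 1123, 2, 1),
    (45, 2343, 3, 2),
    (65, 2345, 1, 2),
    (67, 3456, 2, 2),
]


def first_per_group(rows):
    """Model of first('id2') over a window partitioned by grp, ordered by row."""
    # Sort by (grp, row) so each partition is contiguous and ordered by row.
    ordered = sorted(rows, key=itemgetter(3, 2))
    result = []
    for _, group in groupby(ordered, key=itemgetter(3)):
        group = list(group)
        first_id2 = group[0][1]  # id2 of the row with the smallest `row`
        # Overwrite id2 on every row of the partition with that value.
        result.extend((id1, first_id2, row, grp) for id1, _, row, grp in group)
    return result


for r in first_per_group(rows):
    print(r)
```

The printed tuples line up with the df.show() output above: every row in grp 1 carries 1234 and every row in grp 2 carries 2345.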

Answer 2

Try this:

from pyspark.sql import functions as F, Window as W


df.withColumn(
    "id2",
    F.first("id2").over(
        W.partitionBy("grp")
        .orderBy("row")
        .rowsBetween(W.unboundedPreceding, W.currentRow)
    ),
).show()

+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+
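
As for why this matches Answer 1: the frame for each row runs from the start of the partition up to the current row, so its first element is always the first row of the ordered partition. As far as I know, an ordered window without an explicit frame also defaults to a growing frame (unboundedPreceding to currentRow), which would explain why leaving .rowsBetween out gives the same result here. Below is a small plain-Python model of that growing frame for grp = 2; first_over_growing_frame is a hypothetical helper, not a Spark API.

```python
# One partition (grp = 2), already ordered by `row`: (Id1, id2, row)
partition = [(65, 2345, 1), (67, 3456, 2), (45, 2343, 3)]


def first_over_growing_frame(partition):
    """Model first('id2') over rowsBetween(unboundedPreceding, currentRow)."""
    values = []
    for i in range(len(partition)):
        frame = partition[: i + 1]  # rows from partition start up to current row
        values.append(frame[0][1])  # id2 of the first row in that frame
    return values


print(first_over_growing_frame(partition))
```

Every frame starts at the same row, so each row in the partition gets the same id2.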
