PySpark: get first value from a column for each group

I have a data frame in PySpark that looks like this:
|Id1| id2 |row|grp|
|---|-----|---|---|
|12 | 1234| 1 | 1 |
|23 | 1123| 2 | 1 |
|45 | 2343| 3 | 2 |
|65 | 2345| 1 | 2 |
|67 | 3456| 2 | 2 |
I need to retrieve the `id2` value corresponding to `row = 1` and update all `id2` values within each `grp` to that value. This should be the final result:
|Id1| id2 |row|grp|
|---|-----|---|---|
|12 | 1234| 1 | 1 |
|23 | 1234| 2 | 1 |
|45 | 2345| 3 | 2 |
|65 | 2345| 1 | 2 |
|67 | 2345| 2 | 2 |
I tried something like `df.groupby('grp').sort('row').first('id2')`, but apparently `sort` and `orderBy` don't work with `groupby` in PySpark. Any idea how to go about this?
Very similar to @Steven's answer, but without using `.rowsBetween`. You basically create a `Window` for each `grp`, sort the rows by `row`, and pick the first `id2` within each `grp`.
```python
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# first('id2') per partition, with rows ordered by 'row' within each 'grp'
w = Window.partitionBy('grp').orderBy('row')
df = df.withColumn('id2', F.first('id2').over(w))
df.show()
```

```
+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+
```
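To make the window semantics concrete, here is the same first-per-group logic emulated in plain Python on the sample data (no Spark required; the dict-based lookup is just an illustration of what `Window.partitionBy('grp').orderBy('row')` plus `F.first('id2')` computes, not Spark API):

```python
from operator import itemgetter

rows = [
    {"Id1": 12, "id2": 1234, "row": 1, "grp": 1},
    {"Id1": 23, "id2": 1123, "row": 2, "grp": 1},
    {"Id1": 45, "id2": 2343, "row": 3, "grp": 2},
    {"Id1": 65, "id2": 2345, "row": 1, "grp": 2},
    {"Id1": 67, "id2": 3456, "row": 2, "grp": 2},
]

# Order rows within each group: sort by (grp, row), like
# partitionBy('grp').orderBy('row') does inside each partition.
rows.sort(key=itemgetter("grp", "row"))

# Record the first id2 seen per grp (the row with the lowest 'row').
first_id2 = {}
for r in rows:
    first_id2.setdefault(r["grp"], r["id2"])

# Overwrite every row's id2 with its group's first value.
for r in rows:
    r["id2"] = first_id2[r["grp"]]
```

After this, every row in group 1 carries `1234` and every row in group 2 carries `2345`, matching the window-function output above.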
Try this:
```python
from pyspark.sql import functions as F, Window as W

df.withColumn(
    "id2",
    F.first("id2").over(
        W.partitionBy("grp")
        .orderBy("row")
        .rowsBetween(W.unboundedPreceding, W.currentRow)
    ),
).show()
```

```
+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+
```