
Pyspark get first value from a column for each group

I have a data frame in pyspark which looks like this:

```
|Id1|id2 |row|grp|
|12 |1234|1  |1  |
|23 |1123|2  |1  |
|45 |2343|3  |2  |
|65 |2345|1  |2  |
|67 |3456|2  |2  |
```
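
For reference, a minimal sketch that builds this sample data (column names and values taken from the table above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (12, 1234, 1, 1),
        (23, 1123, 2, 1),
        (45, 2343, 3, 2),
        (65, 2345, 1, 2),
        (67, 3456, 2, 2),
    ],
    ['Id1', 'id2', 'row', 'grp'],
)
```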

I need to retrieve the value of id2 corresponding to row = 1 and update all id2 values within a grp to that value. This should be the final result:

```
|Id1|id2 |row|grp|
|12 |1234|1  |1  |
|23 |1234|2  |1  |
|45 |2345|3  |2  |
|65 |2345|1  |2  |
|67 |2345|2  |2  |
```

I tried doing something like df.groupby('grp').sort('row').first('id2'), but apparently sort and orderBy don't work with groupby in pyspark.

Any idea how to go about this?

Very similar to @Steven's answer, but without using .rowsBetween.

You basically create a Window for each grp, then sort the rows by row and pick the first id2 for each grp.

```python
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Partition by group, order rows within each group by `row`
w = Window.partitionBy('grp').orderBy('row')

# Overwrite id2 with the first id2 of each ordered group
df = df.withColumn('id2', F.first('id2').over(w))

df.show()
```

```
+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+
```
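
As a side note, the same result can be reached without a window function, e.g. by filtering out the row == 1 records and joining them back per grp; a sketch under that assumption (the firsts and first_id2 names are mine):

```python
import pyspark.sql.functions as F

# id2 of the row == 1 record in each grp
firsts = (
    df.filter(F.col('row') == 1)
      .select('grp', F.col('id2').alias('first_id2'))
)

# Join it back so every row in a grp gets that value
result = (
    df.join(firsts, on='grp')
      .withColumn('id2', F.col('first_id2'))
      .drop('first_id2')
)
result.show()
```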

Try this:

```python
from pyspark.sql import functions as F, Window as W

df.withColumn(
    "id2",
    F.first("id2").over(
        W.partitionBy("grp")
        .orderBy("row")
        .rowsBetween(W.unboundedPreceding, W.currentRow)
    ),
).show()
```

```
+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+
```
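
Worth noting: when a window spec has an orderBy, Spark's default frame already runs from unboundedPreceding to the current row, so both answers compute the same value and the explicit .rowsBetween mainly documents that choice. A quick self-check (my own snippet, not from either answer):

```python
from pyspark.sql import functions as F, Window as W

w_default = W.partitionBy('grp').orderBy('row')
w_explicit = w_default.rowsBetween(W.unboundedPreceding, W.currentRow)

# Both frames start at unboundedPreceding, so first() sees the same
# leading row in each group and the two columns should always match.
(df.withColumn('a', F.first('id2').over(w_default))
   .withColumn('b', F.first('id2').over(w_explicit))
   .filter(F.col('a') != F.col('b'))
   .show())  # expect an empty result set
```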
