Dataframe chaining operations to update a column using values from another column with specific condition

I have a DataFrame with the following schema:

timestamp   uuid    row_id  row_num  col_id  col_num
7/28 10:30  abc123  aaa     1        zzz567  1
7/28 10:30  abc123  aaa     1        zzz568  2
7/28 10:30  abc123  aaa     1        zzz569  3
7/28 10:30  abc123  aaa     null     zzz570  4
7/28 10:30  abc123  aaa     null     zzz571  5
7/28 10:30  abc123  bbb     2        yyy567  1
7/28 10:30  abc123  bbb     2        yyy568  2
7/28 10:30  abc123  bbb     2        yyy569  3
7/28 10:30  abc123  bbb     null     yyy570  4

row_num is always null when col_num is greater than 3. In those cases, I want to impute the null row_num values with the first non-null row_num value for that row_id. So my output should look like this:

timestamp   uuid    row_id  row_num  col_id  col_num
7/28 10:30  abc123  aaa     1        zzz567  1
7/28 10:30  abc123  aaa     1        zzz568  2
7/28 10:30  abc123  aaa     1        zzz569  3
7/28 10:30  abc123  aaa     1        zzz570  4
7/28 10:30  abc123  aaa     1        zzz571  5
7/28 10:30  abc123  bbb     2        yyy567  1
7/28 10:30  abc123  bbb     2        yyy568  2
7/28 10:30  abc123  bbb     2        yyy569  3
7/28 10:30  abc123  bbb     2        yyy570  4

I'm trying to do this with the row_number and withColumn functions (I'm on Spark 2.2, so I can't use the nth_value function). I tried something like:

df \
    .withColumn('rn', F.row_number() \
        .over(Window.partitionBy('uuid', 'row_id') \
        .orderBy('timestamp', 'row_num'))) \
    .withColumn('imputed_row_num', F.when((F.col('row_id').isNotNull() &
        F.col('row_num').isNull()), df.filter(df.rn == 1).row_num) \
        .otherwise(F.col('row_num')))

But this throws an error saying rn is not defined. I'm aware this could be done with a join, but I wanted to check whether it's also possible to get the desired result with chained operations.

We can use last with true as its second parameter (which tells it to ignore nulls):

.withColumn("new", expr("last(row_num, true) over (partition by uuid,row_id order by timestamp)"))

This creates a new column called new that holds the last non-null value of row_num within each (uuid, row_id) partition.
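To make the semantics concrete, here is a rough plain-Python model of what that window expression computes. It assumes Spark's default frame for an ordered window (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), under which rows that tie on timestamp all fall inside each other's frame. The helper name last_non_null_over is made up for illustration, and tied rows are kept in input order, which Spark itself does not guarantee.

```python
from collections import defaultdict


def last_non_null_over(rows, partition_keys, order_key, value_key):
    """Hypothetical helper: model last(value, true) over an ordered window."""
    # Group row positions by partition key.
    partitions = defaultdict(list)
    for index, row in enumerate(rows):
        partitions[tuple(row[k] for k in partition_keys)].append(index)

    result = [None] * len(rows)
    for indices in partitions.values():
        # sorted() is stable, so tied rows keep their input order (an assumption).
        ordered = sorted(indices, key=lambda i: rows[i][order_key])
        for i in ordered:
            # Assumed default frame: all rows up to the current one, ties included.
            frame = [j for j in ordered if rows[j][order_key] <= rows[i][order_key]]
            values = [rows[j][value_key] for j in frame if rows[j][value_key] is not None]
            result[i] = values[-1] if values else None
    return result


# Sample data reduced to the columns the window expression touches.
row_nums = {"aaa": [1, 1, 1, None, None], "bbb": [2, 2, 2, None]}
rows = [
    {"timestamp": "7/28 10:30", "uuid": "abc123", "row_id": row_id, "row_num": n}
    for row_id, nums in row_nums.items()
    for n in nums
]
new = last_non_null_over(rows, ["uuid", "row_id"], "timestamp", "row_num")
```

Because every row in the sample shares the same timestamp, each frame spans the whole partition, which is why the trailing null rows also pick up a value.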

Next, we overwrite row_num with the value from new wherever row_num is null:

.withColumn("row_num",
    when(col("row_num").isNull, col("new")).otherwise(col("row_num"))
) 

Finally, we drop the helper column new:

.drop("new")

Final output (without specific ordering):

+----------+------+------+-------+------+-------+
| timestamp|  uuid|row_id|row_num|col_id|col_num|
+----------+------+------+-------+------+-------+
|7/28 10:30|abc123|   bbb|      2|yyy567|      1|
|7/28 10:30|abc123|   bbb|      2|yyy567|      2|
|7/28 10:30|abc123|   bbb|      2|yyy567|      3|
|7/28 10:30|abc123|   bbb|      2|yyy567|      4|
|7/28 10:30|abc123|   aaa|      1|zzz567|      1|
|7/28 10:30|abc123|   aaa|      1|zzz567|      2|
|7/28 10:30|abc123|   aaa|      1|zzz567|      3|
|7/28 10:30|abc123|   aaa|      1|zzz567|      4|
|7/28 10:30|abc123|   aaa|      1|zzz567|      5|
+----------+------+------+-------+------+-------+
