I have a dataframe with following schema
timestamp | uuid | row_id | row_num | col_id | col_num |
---|---|---|---|---|---|
7/28 10:30 | abc123 | aaa | 1 | zzz567 | 1 |
7/28 10:30 | abc123 | aaa | 1 | zzz568 | 2 |
7/28 10:30 | abc123 | aaa | 1 | zzz569 | 3 |
7/28 10:30 | abc123 | aaa | null | zzz570 | 4 |
7/28 10:30 | abc123 | aaa | null | zzz571 | 5 |
7/28 10:30 | abc123 | bbb | 2 | yyy567 | 1 |
7/28 10:30 | abc123 | bbb | 2 | yyy568 | 2 |
7/28 10:30 | abc123 | bbb | 2 | yyy569 | 3 |
7/28 10:30 | abc123 | bbb | null | yyy570 | 4 |
Now, `row_num` is always null when `col_num` goes above 3. In cases like these, I want to impute those null row numbers with the first non-null `row_num` value for that `row_id`. So my output should essentially look like
timestamp | uuid | row_id | row_num | col_id | col_num |
---|---|---|---|---|---|
7/28 10:30 | abc123 | aaa | 1 | zzz567 | 1 |
7/28 10:30 | abc123 | aaa | 1 | zzz568 | 2 |
7/28 10:30 | abc123 | aaa | 1 | zzz569 | 3 |
7/28 10:30 | abc123 | aaa | 1 | zzz570 | 4 |
7/28 10:30 | abc123 | aaa | 1 | zzz571 | 5 |
7/28 10:30 | abc123 | bbb | 2 | yyy567 | 1 |
7/28 10:30 | abc123 | bbb | 2 | yyy568 | 2 |
7/28 10:30 | abc123 | bbb | 2 | yyy569 | 3 |
7/28 10:30 | abc123 | bbb | 2 | yyy570 | 4 |
I'm trying to do this using the `row_number` and `withColumn` functions (I'm on Spark 2.2, so I can't use the `nth_value` function). I'm trying something like
df \
    .withColumn('rn', F.row_number()
        .over(Window.partitionBy('uuid', 'row_id')
              .orderBy('timestamp', 'row_num'))) \
    .withColumn('imputed_row_num',
        F.when(F.col('row_id').isNotNull() & F.col('row_num').isNull(),
               df.filter(df.rn == 1).row_num)
        .otherwise(F.col('row_num')))
But this throws an error saying `rn` is not defined. I'm aware this could be done through a join; however, I wanted to check whether it's also possible to get the desired results by chaining operations.
We can use `last` with `true` as the second parameter (which tells it to ignore nulls), so:

.withColumn("new", F.expr("last(row_num, true) over (partition by uuid, row_id order by timestamp)"))

will create a new column called `new` that holds the last non-null value from the `row_num` column.
Next, wherever the `row_num` column is null, we overwrite it with the value from the `new` column:

.withColumn("row_num",
    F.when(F.col("row_num").isNull(), F.col("new")).otherwise(F.col("row_num")))
Finally, we drop the `new` column:

.drop("new")
Final output (without specific ordering):
+----------+------+------+-------+------+-------+
| timestamp|  uuid|row_id|row_num|col_id|col_num|
+----------+------+------+-------+------+-------+
|7/28 10:30|abc123|   bbb|      2|yyy567|      1|
|7/28 10:30|abc123|   bbb|      2|yyy568|      2|
|7/28 10:30|abc123|   bbb|      2|yyy569|      3|
|7/28 10:30|abc123|   bbb|      2|yyy570|      4|
|7/28 10:30|abc123|   aaa|      1|zzz567|      1|
|7/28 10:30|abc123|   aaa|      1|zzz568|      2|
|7/28 10:30|abc123|   aaa|      1|zzz569|      3|
|7/28 10:30|abc123|   aaa|      1|zzz570|      4|
|7/28 10:30|abc123|   aaa|      1|zzz571|      5|
+----------+------+------+-------+------+-------+