I have a dataframe with following schema
timestamp | uuid | row_id | row_num | col_id | col_num |
---|---|---|---|---|---|
7/28 10:30 | abc123 | aaa | 1 | zzz567 | 1 |
7/28 10:30 | abc123 | aaa | 1 | zzz568 | 2 |
7/28 10:30 | abc123 | aaa | 1 | zzz569 | 3 |
7/28 10:30 | abc123 | aaa | null | zzz570 | 4 |
7/28 10:30 | abc123 | aaa | null | zzz571 | 5 |
7/28 10:30 | abc123 | bbb | 2 | yyy567 | 1 |
7/28 10:30 | abc123 | bbb | 2 | yyy568 | 2 |
7/28 10:30 | abc123 | bbb | 2 | yyy569 | 3 |
7/28 10:30 | abc123 | bbb | null | yyy570 | 4 |
Now, `row_num` is always null when `col_num` goes above 3. In cases like these, I want to impute those null row numbers with the first non-null `row_num` value for that `row_id`. So my output should essentially look like
timestamp | uuid | row_id | row_num | col_id | col_num |
---|---|---|---|---|---|
7/28 10:30 | abc123 | aaa | 1 | zzz567 | 1 |
7/28 10:30 | abc123 | aaa | 1 | zzz568 | 2 |
7/28 10:30 | abc123 | aaa | 1 | zzz569 | 3 |
7/28 10:30 | abc123 | aaa | 1 | zzz570 | 4 |
7/28 10:30 | abc123 | aaa | 1 | zzz571 | 5 |
7/28 10:30 | abc123 | bbb | 2 | yyy567 | 1 |
7/28 10:30 | abc123 | bbb | 2 | yyy568 | 2 |
7/28 10:30 | abc123 | bbb | 2 | yyy569 | 3 |
7/28 10:30 | abc123 | bbb | 2 | yyy570 | 4 |
I'm trying to do this using the `row_number` and `withColumn` functions (I'm on Spark 2.2, so I can't use the `nth_value` function). I'm trying something like
df \
    .withColumn('rn', F.row_number()
        .over(Window.partitionBy('uuid', 'row_id')
              .orderBy('timestamp', 'row_num'))) \
    .withColumn('imputed_row_num',
        F.when(F.col('row_id').isNotNull() & F.col('row_num').isNull(),
               df.filter(df.rn == 1).row_num)
        .otherwise(F.col('row_num')))
But this throws an error saying `rn` is not defined. I'm aware this could be done through a join; however, I wanted to check whether it's also possible to get the desired results by chaining operations.
We can use `last` with `true` as the second parameter (which tells it to ignore nulls), so:

.withColumn("new", F.expr("last(row_num, true) over (partition by uuid, row_id order by timestamp)"))

will create a new column called `new` that holds the last non-null value from the `row_num` column.
Next, wherever the `row_num` column is null, we overwrite it with the value from the `new` column:

.withColumn("row_num",
    F.when(F.col("row_num").isNull(), F.col("new")).otherwise(F.col("row_num")))
Finally, we drop the `new` column:

.drop("new")
Final output (without specific ordering):
+----------+------+------+-------+------+-------+
| timestamp|  uuid|row_id|row_num|col_id|col_num|
+----------+------+------+-------+------+-------+
|7/28 10:30|abc123|   bbb|      2|yyy567|      1|
|7/28 10:30|abc123|   bbb|      2|yyy568|      2|
|7/28 10:30|abc123|   bbb|      2|yyy569|      3|
|7/28 10:30|abc123|   bbb|      2|yyy570|      4|
|7/28 10:30|abc123|   aaa|      1|zzz567|      1|
|7/28 10:30|abc123|   aaa|      1|zzz568|      2|
|7/28 10:30|abc123|   aaa|      1|zzz569|      3|
|7/28 10:30|abc123|   aaa|      1|zzz570|      4|
|7/28 10:30|abc123|   aaa|      1|zzz571|      5|
+----------+------+------+-------+------+-------+