Replace a value in a row with a value in another row in Spark dataframe

Question

I have a dataframe like so:

id	startValue	endValue
1	null	11a
1	554	22b
2	null	33c
2	6743	44d

Assume there are always exactly two rows per id: one where startValue has a value and one where startValue is null. I'd like to replace each null startValue with startValue - 10, where startValue is taken from the row with the same id whose startValue is not null. Expected result:

id	startValue	endValue
1	544	11a
1	554	22b
2	6733	33c
2	6743	44d

Sample data frame:

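// Option[Int] makes startValue a nullable integer column; a bare null mixed
// with Int literals infers Any, which toDF cannot encode.
val df = Seq[(String, Option[Int], String)](
  ("1", None, "11a"),
  ("1", Some(554), "22b"),
  ("2", None, "33c"),
  ("2", Some(6743), "44d")
).toDF("id", "startValue", "endValue")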

Answer 1

You can coalesce the nulls with the other startValue found in the same id partition, minus 10. Because max ignores nulls, the window maximum is the one non-null startValue for that id:

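import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, max}

// Keep startValue where present; otherwise use the partition's non-null value minus 10.
val df2 = df.withColumn(
    "startValue",
    coalesce($"startValue", max($"startValue").over(Window.partitionBy("id")) - 10)
)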

df2.show
+---+----------+--------+
| id|startValue|endValue|
+---+----------+--------+
|  1|       544|     11a|
|  1|       554|     22b|
|  2|      6733|     33c|
|  2|      6743|     44d|
+---+----------+--------+
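
As a variation on the answer above (not the answerer's own code), the same fill can be written with first(..., ignoreNulls = true) instead of max, which states the intent "take the one non-null value in this id" more directly. This is a minimal sketch that assumes exactly one non-null startValue per id, as the question states; the local SparkSession and the names sample, byId, and filled are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, first}

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder().master("local[*]").appName("fill-null-sketch").getOrCreate()
import spark.implicits._

// Same sample data as in the question, with Option[Int] for the nullable column.
val sample = Seq[(String, Option[Int], String)](
  ("1", None, "11a"),
  ("1", Some(554), "22b"),
  ("2", None, "33c"),
  ("2", Some(6743), "44d")
).toDF("id", "startValue", "endValue")

// No orderBy, so the window frame covers the whole id partition.
val byId = Window.partitionBy("id")

// Keep startValue where present; otherwise take the partition's first
// non-null startValue and subtract 10.
val filled = sample.withColumn(
  "startValue",
  coalesce($"startValue", first($"startValue", ignoreNulls = true).over(byId) - 10)
)

filled.orderBy("id", "startValue").show()
```

If an id could carry more than one non-null startValue, first would pick an arbitrary one, so max (or an explicit orderBy on the window) would be the safer choice in that case.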
