I have a dataframe like so:
id | startValue | endValue |
---|---|---|
1 | null | 11a |
1 | 554 | 22b |
2 | null | 33c |
2 | 6743 | 44d |
Assume that we'll always have 2 rows with the same id, one where startValue has a value and another where startValue is null. I'd like to replace the null values in startValue with startValue - 10, where the startValue is taken from the row with the same id where startValue is not null.
Expected output:
id | startValue | endValue |
---|---|---|
1 | 544 | 11a |
1 | 554 | 22b |
2 | 6733 | 33c |
2 | 6743 | 44d |
Sample data frame:
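// Mixing null with Int makes the column type Any, which toDF cannot encode,
// so the nullable column is modelled as Option[Int].
// Outside spark-shell this also needs: import spark.implicits._
val df = Seq[(String, Option[Int], String)](
  ("1", None, "11a"),
  ("1", Some(554), "22b"),
  ("2", None, "33c"),
  ("2", Some(6743), "44d")
).toDF("id", "startValue", "endValue")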
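You can coalesce the nulls with the other startValue found in the same partition of id, minus 10. Because max ignores nulls, taking it over a window partitioned by id returns the one non-null startValue for that id: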
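import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, max}
// spark-shell already imports spark.implicits._, which provides $"..." and toDF.

val df2 = df.withColumn(
  "startValue",
  // Keep startValue when present; otherwise use the partition's non-null value minus 10.
  coalesce($"startValue", max($"startValue").over(Window.partitionBy("id")) - 10)
)
df2.show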
+---+----------+--------+
| id|startValue|endValue|
+---+----------+--------+
| 1| 544| 11a|
| 1| 554| 22b|
| 2| 6733| 33c|
| 2| 6743| 44d|
+---+----------+--------+
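If you'd rather not rely on max to pick the non-null value, a variant with first(..., ignoreNulls = true) over the same window should behave the same way under the stated assumption of exactly one non-null startValue per id. The following is only a minimal standalone sketch: the object name FillStartValue, the helper fillStartValue, and the local[1] session are assumptions made for illustration and are not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, first}

// Hypothetical wrapper object, introduced only for this sketch.
object FillStartValue {

  // Replace a null startValue with (non-null startValue of the same id) - 10.
  // Assumes each id has exactly one non-null startValue.
  def fillStartValue(df: DataFrame): DataFrame = {
    val byId = Window.partitionBy("id")
    df.withColumn(
      "startValue",
      coalesce(
        col("startValue"),
        // first with ignoreNulls = true skips null rows in the partition
        first(col("startValue"), ignoreNulls = true).over(byId) - 10
      )
    )
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("fill-start-value")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(String, Option[Int], String)](
      ("1", None, "11a"),
      ("1", Some(554), "22b"),
      ("2", None, "33c"),
      ("2", Some(6743), "44d")
    ).toDF("id", "startValue", "endValue")

    fillStartValue(df).show()
    spark.stop()
  }
}
```

If an id could have more than one non-null startValue, first would pick an arbitrary one unless the window is ordered, so max (or min) is the more predictable choice in that case.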