How to replace any null in pyspark df with value from the below row, same column
Let's say I have a pyspark DF:
| Column A | Column B |
| -------- | -------- |
| val1 | val1B |
| null | val2B |
| val2 | null |
| val3 | val3B |
Can someone help me with replacing any null value in any column (for the whole df) with the value right below it? So the final table should look like this:
| Column A | Column B |
| -------- | -------- |
| val1 | val1B |
| val2 | val2B |
| val2 | val3B |
| val3 | val3B |
How could this be done? Can I get a code demo if possible? Thank you!
All I've really gotten through is numbering all the rows and creating a condition to find the rows that contain null values. So I'm left with a table like this:
| Column A | Column B | row_num |
| -------- | -------- | ------- |
| null | val2B | 2 |
| val2 | null | 3 |
But I don't think this step is needed. I'm stuck as to what to do.
Use a list comprehension to coalesce each column with the lead window function, which looks one row ahead. Code below:
from pyspark.sql.functions import coalesce, col, lead, monotonically_increasing_id
from pyspark.sql.window import Window

# Order rows by a stable id, then replace each null with the value one row below it
w = Window.partitionBy().orderBy(monotonically_increasing_id())
df.select(*[coalesce(col(x), lead(x).over(w)).alias(x) for x in df.columns]).show()
+--------+--------+
|Column A|Column B|
+--------+--------+
| val1| val1B|
| val2| val2B|
| val2| val3B|
| val3| val3B|
+--------+--------+
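To see the core idea without a Spark session, here is a minimal plain-Python sketch of the same one-step backward fill: for each cell that is null, take the value from the row directly below in the same column. `fill_nulls_from_below` is a hypothetical helper name for illustration, not part of any library. Note that, like `lead(1)` in the Spark answer, a single pass only fixes one null in a run; consecutive nulls in the same column would need repeated passes.

```python
def fill_nulls_from_below(rows):
    """Replace each None with the value one row below it, same column.

    rows: list of equal-length lists; None marks a null cell.
    Mirrors coalesce(col, lead(col, 1)): only looks one row ahead.
    """
    filled = []
    n = len(rows)
    for i, row in enumerate(rows):
        new_row = []
        for j, value in enumerate(row):
            if value is None and i + 1 < n:
                value = rows[i + 1][j]  # take the cell directly below
            new_row.append(value)
        filled.append(new_row)
    return filled


rows = [
    ["val1", "val1B"],
    [None, "val2B"],
    ["val2", None],
    ["val3", "val3B"],
]
print(fill_nulls_from_below(rows))
```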