I have a dataframe like this (pyspark)
customer | date | col_x | day_number |
---|---|---|---|
1 | 5/10/2022 | ...(val doesn't matter) | NULL |
1 | 5/11/2022 | ... | NULL |
1 | 5/12/2022 | ... | 0 |
1 | 5/13/2022 | ... | NULL |
1 | 5/14/2022 | ... | NULL |
1 | 5/15/2022 | ... | 3 |
1 | 5/16/2022 | ... | 4 |
1 | 5/17/2022 | ... | NULL |
1 | 5/18/2022 | ... | 6 |
2 | 5/10/2022 | ...(val doesn't matter) | NULL |
2 | 5/11/2022 | ... | NULL |
2 | 5/12/2022 | ... | 0 |
2 | 5/13/2022 | ... | NULL |
2 | 5/14/2022 | ... | 2 |
2 | 5/15/2022 | ... | 3 |
2 | 5/16/2022 | ... | NULL |
2 | 5/17/2022 | ... | 5 |
2 | 5/18/2022 | ... | NULL |
and this pattern repeats for a number of customer.
What I would like to have is to fill the number between (for example for customer 1 on 5/13 and 5/14 the day number would be 1 and 2) and it should always be increasing by 1 since EVERY date value is populated.
I have tried a lag function and I believe this would work (code below) but I am wondering if there is any other way to do this, perhaps with row_number()
starting from the first non-null day_number
value for each grouping.
This is what I have tried
import pyspark.sql.functions as F
window = Window.partitionBy("customer").orderBy("date")
#LAG FUNCTION SEEMS TO WORK
df.select(
F.coalesce(F.col("day_number"), F.lit(F.lag(F.col("day_number")).over(window)) + 1)
)
#SOME ROW_NUMBER() FUNCTION , DOES NOT WORK --
df.select(
F.coalesce(F.col("day_number"), F.row_number().over(window) + F.lit(F.min("day_number").over(window) - 1)).alias("day_number")
)
Would be curious if/what the other solutions are out there, I also was thinking there is a way to use F.last()
, but there would need to be a way to add the offset (+1,+2,+3) etc...
Thanks!
one solution can definitely be done this way:
window = Window.partitionBy("customer").orderBy("date")
df.select(F.when(F.datediff(F.col("date"),F.first(F.when(F.col("day_number") == 0, F.col("date")), ignorenulls=True).over(fill_window))>=0,F.datediff(F.col("date"),F.first(F.when(F.col("day_number") == 0, F.col("date")), ignorenulls=True).over(fill_window))).alias("day_number_test")
but it doesn't seem like the prettiest solution, I'd be curious to know if any others exist which incorporate into a window function right away.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.