This question was answered previously for R. I want to do exactly the same thing, but using PySpark.
```r
# make an index of the latest events
last_event_index <- cumsum(df$event) + 1

# shift it by one to the right
last_event_index <- c(1, last_event_index[1:(length(last_event_index) - 1)])

# get the dates of the events and index the vector with last_event_index;
# an NA is prepended as the first date because there was no prior event
last_event_date <- c(as.Date(NA), df[which(df$event == 1), "date"])[last_event_index]

# subtract the date of the last event from each row's date
df$tae <- df$date - last_event_date
df
```
| # | date       | event | tae      |
|---|------------|-------|----------|
| 1 | 2000-07-06 | 0     | NA days  |
| 2 | 2000-09-15 | 0     | NA days  |
| 3 | 2000-10-15 | 1     | NA days  |
| 4 | 2001-01-03 | 0     | 80 days  |
| 5 | 2001-03-17 | 1     | 153 days |
| 6 | 2001-05-23 | 1     | 67 days  |
| 7 | 2001-08-26 | 0     | 95 days  |
Use the `last` window function over a preceding-rows frame to get the previous event date, then `datediff` against the current row's date:
```python
from pyspark.sql import functions as F, Window

# frame of all rows strictly before the current one, ordered by date
w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, -1)

result = df.withColumn(
    "last_event_date",
    # most recent non-null event date within the preceding frame
    F.last(F.when(F.col("event") == 1, F.col("date")), ignorenulls=True).over(w)
).withColumn(
    "tae",
    F.concat(
        # coalesce to the literal "NA" when no prior event exists
        F.coalesce(F.datediff("date", "last_event_date"), F.lit("NA")),
        F.lit(" days")
    )
).drop("last_event_date")

result.show()
```
```
#+----------+-----+--------+
#|      date|event|     tae|
#+----------+-----+--------+
#|2000-07-06|    0| NA days|
#|2000-09-15|    0| NA days|
#|2000-10-15|    1| NA days|
#|2001-01-03|    0| 80 days|
#|2001-03-17|    1|153 days|
#|2001-05-23|    1| 67 days|
#|2001-08-26|    0| 95 days|
#+----------+-----+--------+
```
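For intuition, the window computation can be mimicked in plain Python (a sketch of the logic only, not part of the Spark answer): for each row, remember the date of the most recent preceding row with `event == 1`, then take the day difference.

```python
from datetime import date

rows = [
    (date(2000, 7, 6), 0),
    (date(2000, 9, 15), 0),
    (date(2000, 10, 15), 1),
    (date(2001, 1, 3), 0),
    (date(2001, 3, 17), 1),
    (date(2001, 5, 23), 1),
    (date(2001, 8, 26), 0),
]

def time_after_event(rows):
    """Days since the last preceding event row; None when no prior event exists."""
    out = []
    last_event_date = None  # plays the role of last(..., ignorenulls=True) over the preceding frame
    for d, event in rows:
        out.append((d - last_event_date).days if last_event_date is not None else None)
        if event == 1:
            last_event_date = d  # only event rows update the remembered date
    return out

print(time_after_event(rows))
# [None, None, None, 80, 153, 67, 95]
```

The `None` entries correspond to the `NA days` rows in the output above.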