Does anyone know how to apply a churn rule to the dataset below? The goal is to create a column called "churn" that is true whenever an id has remained "false" in the "using" column for more than 30 consecutive days, and false otherwise.
I have already tried using a window function, but without success.
Create a window partitioned by id and ordered by date. Set the frame to span the current row and the previous 30 rows; note that rowsBetween counts rows, not calendar days, so this corresponds to a 30-day lookback only if there is one row per id per day. To create the column, take the max of the using column, which will return True if any row in the frame has using == True. Finally, negate that value with ~ because you're interested only in the case where NO True is found within the window.
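from pyspark.sql import Window, functions as F

# Frame: the current row plus the 30 rows before it, per id, in date order.
w = (
    Window.partitionBy("id")
    .orderBy("reference_date")
    .rowsBetween(start=Window.currentRow - 30, end=Window.currentRow)
)

# max over booleans is True if any row in the frame is True; ~ negates it.
# display() is available in Databricks notebooks; use show() elsewhere.
df.withColumn('churn', ~F.max('using').over(w)).display()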
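To see what this window computes without a Spark session, here is a minimal pure-Python sketch of the same rule for a single id. It assumes one row per day, already sorted by date; the function name churn_flags and the sample data are hypothetical and only serve as illustration.

```python
def churn_flags(using, lookback=30):
    """Mimic ~max(using) over rowsBetween(currentRow - lookback, currentRow).

    `using` is a list of booleans for ONE id, sorted by date,
    assumed to contain exactly one entry per day.
    """
    flags = []
    for i in range(len(using)):
        # Current row plus up to `lookback` preceding rows
        # (fewer at the start of the history, as in Spark).
        frame = using[max(0, i - lookback): i + 1]
        # Churn is True only when no True appears in the frame.
        flags.append(not any(frame))
    return flags


# Hypothetical history: active for 2 days, then inactive for 35 days.
using = [True, True] + [False] * 35
flags = churn_flags(using)

# Index of the first row flagged as churn.
first_churn = flags.index(True)
print(first_churn)
```

With this sample the first flagged row is index 32, which is the 31st consecutive False day. Note that rows near the start of an id's history have a shorter frame, so an id whose very first row is False is flagged immediately; whether that is acceptable depends on how churn is defined for new ids.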