
How to calculate churn in PySpark

Does anyone know how to apply a churn rule to the dataset below? The goal is to create a column called "churn" that is true whenever an id remains "false" for more than 30 consecutive days in the "using" column, and false otherwise.

[screenshot of the sample dataset]
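The screenshot is not reproduced here. Based on the column names used in the answer below, a minimal stand-in dataset might look like the following sketch (the ids, dates, and flags are invented for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the missing screenshot: one row per id per day,
# with a boolean "using" flag and the column names the answer's code expects.
df = spark.createDataFrame(
    [
        ("a", "2023-01-01", True),
        ("a", "2023-01-02", False),
        ("b", "2023-01-01", False),
    ],
    ["id", "reference_date", "using"],
).withColumn("reference_date", F.to_date("reference_date"))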

I have already tried working with a window function, but without success.

Create a window partitioned by the id and ordered by date, with a frame spanning the current row and the 30 preceding rows. To create the column, take the max of the using column, which returns True if any row in the frame has using == True. Finally, negate that value with ~, because you are interested only in the case where no True appears within the 30-day window. (Note that a row-based frame corresponds to 30 days only if each id has exactly one row per day; see the variant after the code.)

from pyspark.sql import Window, functions as F

# Frame: the current row plus the 30 preceding rows, per id, in date order
w = (
    Window
    .partitionBy("id")
    .orderBy("reference_date")
    .rowsBetween(-30, Window.currentRow)
)

# max over a boolean column is True if any row in the frame is True;
# negating it makes churn True only when "using" was False across the frame
df.withColumn("churn", ~F.max("using").over(w)).show()
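The rowsBetween frame counts rows, not calendar days, so it only matches a true 30-day window when the data has one row per id per day with no gaps. If dates can be missing, a range-based frame keyed on days is safer. The following variant is a sketch of that idea, not part of the original answer:

from pyspark.sql import Window, functions as F

# Order by days-since-epoch so rangeBetween measures calendar days,
# covering the last 30 days even when some dates are absent for an id.
w_days = (
    Window
    .partitionBy("id")
    .orderBy(F.datediff("reference_date", F.lit("1970-01-01")))
    .rangeBetween(-30, Window.currentRow)
)

df.withColumn("churn", ~F.max("using").over(w_days)).show()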
