
How to calculate churn in PySpark

Does anyone know how to apply a churn rule to the dataset below? The goal is to create a column called "churn" that is true whenever an id remains "false" for more than 30 consecutive days in the "using" column, and false otherwise.

[screenshot of the sample dataset]
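The screenshot is not reproduced here. Based on the column names used in the answer below, a minimal stand-in dataset might look like the following sketch (the ids, dates, and flags are invented for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the missing screenshot: one row per id per day,
# with a boolean "using" flag and the column names the answer's code expects.
df = spark.createDataFrame(
    [
        ("a", "2023-01-01", True),
        ("a", "2023-01-02", False),
        ("b", "2023-01-01", False),
    ],
    ["id", "reference_date", "using"],
).withColumn("reference_date", F.to_date("reference_date"))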

I have already tried working with a window function, but without success.

Create a window partitioned by the id and ordered by date, with a frame spanning the current row and the 30 preceding rows. To create the column, take the max of the using column, which returns True if any row in the frame has using == True. Finally, negate that value with ~, because you are interested only in the case where no True appears within the 30-day window. (Note that a row-based frame corresponds to 30 days only if each id has exactly one row per day; see the variant after the code.)

from pyspark.sql import Window, functions as F

# Frame: the current row plus the 30 preceding rows, per id, in date order
w = (
    Window
    .partitionBy("id")
    .orderBy("reference_date")
    .rowsBetween(-30, Window.currentRow)
)

# max over a boolean column is True if any row in the frame is True;
# negating it makes churn True only when "using" was False across the frame
df.withColumn("churn", ~F.max("using").over(w)).show()
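The rowsBetween frame counts rows, not calendar days, so it only matches a true 30-day window when the data has one row per id per day with no gaps. If dates can be missing, a range-based frame keyed on days is safer. The following variant is a sketch of that idea, not part of the original answer:

from pyspark.sql import Window, functions as F

# Order by days-since-epoch so rangeBetween measures calendar days,
# covering the last 30 days even when some dates are absent for an id.
w_days = (
    Window
    .partitionBy("id")
    .orderBy(F.datediff("reference_date", F.lit("1970-01-01")))
    .rangeBetween(-30, Window.currentRow)
)

df.withColumn("churn", ~F.max("using").over(w_days)).show()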
