I have a pyspark dataframe with 3 columns:
ID , each appearing multiple times; DATE ; DELAY , 0 if this bill was payed on time, 1 otherwise.
It's already ordered by ID and DATE . I need to create a column named CONSECUTIVE that shows how many consecutive bills were paid consecutively with DELAY=1 for each ID.
Example of data, and expected result:
ID | DATE | DELAY | CONSECUTIVE
101 | 1 | 1 | 1
101 | 2 | 1 | 2
101 | 3 | 1 | 3
101 | 4 | 0 | 0
101 | 5 | 1 | 1
101 | 6 | 1 | 2
213 | 1 | 1 | 1
213 | 2 | 1 | 2
Is there a way to do it without using Pandas? If so, how do I do it?
You can do this with 3 transformation with help of window.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
(101, 1, 1),
(101, 2, 1), # dasd
(101, 3, 0),
(101, 4, 1)
], ["id", 'date', 'delay'])
window = Window.partitionBy('id').orderBy('date')
last_value = F.last('rank').over(window.rowsBetween(-2, -1))
consecutive = F.when( F.col('delay')==0, 0) \
.otherwise( F.when(F.col('last_rank').isNull(), 1) \
.otherwise( F.col('last_rank')+1))
df \
.withColumn('rank', F.row_number().over(window)) \
.withColumn('rank', F.when(F.col('delay')!=0, F.col('rank')).otherwise(0)) \
.withColumn('last_rank', last_value) \
.withColumn('consecutive', consecutive).show()
results:
+---+----+-----+----+---------+-----------+
| id|date|delay|rank|last_rank|consecutive|
+---+----+-----+----+---------+-----------+
|101| 1| 1| 1| null| 1|
|101| 1| 1| 2| 1| 2|
|101| 1| 0| 0| 2| 0|
|101| 1| 1| 4| 0| 1|
+---+----+-----+----+---------+-----------+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.