Increment value in each partition based on change in one column in pyspark
I want to create a new column (number) for each partition of a PySpark DataFrame that increments whenever the value in the year column changes.
Original data:
name period year
A 1 2010
A 1 2010
A 1 2011
A 1 2013
B 1 2018
B 1 2019
C 2 2018
C 2 2018
C 2 2019
Expected output:
name period year number
A 1 2010 1
A 1 2010 1
A 1 2011 2
A 1 2013 3
B 1 2018 1
B 1 2019 2
C 2 2018 1
C 2 2018 1
C 2 2019 2
Create the sample dataframe you provided:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
data = [{"name":'A', "period":1, "year":2010},
{"name":'A', "period":1, "year":2010},
{"name":'A', "period":1, "year":2011},
{"name":'A', "period":1, "year":2013},
{"name":'B', "period":1, "year":2018},
{"name":'B', "period":1, "year":2019},
{"name":'C', "period":2, "year":2018},
{"name":'C', "period":2, "year":2018},
{"name":'C', "period":2, "year":2019}]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
Define a window that partitions the dataframe by name and orders by year, then apply dense_rank over that window:
window = (Window.partitionBy('name').orderBy(F.col('year').asc()))
df = df.withColumn('number', F.dense_rank().over(window)).orderBy("name", "year")
Result: the resulting dataframe matches the expected output above — within each name partition, number increases by 1 each time year changes, and ties (duplicate years) share the same number.
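As a sanity check, the same increment-on-change logic can be sketched in plain Python, without Spark. This is a hypothetical helper (dense_rank_by_year is not part of any library) that mimics what dense_rank does over a window partitioned by name and ordered by year, assuming the rows are already sorted that way:

```python
from itertools import groupby

# Rows as (name, period, year), already sorted by name and year,
# mirroring the ordering imposed by the window specification.
rows = [
    ("A", 1, 2010), ("A", 1, 2010), ("A", 1, 2011), ("A", 1, 2013),
    ("B", 1, 2018), ("B", 1, 2019),
    ("C", 2, 2018), ("C", 2, 2018), ("C", 2, 2019),
]

def dense_rank_by_year(rows):
    """Hypothetical sketch: within each name group, the counter
    increments only when year changes, like dense_rank."""
    out = []
    for name, group in groupby(rows, key=lambda r: r[0]):
        rank, prev_year = 0, None
        for _, period, year in group:
            if year != prev_year:  # increment only on a new year value
                rank += 1
                prev_year = year
            out.append((name, period, year, rank))
    return out

for row in dense_rank_by_year(rows):
    print(row)
```

Duplicate years keep the same rank and there are no gaps after ties, which is exactly why dense_rank (rather than rank or row_number) fits this problem.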