
Increment value in each partition based on change in one column in PySpark

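I want to add a new column (number) to a PySpark DataFrame that restarts at 1 in each partition and increments whenever the value in the year column changes.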

Original data:

name period year 
A    1      2010
A    1      2010
A    1      2011
A    1      2013
B    1      2018
B    1      2019
C    2      2018
C    2      2018
C    2      2019

Expected output:

name period year  number
A    1      2010  1
A    1      2010  1
A    1      2011  2
A    1      2013  3
B    1      2018  1
B    1      2019  2
C    2      2018  1
C    2      2018  1
C    2      2019  2

Create the sample DataFrame you provided:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

data = [{"name":'A', "period":1, "year":2010},
        {"name":'A', "period":1, "year":2010},
        {"name":'A', "period":1, "year":2011},
        {"name":'A', "period":1, "year":2013},
        {"name":'B', "period":1, "year":2018},
        {"name":'B', "period":1, "year":2019},
        {"name":'C', "period":2, "year":2018},
        {"name":'C', "period":2, "year":2018},
        {"name":'C', "period":2, "year":2019}]

# Reuse the active session, or start one if none exists
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data)

Define a window that partitions the DataFrame by name and orders each partition by year, then compute dense_rank over that window:

window = (Window.partitionBy('name').orderBy(F.col('year').asc()))

df = df.withColumn('number', F.dense_rank().over(window)).orderBy("name", "year")

Result: the number column matches the expected output above.
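Why dense_rank gives the wanted numbering: once a partition is ordered by year, equal years sit next to each other, so "increment when year changes" is the same as ranking the distinct years. The plain-Python sketch below illustrates that logic without Spark. The helper name dense_rank_by_name and the tuple layout are hypothetical, chosen only for this illustration; it is not how Spark implements the function.

```python
from itertools import groupby

# Rows mirror the sample data from the question: (name, period, year).
rows = [
    ("A", 1, 2010), ("A", 1, 2010), ("A", 1, 2011), ("A", 1, 2013),
    ("B", 1, 2018), ("B", 1, 2019),
    ("C", 2, 2018), ("C", 2, 2018), ("C", 2, 2019),
]


def dense_rank_by_name(rows):
    """Hypothetical helper: number the distinct years within each name, starting at 1."""
    result = []
    # Sort by (name, year) so each name forms one contiguous, year-ordered group.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        rank = 0
        previous_year = None
        for row in group:
            # Increment only when the year differs from the previous row's year.
            if row[2] != previous_year:
                rank += 1
                previous_year = row[2]
            result.append(row + (rank,))
    return result


numbered = dense_rank_by_name(rows)
for row in numbered:
    print(row)
```

The last element of each printed tuple plays the role of the number column. Note that gaps between years (2011 to 2013 for A) still advance the counter by exactly one, which is what distinguishes a dense rank from a plain rank.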
