PySpark ranking over a partition: where did I go wrong?
I have a dataset df as follows:
ID date class
1 2020/01/02 [math,english]
1 2020/01/03 [math,english]
1 2020/01/04 [math,english]
2 2020/01/02 [math,english]
2 2020/01/03 [math,english,art]
2 2020/01/04 [math,english]
2 2020/01/05 [math,english,art]
2 2020/01/06 [math,art]
2 2020/01/07 [math,art]
2 2020/01/08 [math,english,art]
My current code is:
df.withColumn("c_order", rank()
    .over(Window.partitionBy("ID", "date")
    .orderBy("class")))
I also tried dense_rank() and row_number(), but neither gives the output I want.
df.withColumn("c_order", dense_rank()
    .over(Window.partitionBy("ID", "date")
    .orderBy("class")))
df.withColumn("c_order", row_number()
    .over(Window.partitionBy("ID", "date")
    .orderBy("class")))
My current output looks like this:
ID date class c_order
1 2020/01/02 [math,english] 1
1 2020/01/03 [math,english] 1
1 2020/01/04 [math,english] 1
2 2020/01/02 [math,english] 1
2 2020/01/03 [math,english,art] 1
2 2020/01/04 [math,english] 1
2 2020/01/05 [math,english,art] 1
2 2020/01/06 [math,art] 1
2 2020/01/07 [math,art] 1
2 2020/01/08 [math,english,art] 1
The output I want is:
ID date class c_order
1 2020/01/02 [math,english] 1
1 2020/01/03 [math,english] 1
1 2020/01/04 [math,english] 1
2 2020/01/02 [math,english] 1
2 2020/01/03 [math,english,art] 2
2 2020/01/04 [math,english] 3
2 2020/01/05 [math,english,art] 4
2 2020/01/06 [math,art] 5
2 2020/01/07 [math,art] 5
2 2020/01/08 [math,english,art] 6
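c_order should increase only when class changes. Any idea where I went wrong?
Thanks!
You can't do this with ranking alone. Partitioning by both ID and date puts every row in its own partition, so every rank comes out as 1. Instead, you need to compare each row with the previous row of the same ID (using lag) to check when class changes, then take a running sum of those changes.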
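from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'diff',
    # 1 when class differs from the previous row of the same ID, else 0;
    # lag is null on the first row of each ID, so coalesce turns that into False
    F.coalesce(
        F.col('class') != F.lag('class').over(Window.partitionBy('ID').orderBy('date')),
        F.lit(False)
    ).cast('int')
).withColumn(
    'c_order',
    # running sum of the change flags per ID, shifted to start at 1
    F.sum('diff').over(Window.partitionBy('ID').orderBy('date')) + 1
)

df2.show()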
+---+----------+------------------+----+-------+
| ID| date| class|diff|c_order|
+---+----------+------------------+----+-------+
| 1|2020/01/02| [math,english]| 0| 1|
| 1|2020/01/03| [math,english]| 0| 1|
| 1|2020/01/04| [math,english]| 0| 1|
| 2|2020/01/02| [math,english]| 0| 1|
| 2|2020/01/03|[math,english,art]| 1| 2|
| 2|2020/01/04| [math,english]| 1| 3|
| 2|2020/01/05|[math,english,art]| 1| 4|
| 2|2020/01/06| [math,art]| 1| 5|
| 2|2020/01/07| [math,art]| 0| 5|
| 2|2020/01/08|[math,english,art]| 1| 6|
+---+----------+------------------+----+-------+
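For readers who want to check the logic without a Spark session, here is a minimal plain-Python sketch of the same idea. It is only an illustration, not part of the original answer: the `rows` list is copied from the question's data, and the helper name `change_counter` is hypothetical. It first shows that every (ID, date) pair occurs exactly once, which is why ranking inside that partition always yields 1, and then it mimics lag plus a running sum per ID.

```python
from collections import Counter
from itertools import groupby

# Sample rows copied from the question: (ID, date, class)
rows = [
    (1, "2020/01/02", ["math", "english"]),
    (1, "2020/01/03", ["math", "english"]),
    (1, "2020/01/04", ["math", "english"]),
    (2, "2020/01/02", ["math", "english"]),
    (2, "2020/01/03", ["math", "english", "art"]),
    (2, "2020/01/04", ["math", "english"]),
    (2, "2020/01/05", ["math", "english", "art"]),
    (2, "2020/01/06", ["math", "art"]),
    (2, "2020/01/07", ["math", "art"]),
    (2, "2020/01/08", ["math", "english", "art"]),
]

# Each (ID, date) partition holds a single row, so any rank inside it is 1.
partition_sizes = Counter((id_, date) for id_, date, _ in rows)


def change_counter(rows):
    """Hypothetical helper: number the rows of each ID, bumping the
    counter whenever class differs from the previous row of that ID."""
    result = []
    # Sort by ID, then date, mirroring partitionBy('ID').orderBy('date').
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        previous = None  # plays the role of lag('class'); None on the first row
        c_order = 1
        for id_, date, cls in group:
            # Count a change only when a previous row exists and differs.
            if previous is not None and cls != previous:
                c_order += 1
            result.append((id_, date, cls, c_order))
            previous = cls
    return result


c_orders = [row[3] for row in change_counter(rows)]
print(c_orders)
```

The date strings sort correctly here only because they are zero-padded in year/month/day order; with other date formats, parsing them to real dates before sorting would be the safer choice.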