How to apply a window function on one column in PySpark?
You can achieve this with dense_rank; see the following example:
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("def", None, 20, datetime(2017, 3, 12, 3, 19, 58)),
    ("ab", None, 20, datetime(2017, 3, 12, 3, 21, 30)),
    ("test", 20, None, datetime(2017, 3, 13, 3, 29, 40)),
    ("def", 20, None, datetime(2017, 3, 13, 3, 31, 23)),
]
df = spark.createDataFrame(data, ["tbl_name", "2019", "2020", "updated_on"])
df.show()
df.show()
+--------+----+----+-------------------+
|tbl_name|2019|2020| updated_on|
+--------+----+----+-------------------+
|     def|null|  20|2017-03-12 03:19:58|
| ab|null| 20|2017-03-12 03:21:30|
| test| 20|null|2017-03-13 03:29:40|
| def| 20|null|2017-03-13 03:31:23|
+--------+----+----+-------------------+
Then apply dense_rank:
# wd orders each tbl_name group oldest-first (to rank the earliest row 1),
# wa orders newest-first (so F.first picks the most recent non-null value).
wd = Window.partitionBy("tbl_name").orderBy(F.col("updated_on").asc())
wa = Window.partitionBy("tbl_name").orderBy(F.col("updated_on").desc())

df2 = df.select(
    "tbl_name",
    F.first("2019", ignorenulls=True).over(wa).alias("2019"),
    F.first("2020", ignorenulls=True).over(wa).alias("2020"),
    "updated_on",
    F.dense_rank().over(wd).alias("rank"),
).filter(F.col("rank") == 1).drop("rank")
Resulting in:
+--------+----+----+-------------------+
|tbl_name|2019|2020| updated_on|
+--------+----+----+-------------------+
| ab|null| 20|2017-03-12 03:21:30|
| test| 20|null|2017-03-13 03:29:40|
| def| 20| 20|2017-03-12 03:19:58|
+--------+----+----+-------------------+
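To see what the two windows compute without a Spark cluster, here is a plain-Python sketch of the same logic: per `tbl_name`, scan rows newest-first and take the first non-null value for each year column (what `F.first(..., ignorenulls=True)` over `wa` does), and keep the earliest `updated_on` (the row where `dense_rank` over `wd` equals 1). The helper names here are illustrative, not part of the PySpark API.

```python
from datetime import datetime

data = [
    ("def", None, 20, datetime(2017, 3, 12, 3, 19, 58)),
    ("ab", None, 20, datetime(2017, 3, 12, 3, 21, 30)),
    ("test", 20, None, datetime(2017, 3, 13, 3, 29, 40)),
    ("def", 20, None, datetime(2017, 3, 13, 3, 31, 23)),
]

def first_non_null(values):
    # Mirrors F.first(col, ignorenulls=True): first value that is not None.
    return next((v for v in values if v is not None), None)

# Group rows by tbl_name (the partitionBy key).
groups = {}
for name, y2019, y2020, ts in data:
    groups.setdefault(name, []).append((y2019, y2020, ts))

result = {}
for name, rows in groups.items():
    rows.sort(key=lambda r: r[2], reverse=True)  # newest first, like window wa
    result[name] = (
        first_non_null(r[0] for r in rows),      # latest non-null "2019"
        first_non_null(r[1] for r in rows),      # latest non-null "2020"
        min(r[2] for r in rows),                 # earliest updated_on (rank 1 over wd)
    )
```

Running this reproduces the table above: `def` gets both year values filled in while keeping its earliest timestamp.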