Pyspark - Drop Duplicates of group and keep first row

How can I take the max of df.value within the same day and per group, and then drop the duplicate values in df.max_value, keeping only the first?

+---+-------------------+-----+----------+
| id|               date|value| date_only|
+---+-------------------+-----+----------+
| J6|2019-10-01 00:00:00| Null|2016-10-01| 
| J6|2019-10-01 01:00:00|    1|2016-10-01|
| J6|2019-10-01 12:30:30|    3|2016-10-01|
| J6|2019-10-01 12:30:30|    3|2016-10-01|
| J2|2019-10-06 00:00:00|    9|2016-10-06|
| J2|2019-10-06 09:20:00|    9|2016-10-06|
| J2|2019-10-06 09:20:00|    1|2016-10-06|
| J2|2019-10-06 09:20:00|    9|2016-10-06|
+---+-------------------+-----+----------+

Desired dataframe:

+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| Null|2016-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2016-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     Null|
| J2|2019-10-06 00:00:00|    9|2016-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     Null|
| J2|2019-10-06 09:20:00|    1|2016-10-06|     Null|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     Null|
+---+-------------------+-----+----------+---------+

Use max() and row_number() as window functions:

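from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Ordered window: numbers the rows of each (id, date_only) group by timestamp.
w_ordered = Window.partitionBy("id", "date_only").orderBy("date")
# Unordered window: spans the whole group, so max() sees every row in it.
w_group = Window.partitionBy("id", "date_only")

# Put the group max on the first row only; when() without otherwise() gives null elsewhere.
df.withColumn(
    "max_value",
    F.when(F.row_number().over(w_ordered) == 1, F.max("value").over(w_group)),
).show()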

+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| null|2016-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2016-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     null|
| J2|2019-10-06 00:00:00|    9|2016-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     null|
| J2|2019-10-06 09:20:00|    1|2016-10-06|     null|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     null|
+---+-------------------+-----+----------+---------+
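
To make the semantics of the two windows concrete, here is a minimal plain-Python sketch of the same computation that runs without Spark. It is only an illustration, not part of the original answer: the helper name add_max_value and the tuple layout (id, date, value, date_only) are assumptions made for this example, and None stands in for null.

```python
from itertools import groupby

# Sample rows copied from the question: (id, date, value, date_only).
rows = [
    ("J6", "2019-10-01 00:00:00", None, "2016-10-01"),
    ("J6", "2019-10-01 01:00:00", 1, "2016-10-01"),
    ("J6", "2019-10-01 12:30:30", 3, "2016-10-01"),
    ("J6", "2019-10-01 12:30:30", 3, "2016-10-01"),
    ("J2", "2019-10-06 00:00:00", 9, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 9, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 1, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 9, "2016-10-06"),
]


def group_key(row):
    # Same grouping as partitionBy("id", "date_only").
    return (row[0], row[3])


def add_max_value(rows):
    """Hypothetical helper: append max_value to the first row of each group."""
    result = []
    # Sort by group, then by date, like orderBy("date") inside each partition.
    ordered = sorted(rows, key=lambda r: (group_key(r), r[1]))
    for _, group in groupby(ordered, key=group_key):
        group = list(group)
        # Like Spark's max(), ignore nulls when computing the group maximum.
        values = [r[2] for r in group if r[2] is not None]
        group_max = max(values) if values else None
        for position, row in enumerate(group, start=1):
            # position plays the role of row_number(); only row 1 keeps the max.
            result.append(row + (group_max if position == 1 else None,))
    return result


for row in add_max_value(rows):
    print(row)
```

The sketch sorts the groups by key, so J2 is printed before J6. Spark makes no promise about the order in which groups appear in show() either. When several rows share the earliest timestamp in a group, row_number() may pick any of them as row 1, so adding a tie-breaking column to orderBy is worth considering if that can happen in your data.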
