
Pyspark - Drop Duplicates of group and keep first row

How do I take the max of df.value within each id and day, and then drop the duplicate values of df.max_value so that only the first row of each group keeps it?

+---+-------------------+-----+----------+
| id|               date|value| date_only|
+---+-------------------+-----+----------+
| J6|2019-10-01 00:00:00| Null|2016-10-01|
| J6|2019-10-01 01:00:00|    1|2016-10-01|
| J6|2019-10-01 12:30:30|    3|2016-10-01|
| J6|2019-10-01 12:30:30|    3|2016-10-01|
| J2|2019-10-06 00:00:00|    9|2016-10-06|
| J2|2019-10-06 09:20:00|    9|2016-10-06|
| J2|2019-10-06 09:20:00|    1|2016-10-06|
| J2|2019-10-06 09:20:00|    9|2016-10-06|
+---+-------------------+-----+----------+

Desired DataFrame:

+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| Null|2016-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2016-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     Null|
| J2|2019-10-06 00:00:00|    9|2016-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     Null|
| J2|2019-10-06 09:20:00|    1|2016-10-06|     Null|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     Null|
+---+-------------------+-----+----------+---------+

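Using a combination of max() and row_number(): compute the max over the whole (id, date_only) partition, then keep it only on the row where row_number(), ordered by date, equals 1. F.when() without otherwise() leaves every other row null.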

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One window per (id, day); the ordered variant numbers rows by timestamp.
w_group = Window.partitionBy("id", "date_only")
w_ordered = w_group.orderBy("date")

# Keep the group max only where row_number() == 1; other rows become null.
df.withColumn(
    "max_value",
    F.when(F.row_number().over(w_ordered) == 1, F.max("value").over(w_group)),
).show()

+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| null|2016-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2016-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     null|
| J2|2019-10-06 00:00:00|    9|2016-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     null|
| J2|2019-10-06 09:20:00|    1|2016-10-06|     null|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     null|
+---+-------------------+-----+----------+---------+
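
To make the window logic easier to check by hand, here is a rough plain-Python sketch of the same per-group rule. It is not Spark code and not part of the original answer: the helper name add_first_row_max and the tuple layout are assumptions made for illustration, and the sample rows are copied from the question as-is.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows copied from the question: (id, date, value, date_only).
rows = [
    ("J6", "2019-10-01 00:00:00", None, "2016-10-01"),
    ("J6", "2019-10-01 01:00:00", 1, "2016-10-01"),
    ("J6", "2019-10-01 12:30:30", 3, "2016-10-01"),
    ("J6", "2019-10-01 12:30:30", 3, "2016-10-01"),
    ("J2", "2019-10-06 00:00:00", 9, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 9, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 1, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 9, "2016-10-06"),
]


def add_first_row_max(rows):
    """Hypothetical helper: append max_value, set only on each group's first row."""
    group_key = itemgetter(0, 3)  # (id, date_only), mirrors partitionBy
    # Sort by partition key, then by date, mirroring orderBy("date") in the window.
    ordered = sorted(rows, key=lambda r: (r[0], r[3], r[1]))
    result = []
    for _, group in groupby(ordered, key=group_key):
        group = list(group)
        # Like Spark's max(), skip nulls; an all-null group yields None.
        values = [r[2] for r in group if r[2] is not None]
        group_max = max(values) if values else None
        for position, row in enumerate(group):
            # position == 0 plays the role of row_number() == 1.
            result.append(row + (group_max if position == 0 else None,))
    return result


for row in add_first_row_max(rows):
    print(row)
```

The groups come out sorted by key here (J2 before J6), whereas Spark makes no promise about the order of partitions in show(). Rows that tie on date keep their input order in this sketch; in Spark, row_number() may pick any of the tied rows as the first.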
