PySpark - Drop duplicates of group and keep first row
How can I take the maximum of df.value and drop the duplicate values of df.max_value (keeping the first) within the same day and per group?
+---+-------------------+-----+----------+
| id|               date|value| date_only|
+---+-------------------+-----+----------+
| J6|2019-10-01 00:00:00| Null|2016-10-01|
| J6|2019-10-01 01:00:00|    1|2016-10-01|
| J6|2019-10-01 12:30:30|    3|2016-10-01|
| J6|2019-10-01 12:30:30|    3|2016-10-01|
| J2|2019-10-06 00:00:00|    9|2016-10-06|
| J2|2019-10-06 09:20:00|    9|2016-10-06|
| J2|2019-10-06 09:20:00|    1|2016-10-06|
| J2|2019-10-06 09:20:00|    9|2016-10-06|
+---+-------------------+-----+----------+
Desired DataFrame:
+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| Null|2016-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2016-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     Null|
| J2|2019-10-06 00:00:00|    9|2016-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     Null|
| J2|2019-10-06 09:20:00|    1|2016-10-06|     Null|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     Null|
+---+-------------------+-----+----------+---------+
Use max() and row_number() over windows partitioned by id and date_only:
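from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Ordered window: numbers the rows of each (id, date_only) group by timestamp.
w = Window.partitionBy("id", "date_only").orderBy("date")
# Unordered window: the max then spans the whole group instead of a running frame.
w_all = Window.partitionBy("id", "date_only")
# when() without otherwise() leaves null on every row except the first.
df.withColumn('max_value', F.when(F.row_number().over(w) == 1,
                                  F.max('value').over(w_all))).show()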
+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| null|2016-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2016-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     null|
| J2|2019-10-06 00:00:00|    9|2016-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     null|
| J2|2019-10-06 09:20:00|    1|2016-10-06|     null|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     null|
+---+-------------------+-----+----------+---------+
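
To check the logic without a Spark session, here is a rough plain-Python sketch of what the window expression computes, using the sample rows from the question. The helper name add_max_value and the tuple layout are illustrative assumptions, not part of the original answer, and this is only meant to mirror the semantics, not to replace the PySpark code.

```python
from collections import defaultdict

# Sample rows copied from the question: (id, date, value, date_only).
rows = [
    ("J6", "2019-10-01 00:00:00", None, "2016-10-01"),
    ("J6", "2019-10-01 01:00:00", 1, "2016-10-01"),
    ("J6", "2019-10-01 12:30:30", 3, "2016-10-01"),
    ("J6", "2019-10-01 12:30:30", 3, "2016-10-01"),
    ("J2", "2019-10-06 00:00:00", 9, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 9, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 1, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 9, "2016-10-06"),
]


def add_max_value(rows):
    """Append max_value: the group max on the earliest row, None elsewhere.

    Hypothetical helper for illustration only.
    """
    # Counterpart of partitionBy("id", "date_only"): row indices per group.
    groups = defaultdict(list)
    for i, (id_, _date, _value, date_only) in enumerate(rows):
        groups[(id_, date_only)].append(i)

    max_values = [None] * len(rows)
    for indices in groups.values():
        # Like Spark's max(), skip nulls; an all-null group yields None.
        values = [rows[i][2] for i in indices if rows[i][2] is not None]
        group_max = max(values) if values else None
        # Counterpart of row_number() == 1 under orderBy("date"): these
        # timestamp strings sort chronologically. min() returns the first
        # occurrence on ties; Spark makes no such promise for tied rows.
        first = min(indices, key=lambda i: rows[i][1])
        max_values[first] = group_max

    return [row + (mv,) for row, mv in zip(rows, max_values)]


result = add_max_value(rows)
for row in result:
    print(row)
```

Each group's maximum lands on its earliest row and every other row gets None, which corresponds to the null entries in the max_value column above.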