Add column to Spark dataframe with the max value that is less than the current record's value
I have a Spark dataframe similar to the following:
id   claim_id              service_date                  status  product
123  10606134411906233408  2018-09-17T00:00:00.000+0000  PD      blue
123  10606147900401009928  2019-01-24T00:00:00.000+0000  PD      yellow
123  10606160940704723994  2019-05-23T00:00:00.000+0000  RV      yellow
123  10606171648203079553  2019-08-29T00:00:00.000+0000  RJ      blue
123  10606186611407311724  2020-01-13T00:00:00.000+0000  PD      blue
Please forgive me for not pasting any code; nothing I tried worked. I want to add a new column containing, for each row, the max(service_date) of the earlier rows whose status is PD and whose product matches the current row's product.
This would be easy with a correlated subquery, but that is inefficient and, moreover, not workable here because Spark does not support this kind of non-equi correlated subquery. Also note that LAG will not work, because I don't always need the immediately preceding record (the offset is dynamic).
The expected output would look like this:
id   claim_id              service_date                  status  product  previous_service_date
123  10606134411906233408  2018-09-17T00:00:00.000+0000  PD      blue
123  10606147900401009928  2019-01-24T00:00:00.000+0000  PD      yellow
123  10606160940704723994  2019-05-23T00:00:00.000+0000  RV      yellow   2019-01-24T00:00:00.000+0000
123  10606171648203079553  2019-08-29T00:00:00.000+0000  RJ      blue     2018-09-17T00:00:00.000+0000
123  10606186611407311724  2020-01-13T00:00:00.000+0000  PD      blue     2018-09-17T00:00:00.000+0000
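To make the requirement concrete, here is a minimal plain-Python sketch (no Spark) of the desired logic, assuming the rows are already sorted by service_date and that ISO date strings compare correctly as strings:

```python
# Sample rows as (service_date, status, product), sorted by service_date.
rows = [
    ("2018-09-17", "PD", "blue"),
    ("2019-01-24", "PD", "yellow"),
    ("2019-05-23", "RV", "yellow"),
    ("2019-08-29", "RJ", "blue"),
    ("2020-01-13", "PD", "blue"),
]

def previous_pd_date(rows):
    """For each row, the max service_date among earlier rows with the
    same product and status == 'PD' (None if there is none)."""
    out = []
    for i, (date, status, product) in enumerate(rows):
        candidates = [d for d, s, p in rows[:i] if s == "PD" and p == product]
        out.append(max(candidates) if candidates else None)
    return out

print(previous_pd_date(rows))
# → [None, None, '2019-01-24', '2018-09-17', '2018-09-17']
```

This matches the expected output above; the window-function answers below compute the same thing at scale.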
You can try the following, using max as a window function together with when (a case expression), restricting the frame to the preceding rows:
from pyspark.sql import functions as F
from pyspark.sql import Window

df = df.withColumn(
    'previous_service_date',
    F.max(
        # only PD rows contribute a date; other statuses yield null
        F.when(F.col("status") == "PD", F.col("service_date")).otherwise(None)
    ).over(
        Window.partitionBy("product")
        .orderBy("service_date")  # a rows-based frame requires an ordering
        .rowsBetween(Window.unboundedPreceding, -1)
    )
)
df.orderBy('service_date').show(truncate=False)
+---+--------------------+-------------------+------+-------+---------------------+
|id |claim_id |service_date |status|product|previous_service_date|
+---+--------------------+-------------------+------+-------+---------------------+
|123|10606134411906233408|2018-09-17 00:00:00|PD |blue |null |
|123|10606147900401009928|2019-01-24 00:00:00|PD |yellow |null |
|123|10606160940704723994|2019-05-23 00:00:00|RV |yellow |2019-01-24 00:00:00 |
|123|10606171648203079553|2019-08-29 00:00:00|RJ |blue |2018-09-17 00:00:00 |
|123|10606186611407311724|2020-01-13 00:00:00|PD |blue |2018-09-17 00:00:00 |
+---+--------------------+-------------------+------+-------+---------------------+
Edit 1
You can also use last, as shown below:
df = df.withColumn(
    'previous_service_date',
    F.last(
        F.when(F.col("status") == "PD", F.col("service_date")).otherwise(None),
        True  # ignorenulls: skip over non-PD rows
    ).over(
        Window.partitionBy("product")
        .orderBy('service_date')
        .rowsBetween(Window.unboundedPreceding, -1)
    )
)
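Why last works here as well as max: within a product partition ordered by service_date, the most recent PD date seen so far is also the largest. A plain-Python sketch of the last-non-null semantics (assuming sorted ISO date strings, as before):

```python
# Sample rows as (service_date, status, product), sorted by service_date.
rows = [
    ("2018-09-17", "PD", "blue"),
    ("2019-01-24", "PD", "yellow"),
    ("2019-05-23", "RV", "yellow"),
    ("2019-08-29", "RJ", "blue"),
    ("2020-01-13", "PD", "blue"),
]

def last_pd_date(rows):
    """Emulate last(..., ignorenulls=True) over the preceding frame:
    the most recently seen PD date for the row's product."""
    last_seen = {}  # product -> last PD date seen so far
    out = []
    for date, status, product in rows:
        out.append(last_seen.get(product))  # value before the current row
        if status == "PD":
            last_seen[product] = date
    return out

print(last_pd_date(rows))
# → [None, None, '2019-01-24', '2018-09-17', '2018-09-17']
```

The output is identical to the max variant; last with ignorenulls merely avoids scanning the whole frame for a maximum.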
Let me know if this works for you.
You can copy your DataFrame to a new DataFrame (df2) and join the two, like this:
(df.join(df2,
         on=[df.service_date > df2.service_date,
             df.product == df2.product,
             df2.status == 'PD'],
         how="left"))
Then drop the duplicated columns and rename df2.service_date to previous_service_date.
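Note that this join produces one output row per matching earlier PD row, so a max aggregation per original row is still needed to collapse the duplicates (that aggregation step is my addition, not spelled out in the answer above). A plain-Python sketch of join-then-aggregate:

```python
from collections import defaultdict

# Sample rows as (service_date, status, product), sorted by service_date.
rows = [
    ("2018-09-17", "PD", "blue"),
    ("2019-01-24", "PD", "yellow"),
    ("2019-05-23", "RV", "yellow"),
    ("2019-08-29", "RJ", "blue"),
    ("2020-01-13", "PD", "blue"),
]

# The non-equi self-join: pair each row with every earlier PD row
# of the same product.
pairs = [
    (i, d2)
    for i, (d1, s1, p1) in enumerate(rows)
    for (d2, s2, p2) in rows
    if d1 > d2 and p1 == p2 and s2 == "PD"
]

# Aggregate: keep the max matched date per original row.
matched = defaultdict(list)
for i, d2 in pairs:
    matched[i].append(d2)
result = [max(matched[i]) if i in matched else None for i in range(len(rows))]

print(result)
# → [None, None, '2019-01-24', '2018-09-17', '2018-09-17']
```

This reproduces the same previous_service_date column as the window-function approaches, at the cost of a quadratic self-join.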