
Spark DataFrames: CASE statement while using Window PARTITION function syntax

I need to check a condition: if ReasonCode is "YES", then use ProcessDate as one of the PARTITION columns; otherwise do not.

The equivalent SQL query is below:

SELECT PNum, SUM(SIAmt) OVER (PARTITION BY PNum,
                                           ReasonCode , 
                                           CASE WHEN ReasonCode = 'YES' THEN ProcessDate ELSE NULL END 
                              ORDER BY ProcessDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) SumAmt 
from TABLE1

So far I have tried the query below, but I am unable to incorporate the condition "CASE WHEN ReasonCode = 'YES' THEN ProcessDate ELSE NULL END" in Spark DataFrames:

val df = inputDF
  .withColumn("SumAmt", sum("SIAmt").over(
    // missing: the conditional ProcessDate partition column
    Window.partitionBy("PNum", "ReasonCode").orderBy("ProcessDate")))

Input Data:

---------------------------------------
Pnum    ReasonCode  ProcessDate SIAmt
---------------------------------------
1       No          1/01/2016   200
1       No          2/01/2016   300
1       Yes         3/01/2016   -200
1       Yes         4/01/2016   200
---------------------------------------

Expected Output:

---------------------------------------------
Pnum    ReasonCode  ProcessDate SIAmt  SumAmt
---------------------------------------------
1       No          1/01/2016   200     200 
1       No          2/01/2016   300     500
1       Yes         3/01/2016   -200    -200
1       Yes         4/01/2016   200     200
---------------------------------------------

Any suggestion/help on doing this with the Spark DataFrame API instead of a spark-sql query?

You can apply an exact copy of the SQL in API form:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val df = inputDF.withColumn("SumAmt", sum("SIAmt").over(
  Window.partitionBy(
      col("PNum"), col("ReasonCode"),
      // CASE WHEN ReasonCode = 'Yes' THEN ProcessDate ELSE NULL END
      when(col("ReasonCode") === "Yes", col("ProcessDate")).otherwise(null))
    .orderBy("ProcessDate")))

You can add the .rowsBetween(Long.MinValue, 0) part too, which bounds the running sum from the start of each partition up to the current row.
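As a minimal sketch of the full window spec with that frame applied (windowSpec is just an illustrative name; Long.MinValue is how the DataFrame API expresses an unbounded preceding row boundary):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, when}

val windowSpec = Window
  .partitionBy(
    col("PNum"),
    col("ReasonCode"),
    when(col("ReasonCode") === "Yes", col("ProcessDate")).otherwise(null))
  .orderBy("ProcessDate")
  .rowsBetween(Long.MinValue, 0) // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

val df = inputDF.withColumn("SumAmt", sum("SIAmt").over(windowSpec))

which should give you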

+----+----------+-----------+-----+------+
|Pnum|ReasonCode|ProcessDate|SIAmt|SumAmt|
+----+----------+-----------+-----+------+
|   1|       Yes|  4/01/2016|  200|   200|
|   1|        No|  1/01/2016|  200|   200|
|   1|        No|  2/01/2016|  300|   500|
|   1|       Yes|  3/01/2016| -200|  -200|
+----+----------+-----------+-----+------+
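For reference, a self-contained sketch to reproduce this with the sample data from the question, reusing the windowSpec defined above (a running SparkSession named spark is assumed; dates are kept as plain strings for simplicity, which happens to sort correctly for this sample):

import spark.implicits._

// Build the sample input from the question
val inputDF = Seq(
  (1, "No",  "1/01/2016",  200),
  (1, "No",  "2/01/2016",  300),
  (1, "Yes", "3/01/2016", -200),
  (1, "Yes", "4/01/2016",  200)
).toDF("PNum", "ReasonCode", "ProcessDate", "SIAmt")

inputDF
  .withColumn("SumAmt", sum("SIAmt").over(windowSpec))
  .show()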
