[英]How to pass columns as value in UDF in Spark Scala for checking condition
Here is my data frame 这是我的数据框
uniqueFundamentalSet PeriodId SourceId StatementTypeCode StatementCurrencyId UpdateReason_updateReasonId UpdateReasonComment UpdateReasonComment_languageId UpdateReasonEnumerationId FFAction|!| DataPartition PartitionYear TimeStamp
192730230775 297 182 INC 500186 null null null null O|!| Japan 2017 2018-05-10T10:11:15+00:00
192730230775 297 180 INC 500186 6 InsertUpdateReason 505074 3019685 I|!| Japan 2017 2018-05-10T10:00:40+00:00
192730230775 297 181 INC 500186 1 UpdateReason2Update 505074 3019680 I|!| Japan 2017 2018-05-10T10:00:40+00:00
192730230775 297 182 INC 500186 6 UpdateReasonToDelete 505074 3019685 I|!| Japan 2017 2018-05-10T10:00:40+00:00
192730230775 297 181 INC 500186 1 UpdateReason2UpdateIsNowUPdated 505074 3019680 I|!| Japan 2017 2018-05-10T10:08:01+00:00
192730230775 297 181 INC 500186 4 New Reason Added 505074 3019683 I|!| Japan 2017 2018-05-10T10:08:01+00:00
192730230775 297 180 INC 500186 6 InsertUpdateReason 505074 3019685 I|!| Japan 2017 2018-05-10T09:57:29+00:00
192730230775 297 181 INC 500186 1 UpdateReason2Update 505074 3019680 I|!| Japan 2017 2018-05-10T09:57:29+00:00
192730230775 297 182 INC 500186 6 UpdateReasonToDelete 505074 3019685 I|!| Japan 2017 2018-05-10T09:57:29+00:00
192730230775 308 180 BAL 500186 1 RevisedReasonAdded 505074 3019680 O|!| Japan 2017 2018-05-10T10:21:50+00:00
192730230775 308 180 BAL 500186 6 UpdateReasonToUpdateRevisedisNowUpdated 505074 3019685 O|!| Japan 2017 2018-05-10T10:21:50+00:00
192730230775 308 180 BAL 500186 1 RevisedReasonAdded 505074 3019680 O|!| Japan 2017 2018-05-10T10:27:09+00:00
192730230775 308 180 BAL 500186 6 UpdateReasonToUpdateRevisedisNowUpdated 505074 3019685 O|!| Japan 2017 2018-05-10T10:27:09+00:00
192730230775 308 179 BAL 500186 null null null null O|!| Japan 2017 2018-05-10T09:27:11+00:00
192730230775 308 181 BAL 500186 null null null null O|!| Japan 2017 2018-05-10T10:27:09+00:00
192730230775 308 180 BAL 500186 1 RevisedReasonAdded 505074 3019680 O|!| Japan 2017 2018-05-10T10:22:55+00:00
192730230775 308 180 BAL 500186 6 UpdateReasonToUpdateRevisedisNowUpdated 505074 3019685 O|!| Japan 2017 2018-05-10T10:22:55+00:00
192730230775 308 180 BAL 500186 6 UpdateReasonToUpdateRevised 505074 3019685 I|!| Japan 2017 2018-05-10T10:17:37+00:00
192730230775 308 181 BAL 500186 6 ReasonToDeleteRevised 505074 3019685 I|!| Japan 2017 2018-05-10T10:17:37+00:00
192730230775 298 180 BAL 500186 6 UpdateReasonToUpdateRevised 505074 3019685 I|!| Japan 2017 2018-05-10T10:17:37+00:00
192730230775 298 181 BAL 500186 6 ReasonToDeleteRevised 505074 3019685 I|!| Japan 2017 2018-05-10T10:17:37+00:00
192730230775 298 180 BAL 500186 1 RevisedReasonAdded 505074 3019680 I|!| Japan 2017 2018-05-10T10:22:55+00:00
192730230775 298 180 BAL 500186 6 UpdateReasonToUpdateRevisedisNowUpdated 505074 3019685 I|!| Japan 2017 2018-05-10T10:22:55+00:00
192730230775 298 180 BAL 500186 6 UpdateReasonToUpdateRevised 505074 3019685 I|!| Japan 2017 2018-05-10T10:16:31+00:00
192730230775 298 181 BAL 500186 6 ReasonToDeleteRevised 505074 3019685 I|!| Japan 2017 2018-05-10T10:16:31+00:00
192730230775 298 180 BAL 500186 1 RevisedReasonAdded 505074 3019680 I|!| Japan 2017 2018-05-10T10:21:50+00:00
192730230775 298 180 BAL 500186 6 UpdateReasonToUpdateRevisedisNowUpdated 505074 3019685 I|!| Japan 2017 2018-05-10T10:21:50+00:00
192730230775 312 181 BAL 500186 null null null null O|!| Japan 2018 2018-05-10T09:39:43+00:00
192730230775 310 181 INC 500186 null null null null D|!| Japan 9999 2018-05-10T08:21:26+00:00
192730230775 310 182 INC 500186 null null null null O|!| Japan 2018 2018-05-10T08:30:53+00:00
192730230775 298 181 BAL 500186 null null null null O|!| Japan 2017 2018-05-10T10:22:55+00:00
Here is logic to get by expected output 这是通过预期输出获得的逻辑
If "FFAction|!|"
如果为“ FFAction |!|” === "I|!|"
===“ I |!|” then group by first 6 columns and need to get latest based on Timestamp.
然后按前6列分组,并需要根据时间戳获取最新信息。
If If "FFAction|!|"
如果如果“ FFAction |!|” === "O|!|"
===“ O |!|” and $"UpdateReason_updateReasonId" === "null" or "FFAction|!|"
和$“ UpdateReason_updateReasonId” ===“ null”或“ FFAction |!|” === "D|!|"
===“ D |!|” then group by first 5 columns and need to get latest based on Timestamp.
然后按前5列分组,并需要根据时间戳获取最新信息。
If one row "FFAction|!|"
如果一行为“ FFAction |!|” === "I|!|"
===“ I |!|” and another "FFAction|!|"
和另一个“ FFAction |!|” === "O|!|"
===“ O |!|” in that case group by first five columns and need to get latest .
在这种情况下,请按前五列分组,并需要获取最新信息。
Same as If one row "FFAction|!|"
与如果一行“ FFAction |!|”相同 === "I|!|"
===“ I |!|” and another "FFAction|!|"
和另一个“ FFAction |!|” === "D|!|"
===“ D |!|” in that case group by first five columns and need to get latest .
在这种情况下,请按前五列分组,并需要获取最新信息。
Here is my expected output with explained logic . 这是我的预期输出,其中包含解释的逻辑。
Logic Example 1:
Lets take example of PeridoId 308 it has total 11 rows . 让我们以PeridoId 308为例,它总共有11行。 Now one row has PeriodId 308 and SourceId 179 and it is totaly different so it will be in output .
现在,一行具有PeriodId 308和SourceId 179,并且完全不同,因此将在输出中。 308 and 181 has two rows identical till columns 5 and out of that one has O so we need to group by 5 columns and take latest and latest should be Lastly 308 and 180 has 7 columns similar till row 5 and it does not have UpdateReason_updateReasonId as null in that case group by should be on 6 columns .
308和181在第5列之前有两行相同,而其中的第一个具有O,因此我们需要按5列进行分组,并且最新和最晚应该是5。最后308和181在第5行之前有7列相似,并且没有UpdateReason_updateReasonId作为在这种情况下,null group by应该在6列上。
And in that way latest will be 这样一来,最新的
192730230775 308 179 BAL 500186 null null null null O|!| Japan 2017 2018-05-10T09:27:11+00:00
192730230775 308 181 BAL 500186 null null null null O|!| Japan 2017 2018-05-10T10:27:09+00:00
192730230775 308 180 BAL 500186 6 UpdateReasonToUpdateRevisedisNowUpdated 505074 3019685 O|!| Japan 2017 2018-05-10T10:22:55+00:00
192730230775 308 180 BAL 500186 1 RevisedReasonAdded 505074 3019680 O|!| Japan 2017 2018-05-10T10:27:09+00:00
So this should be the final output for PeriodId 308 . 因此,这应该是PeriodId 308的最终输出。
Logic Example 2 :
Similary PeriodId 297 has 9 columns . 相似的PeriodId 297有9列。
Now it has three combination PeridoId 297 with SourceId 180,181,182 So there will be three rows .In that 297 and 181 has similar 5 columns and SourceId is not null so group by should be on 6 columns . 现在它具有PeridoId 297和SourceId 180,181,182的三个组合,因此将有三行。因为297和181具有相似的5列,并且SourceId不为null,所以group by应该在6列上。 and so we will have two unique records based on latest timestamp .
因此,我们将基于最新的时间戳有两个唯一的记录。 Same way 297 and 180 does not have SourceId null so group by on 6 columns and latest by Timestamp .
同样的方法297和180没有SourceId null,因此按6列分组,按Timestamp最新。
and similarly 297 182 has three similar rows but it has SourceId null so group by will be on 5 columns and need to get latest . 同样297 182具有3个相似的行,但SourceId为null,因此group by将位于5列上,并且需要获取最新。
So here is the final output for 297 所以这是297的最终输出
192730230775 297 181 INC 500186 1 UpdateReason2Update 505074 3019680 I|!| Japan 2017 2018-05-10T10:00:40+00:00
192730230775 297 180 INC 500186 6 InsertUpdateReason 505074 3019685 I|!| Japan 2017 2018-05-10T10:00:40+00:00
192730230775 297 181 INC 500186 4 New Reason Added 505074 3019683 I|!| Japan 2017 2018-05-10T10:08:01+00:00
192730230775 297 182 INC 500186 null null null null O|!| Japan 2017 2018-05-10T10:11:15+00:00
Here is my code which does same thing except the last logic 这是我的代码,除了最后一个逻辑外,它执行相同的操作
import org.apache.spark.sql.expressions._ import org.apache.spark.sql.functions._ 导入org.apache.spark.sql.expressions._导入org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId")
val windowSpec2 = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "group").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)
def containsActionUdf = udf {
(array: Seq[String]) => (array.contains("O|!|") || array.contains("D|!|"))
}
val latestForEachKey2 = tempReorder.withColumn("group", when(containsActionUdf(collect_list("FFAction|!|").over(windowSpec)) && ($"UpdateReason_updateReasonId" === "null") , lit("same")).otherwise($"UpdateReason_updateReasonId"))
.withColumn("rank", row_number().over(windowSpec2))
.filter($"rank" === 1).drop("rank", "group")
Here is the output that i am getting which has one extra row . 这是我得到的输出,其中多了一行。
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+-------------------------+
|uniqueFundamentalSet|PeriodId|SourceId|StatementTypeCode|StatementCurrencyId|UpdateReason_updateReasonId|UpdateReasonComment |UpdateReasonComment_languageId|UpdateReasonEnumerationId|FFAction|!||DataPartition|PartitionYear|TimeStamp |
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+-------------------------+
|192730230775 |297 |181 |INC |500186 |1 |UpdateReason2UpdateIsNowUPdated |505074 |3019680 |I|!| |Japan |2017 |2018-05-10T10:08:01+00:00|
|192730230775 |297 |181 |INC |500186 |4 |New Reason Added |505074 |3019683 |I|!| |Japan |2017 |2018-05-10T10:08:01+00:00|
|192730230775 |308 |179 |BAL |500186 |null |null |null |null |O|!| |Japan |2017 |2018-05-10T09:27:11+00:00|
|192730230775 |298 |181 |BAL |500186 |6 |ReasonToDeleteRevised |505074 |3019685 |I|!| |Japan |2017 |2018-05-10T10:17:37+00:00|
|192730230775 |298 |181 |BAL |500186 |null |null |null |null |O|!| |Japan |2017 |2018-05-10T10:22:55+00:00|
|192730230775 |297 |182 |INC |500186 |6 |UpdateReasonToDelete |505074 |3019685 |I|!| |Japan |2017 |2018-05-10T10:00:40+00:00|
|192730230775 |297 |182 |INC |500186 |null |null |null |null |O|!| |Japan |2017 |2018-05-10T10:11:15+00:00|
|192730230775 |308 |180 |BAL |500186 |1 |RevisedReasonAdded |505074 |3019680 |O|!| |Japan |2017 |2018-05-10T10:27:09+00:00|
|192730230775 |308 |180 |BAL |500186 |6 |UpdateReasonToUpdateRevisedisNowUpdated|505074 |3019685 |O|!| |Japan |2017 |2018-05-10T10:27:09+00:00|
|192730230775 |310 |181 |INC |500186 |null |null |null |null |D|!| |Japan |9999 |2018-05-10T08:21:26+00:00|
|192730230775 |308 |181 |BAL |500186 |6 |ReasonToDeleteRevised |505074 |3019685 |I|!| |Japan |2017 |2018-05-10T10:17:37+00:00|
|192730230775 |308 |181 |BAL |500186 |null |null |null |null |O|!| |Japan |2017 |2018-05-10T10:27:09+00:00|
|192730230775 |298 |180 |BAL |500186 |1 |RevisedReasonAdded |505074 |3019680 |I|!| |Japan |2017 |2018-05-10T10:22:55+00:00|
|192730230775 |298 |180 |BAL |500186 |6 |UpdateReasonToUpdateRevisedisNowUpdated|505074 |3019685 |I|!| |Japan |2017 |2018-05-10T10:22:55+00:00|
|192730230775 |312 |181 |BAL |500186 |null |null |null |null |O|!| |Japan |2018 |2018-05-10T09:39:43+00:00|
|192730230775 |310 |182 |INC |500186 |null |null |null |null |O|!| |Japan |2018 |2018-05-10T08:30:53+00:00|
|192730230775 |297 |180 |INC |500186 |6 |InsertUpdateReason |505074 |3019685 |I|!| |Japan |2017 |2018-05-10T10:00:40+00:00|
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+-------------------------+
Like that the Final output should be .. Final Output .. 这样,最终输出应该是.. 最终输出..
192730230775 297 181 INC 500186 1 UpdateReason2Update 505074 3019680 I|!| Japan 2017 2018-05-10T10:00:40+00:00
192730230775 297 180 INC 500186 6 InsertUpdateReason 505074 3019685 I|!| Japan 2017 2018-05-10T10:00:40+00:00
192730230775 297 181 INC 500186 4 New Reason Added 505074 3019683 I|!| Japan 2017 2018-05-10T10:08:01+00:00
192730230775 297 182 INC 500186 null null null null O|!| Japan 2017 2018-05-10T10:11:15+00:00
192730230775 308 179 BAL 500186 null null null null O|!| Japan 2017 2018-05-10T09:27:11+00:00
192730230775 308 181 BAL 500186 null null null null O|!| Japan 2017 2018-05-10T10:27:09+00:00
192730230775 308 180 BAL 500186 6 UpdateReasonToUpdateRevisedisNowUpdated 505074 3019685 O|!| Japan 2017 2018-05-10T10:22:55+00:00
192730230775 308 180 BAL 500186 1 RevisedReasonAdded 505074 3019680 O|!| Japan 2017 2018-05-10T10:27:09+00:00
192730230775 298 180 BAL 500186 6 UpdateReasonToUpdateRevised 505074 3019685 I|!| Japan 2017 2018-05-10T10:16:31+00:00
192730230775 298 180 BAL 500186 1 RevisedReasonAdded 505074 3019680 I|!| Japan 2017 2018-05-10T10:22:55+00:00
192730230775 298 181 BAL 500186 null null null null O|!| Japan 2017 2018-05-10T10:22:55+00:00
192730230775 312 181 BAL 500186 null null null null O|!| Japan 2018 2018-05-10T09:39:43+00:00
192730230775 310 181 INC 500186 null null null null D|!| Japan 9999 2018-05-10T08:21:26+00:00
192730230775 310 182 INC 500186 null null null null O|!| Japan 2018 2018-05-10T08:30:53+00:00
After understanding your logics, it seems that you have been checking the wrong columns in udf
function. 了解了逻辑之后, 似乎您正在检查
udf
函数中的错误列 。 You should be checking UpdateReason_updateReasonId
for nulls as following 您应该按如下所示检查
UpdateReason_updateReasonId
是否为null
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
//window for checking if O|!| is present in the group
val windowSpec = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId")
//window for filtering out the latest after applying the group defined in previous window
val windowSpec2 = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "group").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)
//udf to check if the group has O|!| or not
def containsUdf = udf{(array: Seq[String])=> array.contains("null") || array.contains("NULL") || array.contains(null)}
//applying the window and udf functions and filtering in the latest
val latestForEachKey1 = tempReorder.withColumn("group", when(containsUdf(collect_list("UpdateReason_updateReasonId").over(windowSpec)), lit("same")).otherwise($"UpdateReason_updateReasonId"))
.withColumn("rank", row_number().over(windowSpec2))
.filter($"rank" === 1).drop("rank", "group")
latestForEachKey1.show(false)
which should give you 这应该给你
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+--------------------------+
|uniqueFundamentalSet|PeriodId|SourceId|StatementTypeCode|StatementCurrencyId|UpdateReason_updateReasonId|UpdateReasonComment |UpdateReasonComment_languageId|UpdateReasonEnumerationId|FFAction|!||DataPartition|PartitionYear|TimeStamp |
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+--------------------------+
|192730230775 |297 |181 |INC |500186 |1 |UpdateReason2UpdateIsNowUPdated |505074 |3019680 |I|!| |Japan |2017 |2018-05-10T10:08:01+00:00 |
|192730230775 |297 |181 |INC |500186 |4 |New Reason Added |505074 |3019683 |I|!| |Japan |2017 |2018-05-10T10:08:01+00:00 |
|192730230775 |308 |179 |BAL |500186 |null |null |null |null |O|!| |Japan |2017 |2018-05-10T09:27:11+00:00 |
|192730230775 |298 |181 |BAL |500186 |null |null |null |null |O|!| |Japan |2017 |2018-05-10T10:22:55+00:00 |
|192730230775 |297 |182 |INC |500186 |null |null |null |null |O|!| |Japan |2017 |2018-05-10T10:11:15+00:00 |
|192730230775 |308 |180 |BAL |500186 |1 |RevisedReasonAdded |505074 |3019680 |O|!| |Japan |2017 |2018-05-10T10:27:09+00:00 |
|192730230775 |308 |180 |BAL |500186 |6 |UpdateReasonToUpdateRevisedisNowUpdated|505074 |3019685 |O|!| |Japan |2017 |2018-05-10T10:27:09+00:000|
|192730230775 |310 |181 |INC |500186 |null |null |null |null |D|!| |Japan |9999 |2018-05-10T08:21:26+00:00 |
|192730230775 |308 |181 |BAL |500186 |null |null |null |null |O|!| |Japan |2017 |2018-05-10T10:27:09+00:00 |
|192730230775 |298 |180 |BAL |500186 |1 |RevisedReasonAdded |505074 |3019680 |I|!| |Japan |2017 |2018-05-10T10:22:55+00:00 |
|192730230775 |298 |180 |BAL |500186 |6 |UpdateReasonToUpdateRevisedisNowUpdated|505074 |3019685 |I|!| |Japan |2017 |2018-05-10T10:21:50+00:000|
|192730230775 |312 |181 |BAL |500186 |null |null |null |null |O|!| |Japan |2018 |2018-05-10T09:39:43+00:00 |
|192730230775 |310 |182 |INC |500186 |null |null |null |null |O|!| |Japan |2018 |2018-05-10T08:30:53+00:00 |
|192730230775 |297 |180 |INC |500186 |6 |InsertUpdateReason |505074 |3019685 |I|!| |Japan |2017 |2018-05-10T10:00:40+00:00 |
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+--------------------------+
I guess that is the expected result. 我想这是预期的结果。 I hope the answer is helpful
我希望答案是有帮助的
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.