简体   繁体   中英

How to pass columns as value in UDF in Spark Scala for checking condition

Here is my data frame

uniqueFundamentalSet    PeriodId    SourceId    StatementTypeCode   StatementCurrencyId UpdateReason_updateReasonId UpdateReasonComment UpdateReasonComment_languageId  UpdateReasonEnumerationId   FFAction|!| DataPartition   PartitionYear   TimeStamp
192730230775    297 182 INC 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:11:15+00:00
192730230775    297 180 INC 500186  6   InsertUpdateReason  505074  3019685 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 181 INC 500186  1   UpdateReason2Update 505074  3019680 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 182 INC 500186  6   UpdateReasonToDelete    505074  3019685 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 181 INC 500186  1   UpdateReason2UpdateIsNowUPdated 505074  3019680 I|!|    Japan   2017    2018-05-10T10:08:01+00:00
192730230775    297 181 INC 500186  4   New Reason Added    505074  3019683 I|!|    Japan   2017    2018-05-10T10:08:01+00:00
192730230775    297 180 INC 500186  6   InsertUpdateReason  505074  3019685 I|!|    Japan   2017    2018-05-10T09:57:29+00:00
192730230775    297 181 INC 500186  1   UpdateReason2Update 505074  3019680 I|!|    Japan   2017    2018-05-10T09:57:29+00:00
192730230775    297 182 INC 500186  6   UpdateReasonToDelete    505074  3019685 I|!|    Japan   2017    2018-05-10T09:57:29+00:00
192730230775    308 180 BAL 500186  1   RevisedReasonAdded  505074  3019680 O|!|    Japan   2017    2018-05-10T10:21:50+00:00
192730230775    308 180 BAL 500186  6   UpdateReasonToUpdateRevisedisNowUpdated 505074  3019685 O|!|    Japan   2017    2018-05-10T10:21:50+00:00
192730230775    308 180 BAL 500186  1   RevisedReasonAdded  505074  3019680 O|!|    Japan   2017    2018-05-10T10:27:09+00:00
192730230775    308 180 BAL 500186  6   UpdateReasonToUpdateRevisedisNowUpdated 505074  3019685 O|!|    Japan   2017    2018-05-10T10:27:09+00:00
192730230775    308 179 BAL 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T09:27:11+00:00
192730230775    308 181 BAL 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:27:09+00:00
192730230775    308 180 BAL 500186  1   RevisedReasonAdded  505074  3019680 O|!|    Japan   2017    2018-05-10T10:22:55+00:00
192730230775    308 180 BAL 500186  6   UpdateReasonToUpdateRevisedisNowUpdated 505074  3019685 O|!|    Japan   2017    2018-05-10T10:22:55+00:00
192730230775    308 180 BAL 500186  6   UpdateReasonToUpdateRevised 505074  3019685 I|!|    Japan   2017    2018-05-10T10:17:37+00:00
192730230775    308 181 BAL 500186  6   ReasonToDeleteRevised   505074  3019685 I|!|    Japan   2017    2018-05-10T10:17:37+00:00
192730230775    298 180 BAL 500186  6   UpdateReasonToUpdateRevised 505074  3019685 I|!|    Japan   2017    2018-05-10T10:17:37+00:00
192730230775    298 181 BAL 500186  6   ReasonToDeleteRevised   505074  3019685 I|!|    Japan   2017    2018-05-10T10:17:37+00:00
192730230775    298 180 BAL 500186  1   RevisedReasonAdded  505074  3019680 I|!|    Japan   2017    2018-05-10T10:22:55+00:00
192730230775    298 180 BAL 500186  6   UpdateReasonToUpdateRevisedisNowUpdated 505074  3019685 I|!|    Japan   2017    2018-05-10T10:22:55+00:00
192730230775    298 180 BAL 500186  6   UpdateReasonToUpdateRevised 505074  3019685 I|!|    Japan   2017    2018-05-10T10:16:31+00:00
192730230775    298 181 BAL 500186  6   ReasonToDeleteRevised   505074  3019685 I|!|    Japan   2017    2018-05-10T10:16:31+00:00
192730230775    298 180 BAL 500186  1   RevisedReasonAdded  505074  3019680 I|!|    Japan   2017    2018-05-10T10:21:50+00:00
192730230775    298 180 BAL 500186  6   UpdateReasonToUpdateRevisedisNowUpdated 505074  3019685 I|!|    Japan   2017    2018-05-10T10:21:50+00:00
192730230775    312 181 BAL 500186  null    null    null    null    O|!|    Japan   2018    2018-05-10T09:39:43+00:00
192730230775    310 181 INC 500186  null    null    null    null    D|!|    Japan   9999    2018-05-10T08:21:26+00:00
192730230775    310 182 INC 500186  null    null    null    null    O|!|    Japan   2018    2018-05-10T08:30:53+00:00
192730230775    298 181 BAL 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:22:55+00:00

Here is logic to get by expected output

If "FFAction|!|" === "I|!|" then group by first 6 columns and need to get latest based on Timestamp.

If If "FFAction|!|" === "O|!|" and $"UpdateReason_updateReasonId" === "null" or "FFAction|!|" === "D|!|" then group by first 5 columns and need to get latest based on Timestamp.

If one row "FFAction|!|" === "I|!|" and another "FFAction|!|" === "O|!|" in that case group by first five columns and need to get latest .

Same as If one row "FFAction|!|" === "I|!|" and another "FFAction|!|" === "D|!|" in that case group by first five columns and need to get latest .

Here is my expected output with explained logic .

Logic Example 1:

Lets take example of PeridoId 308 it has total 11 rows . Now one row has PeriodId 308 and SourceId 179 and it is totaly different so it will be in output . 308 and 181 has two rows identical till columns 5 and out of that one has O so we need to group by 5 columns and take latest and latest should be Lastly 308 and 180 has 7 columns similar till row 5 and it does not have UpdateReason_updateReasonId as null in that case group by should be on 6 columns .

And in that way latest will be

192730230775    308 179 BAL 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T09:27:11+00:00
192730230775    308 181 BAL 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:27:09+00:00
192730230775    308 180 BAL 500186  6   UpdateReasonToUpdateRevisedisNowUpdated 505074  3019685 O|!|    Japan   2017    2018-05-10T10:22:55+00:00
192730230775    308 180 BAL 500186  1   RevisedReasonAdded  505074  3019680 O|!|    Japan   2017    2018-05-10T10:27:09+00:00

So this should be the final output for PeriodId 308 .

Logic Example 2 :

Similary PeriodId 297 has 9 columns .

Now it has three combination PeridoId 297 with SourceId 180,181,182 So there will be three rows .In that 297 and 181 has similar 5 columns and SourceId is not null so group by should be on 6 columns . and so we will have two unique records based on latest timestamp . Same way 297 and 180 does not have SourceId null so group by on 6 columns and latest by Timestamp .

and similarly 297 182 has three similar rows but it has SourceId null so group by will be on 5 columns and need to get latest .

So here is the final output for 297

192730230775    297 181 INC 500186  1   UpdateReason2Update 505074  3019680 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 180 INC 500186  6   InsertUpdateReason  505074  3019685 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 181 INC 500186  4   New Reason Added    505074  3019683 I|!|    Japan   2017    2018-05-10T10:08:01+00:00
192730230775    297 182 INC 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:11:15+00:00

Here is my code which does same thing except the last logic

import org.apache.spark.sql.expressions._ import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId")

val windowSpec2 = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "group").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)


def containsActionUdf = udf {

  (array: Seq[String]) => (array.contains("O|!|") || array.contains("D|!|"))
}

val latestForEachKey2 = tempReorder.withColumn("group", when(containsActionUdf(collect_list("FFAction|!|").over(windowSpec)) && ($"UpdateReason_updateReasonId" === "null") , lit("same")).otherwise($"UpdateReason_updateReasonId"))
  .withColumn("rank", row_number().over(windowSpec2))
  .filter($"rank" === 1).drop("rank", "group")

Here is the output that i am getting which has one extra row .

        +--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+-------------------------+
|uniqueFundamentalSet|PeriodId|SourceId|StatementTypeCode|StatementCurrencyId|UpdateReason_updateReasonId|UpdateReasonComment                    |UpdateReasonComment_languageId|UpdateReasonEnumerationId|FFAction|!||DataPartition|PartitionYear|TimeStamp                |
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+-------------------------+
|192730230775        |297     |181     |INC              |500186             |1                          |UpdateReason2UpdateIsNowUPdated        |505074                        |3019680                  |I|!|       |Japan        |2017         |2018-05-10T10:08:01+00:00|
|192730230775        |297     |181     |INC              |500186             |4                          |New Reason Added                       |505074                        |3019683                  |I|!|       |Japan        |2017         |2018-05-10T10:08:01+00:00|
|192730230775        |308     |179     |BAL              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2017         |2018-05-10T09:27:11+00:00|
|192730230775        |298     |181     |BAL              |500186             |6                          |ReasonToDeleteRevised                  |505074                        |3019685                  |I|!|       |Japan        |2017         |2018-05-10T10:17:37+00:00|
|192730230775        |298     |181     |BAL              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2017         |2018-05-10T10:22:55+00:00|
|192730230775        |297     |182     |INC              |500186             |6                          |UpdateReasonToDelete                   |505074                        |3019685                  |I|!|       |Japan        |2017         |2018-05-10T10:00:40+00:00|
|192730230775        |297     |182     |INC              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2017         |2018-05-10T10:11:15+00:00|
|192730230775        |308     |180     |BAL              |500186             |1                          |RevisedReasonAdded                     |505074                        |3019680                  |O|!|       |Japan        |2017         |2018-05-10T10:27:09+00:00|
|192730230775        |308     |180     |BAL              |500186             |6                          |UpdateReasonToUpdateRevisedisNowUpdated|505074                        |3019685                  |O|!|       |Japan        |2017         |2018-05-10T10:27:09+00:00|
|192730230775        |310     |181     |INC              |500186             |null                       |null                                   |null                          |null                     |D|!|       |Japan        |9999         |2018-05-10T08:21:26+00:00|
|192730230775        |308     |181     |BAL              |500186             |6                          |ReasonToDeleteRevised                  |505074                        |3019685                  |I|!|       |Japan        |2017         |2018-05-10T10:17:37+00:00|
|192730230775        |308     |181     |BAL              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2017         |2018-05-10T10:27:09+00:00|
|192730230775        |298     |180     |BAL              |500186             |1                          |RevisedReasonAdded                     |505074                        |3019680                  |I|!|       |Japan        |2017         |2018-05-10T10:22:55+00:00|
|192730230775        |298     |180     |BAL              |500186             |6                          |UpdateReasonToUpdateRevisedisNowUpdated|505074                        |3019685                  |I|!|       |Japan        |2017         |2018-05-10T10:22:55+00:00|
|192730230775        |312     |181     |BAL              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2018         |2018-05-10T09:39:43+00:00|
|192730230775        |310     |182     |INC              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2018         |2018-05-10T08:30:53+00:00|
|192730230775        |297     |180     |INC              |500186             |6                          |InsertUpdateReason                     |505074                        |3019685                  |I|!|       |Japan        |2017         |2018-05-10T10:00:40+00:00|
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+-------------------------+

Like that the Final output should be .. Final Output ..

192730230775    297 181 INC 500186  1   UpdateReason2Update 505074  3019680 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 180 INC 500186  6   InsertUpdateReason  505074  3019685 I|!|    Japan   2017    2018-05-10T10:00:40+00:00
192730230775    297 181 INC 500186  4   New Reason Added    505074  3019683 I|!|    Japan   2017    2018-05-10T10:08:01+00:00
192730230775    297 182 INC 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:11:15+00:00
192730230775    308 179 BAL 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T09:27:11+00:00
192730230775    308 181 BAL 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:27:09+00:00
192730230775    308 180 BAL 500186  6   UpdateReasonToUpdateRevisedisNowUpdated 505074  3019685 O|!|    Japan   2017    2018-05-10T10:22:55+00:00
192730230775    308 180 BAL 500186  1   RevisedReasonAdded  505074  3019680 O|!|    Japan   2017    2018-05-10T10:27:09+00:00
192730230775    298 180 BAL 500186  6   UpdateReasonToUpdateRevised 505074  3019685 I|!|    Japan   2017    2018-05-10T10:16:31+00:00
192730230775    298 180 BAL 500186  1   RevisedReasonAdded  505074  3019680 I|!|    Japan   2017    2018-05-10T10:22:55+00:00
192730230775    298 181 BAL 500186  null    null    null    null    O|!|    Japan   2017    2018-05-10T10:22:55+00:00
192730230775    312 181 BAL 500186  null    null    null    null    O|!|    Japan   2018    2018-05-10T09:39:43+00:00
192730230775    310 181 INC 500186  null    null    null    null    D|!|    Japan   9999    2018-05-10T08:21:26+00:00
192730230775    310 182 INC 500186  null    null    null    null    O|!|    Japan   2018    2018-05-10T08:30:53+00:00

After understanding your logics, it seems that you have been checking the wrong columns in udf function. You should be checking UpdateReason_updateReasonId for nulls as following

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

//window for checking if O|!| is present in the group
val windowSpec = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId")
//window for filtering out the latest after applying the group defined in previous window
val windowSpec2 = Window.partitionBy("uniqueFundamentalSet", "PeriodId", "SourceId", "StatementTypeCode", "StatementCurrencyId", "group").orderBy(unix_timestamp($"TimeStamp", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").desc)

//udf to check if the group has O|!| or not
def containsUdf = udf{(array: Seq[String])=> array.contains("null") || array.contains("NULL") || array.contains(null)}

//applying the window and udf functions and filtering in the latest
val latestForEachKey1 = tempReorder.withColumn("group", when(containsUdf(collect_list("UpdateReason_updateReasonId").over(windowSpec)), lit("same")).otherwise($"UpdateReason_updateReasonId"))
                                    .withColumn("rank", row_number().over(windowSpec2))
                                    .filter($"rank" === 1).drop("rank", "group")
latestForEachKey1.show(false)

which should give you

+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+--------------------------+
|uniqueFundamentalSet|PeriodId|SourceId|StatementTypeCode|StatementCurrencyId|UpdateReason_updateReasonId|UpdateReasonComment                    |UpdateReasonComment_languageId|UpdateReasonEnumerationId|FFAction|!||DataPartition|PartitionYear|TimeStamp                 |
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+--------------------------+
|192730230775        |297     |181     |INC              |500186             |1                          |UpdateReason2UpdateIsNowUPdated        |505074                        |3019680                  |I|!|       |Japan        |2017         |2018-05-10T10:08:01+00:00 |
|192730230775        |297     |181     |INC              |500186             |4                          |New Reason Added                       |505074                        |3019683                  |I|!|       |Japan        |2017         |2018-05-10T10:08:01+00:00 |
|192730230775        |308     |179     |BAL              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2017         |2018-05-10T09:27:11+00:00 |
|192730230775        |298     |181     |BAL              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2017         |2018-05-10T10:22:55+00:00 |
|192730230775        |297     |182     |INC              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2017         |2018-05-10T10:11:15+00:00 |
|192730230775        |308     |180     |BAL              |500186             |1                          |RevisedReasonAdded                     |505074                        |3019680                  |O|!|       |Japan        |2017         |2018-05-10T10:27:09+00:00 |
|192730230775        |308     |180     |BAL              |500186             |6                          |UpdateReasonToUpdateRevisedisNowUpdated|505074                        |3019685                  |O|!|       |Japan        |2017         |2018-05-10T10:27:09+00:000|
|192730230775        |310     |181     |INC              |500186             |null                       |null                                   |null                          |null                     |D|!|       |Japan        |9999         |2018-05-10T08:21:26+00:00 |
|192730230775        |308     |181     |BAL              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2017         |2018-05-10T10:27:09+00:00 |
|192730230775        |298     |180     |BAL              |500186             |1                          |RevisedReasonAdded                     |505074                        |3019680                  |I|!|       |Japan        |2017         |2018-05-10T10:22:55+00:00 |
|192730230775        |298     |180     |BAL              |500186             |6                          |UpdateReasonToUpdateRevisedisNowUpdated|505074                        |3019685                  |I|!|       |Japan        |2017         |2018-05-10T10:21:50+00:000|
|192730230775        |312     |181     |BAL              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2018         |2018-05-10T09:39:43+00:00 |
|192730230775        |310     |182     |INC              |500186             |null                       |null                                   |null                          |null                     |O|!|       |Japan        |2018         |2018-05-10T08:30:53+00:00 |
|192730230775        |297     |180     |INC              |500186             |6                          |InsertUpdateReason                     |505074                        |3019685                  |I|!|       |Japan        |2017         |2018-05-10T10:00:40+00:00 |
+--------------------+--------+--------+-----------------+-------------------+---------------------------+---------------------------------------+------------------------------+-------------------------+-----------+-------------+-------------+--------------------------+

I guess that is the expected result. I hope the answer is helpful

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM