Spark Scala: add a new column without looping through the same table multiple times

Question

I want to add a new column called activeIOsAtSite that has a value of 'Y' or 'N'.

The value is assigned according to the following condition:

For each SITE_SITE_ID in the rows of the table below, if the SITE_SITE_ID has a row with:

(APPLICATION_SOURCE == 'CIBASE' AND STANDARD_STATUS =='ACTIVE')

In this case the value should be 'Y', otherwise 'N'.

Is there a way to do that without iterating over the same table again and again (once for each row)? The table I have is very big, and I need to do this as fast as possible.

Example of desired result:

I tried something like this, but I'm not sure it is correct:

finvInventoryAllDf
  .withColumn(
    "activeIOsAtSite",
    activeIOsAtSiteGenerator(finvInventoryAllDf, col("Site_siteId"))
  )

where activeIOsAtSiteGenerator() is a function in which I verify the conditions above:

  def activeIOsAtSiteGenerator(dataFrame: DataFrame, site_siteId: Column): Column = {
    val count = dataFrame
      .where(col("Site_siteId") === site_siteId)
      .where("InstalledOffer_installedOfferId IS NOT NULL AND InstalledOffer_installedOfferId NOT IN ('','null','NULL') AND UPPER(InstalledOffer_standardStatus) IN ('ACTIVE') AND UPPER(InstalledOffer_applicationSource) IN('CIBASE')")
      .count()
    if (count > 0)
      lit("Y")
    else
      lit("N")
  }

Answer 1

You can first groupBy the unique ID, then use collect_set to check whether the column contains the combination you mentioned.

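import org.apache.spark.sql.functions._

// One row per site, holding every distinct (source, status) pair seen for that site
var grouped = df
  .groupBy("SITE_SITE_ID").agg(collect_set(array("APPLICATION_SOURCE", "STANDARD_STATUS")).as("array"))
  // For each pair, true when it contains both 'CIBASE' and 'ACTIVE'
  .withColumn("indicator",
    expr("transform(array, x -> array_contains(x, 'CIBASE') and array_contains(x, 'ACTIVE'))")
  )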

In case order matters (this variant also compares case-insensitively):

.withColumn("indicator",
    expr("transform(array, x -> lower(element_at(x, 1)) = 'cibase' and lower(element_at(x, 2)) = 'active')")
)

What we have so far:

+------------+------------------------------------+-------------+
|SITE_SITE_ID|array                               |indicator    |
+------------+------------------------------------+-------------+
|si_2        |[[SLOW, STASH]]                     |[false]      | <- make this N
|si_3        |[[MEDIUM, TREE]]                    |[false]      | <- make this N
|si_1        |[[FAST, DISABLED], [CIBASE, ACTIVE]]|[false, true]| <- make this Y (pair found)
+------------+------------------------------------+-------------+

Then we move on:

  grouped = grouped.withColumn("indicator",
    when(array_contains(col("indicator"), true), "Y").otherwise("N")
  )
  .drop("array")

+------------+---------+
|SITE_SITE_ID|indicator|
+------------+---------+
|si_2        |N        |
|si_3        |N        |
|si_1        |Y        |
+------------+---------+

collect_set returns an array of arrays, which is why we first check each inner array for the combination and then check whether the resulting array contains at least one true (the combination was found). If it does, we return Y, otherwise N. Finally, we drop the array column.

Sample of grouped:

+------------+---------+
|SITE_SITE_ID|indicator|
+------------+---------+
|si_2        |N        |
|si_3        |N        |
|si_1        |Y        |
+------------+---------+
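The two-step check above (transform, then array_contains on the result) could probably be collapsed into one step with the SQL higher-order function exists, available since Spark 2.4. The following is a minimal sketch under that assumption, not part of the original answer; the object name ExistsSketch, the function siteIndicator and the intermediate column name pairs are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, collect_set, expr, when}

object ExistsSketch {
  // Hypothetical helper: returns one row per SITE_SITE_ID with a Y/N indicator.
  def siteIndicator(df: DataFrame): DataFrame =
    df.groupBy("SITE_SITE_ID")
      // Distinct (source, status) pairs per site, in that fixed order
      .agg(collect_set(array("APPLICATION_SOURCE", "STANDARD_STATUS")).as("pairs"))
      // exists is true as soon as one pair matches; a null or false result falls through to "N"
      .withColumn(
        "indicator",
        when(
          expr("exists(pairs, x -> upper(x[0]) = 'CIBASE' and upper(x[1]) = 'ACTIVE')"),
          "Y"
        ).otherwise("N")
      )
      .drop("pairs")
}
```

The result has the same shape as grouped, so it can be joined back to the main table in the same way.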

Finally, we join our main table with grouped:

df.join(grouped, Seq("SITE_SITE_ID"))

Final result:

+------------+-----+------------------+---------------+---------+
|SITE_SITE_ID|IR_ID|APPLICATION_SOURCE|STANDARD_STATUS|indicator|
+------------+-----+------------------+---------------+---------+
|si_2        |ir2  |SLOW              |STASH          |N        |
|si_3        |ir3  |MEDIUM            |TREE           |N        |
|si_1        |ir1  |FAST              |DISABLED       |Y        |
|si_1        |ir4  |CIBASE            |ACTIVE         |Y        |
+------------+-----+------------------+---------------+---------+
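If the separate groupBy and join are a concern on a very large table, a window aggregate is one possible alternative that attaches the flag to every row directly. This is only a sketch, not the approach used above: it assumes the same four column names as the sample table, and the names WindowFlagSketch and withSiteIndicator are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, upper, when}

object WindowFlagSketch {
  // Hypothetical helper: adds an "indicator" column without an explicit join.
  def withSiteIndicator(df: DataFrame): DataFrame = {
    // True when this row is the CIBASE/ACTIVE combination (null values count as no match)
    val isMatch =
      upper(col("APPLICATION_SOURCE")) === "CIBASE" &&
        upper(col("STANDARD_STATUS")) === "ACTIVE"

    // All rows sharing a SITE_SITE_ID form one window partition
    val bySite = Window.partitionBy("SITE_SITE_ID")

    // max over the partition is 1 if at least one row of the site matched
    df.withColumn(
      "indicator",
      when(max(when(isMatch, 1).otherwise(0)).over(bySite) === 1, "Y").otherwise("N")
    )
  }
}
```

Calling WindowFlagSketch.withSiteIndicator(df) should yield the same columns as the joined result. Whether it is actually faster than groupBy plus join depends on the data and the cluster, so it is worth comparing the two on a sample.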

Good luck!

Answer 2

@vilalabinot's answer is one hundred percent correct, but I had to adapt it, using the same logic with a UDF instead.

We still need to group by SITE_SITE_ID:

val grouped0 = finvInventoryAllDf
  .groupBy("SITE_SITE_ID")
  .agg(
    collect_set(
      array(
        "InstalledOffer_applicationSource",
        "InstalledOffer_standardStatus", "InstalledOffer_installedOfferId"
      )
    ).as("array")
  )

But to create the new field I use a UDF that directly gives me the Y or N value according to the desired conditions:

grouped0
.withColumn("activeIOsAtSite", buildFieldActiveIOsAtSite_UDF(col("array")))
.drop("array")

The declaration of the UDF I'm using:

val buildFieldActiveIOsAtSite_UDF = udf(buildFieldActiveIOsAtSite _)

And finally, the function behind this UDF:

  import scala.collection.mutable
  import scala.util.control.Breaks.{break, breakable}

  def buildFieldActiveIOsAtSite(rows: mutable.WrappedArray[mutable.WrappedArray[String]]): String = {
    var yesOrNoCondition = "N"

    breakable {
      rows.array.foreach(r => {
        val installedOffer_applicationSource = Option(r(0)).getOrElse("")
        val installedOffer_standardStatus = Option(r(1)).getOrElse("")
        val InstalledOffer_installedOfferId = Option(r(2)).getOrElse("")


        val yesCondition = InstalledOffer_installedOfferId.nonEmpty &&
          !InstalledOffer_installedOfferId.equalsIgnoreCase("null") &&
          installedOffer_standardStatus.equalsIgnoreCase("active") &&
          installedOffer_applicationSource.equalsIgnoreCase("CIBASE")
        if (yesCondition) {
          yesOrNoCondition = "Y"
          break
        }
      })

    }
    yesOrNoCondition
  }

Now all that remains is a small join with the main DataFrame:

val finvResultOutput1 = finvInventoryAllDf.join(grouped0, Seq("SITE_SITE_ID"))
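The breakable/foreach loop in the function above can also be expressed with exists, which stops at the first matching element on its own. The sketch below keeps the same three conditions but is an illustration rather than the code I actually ran: the names ActiveIOsSketch and hasActiveCibaseOffer are hypothetical, and it takes Seq[Seq[String]] instead of WrappedArray on the assumption that Spark passes array columns to a Scala UDF as Seq.

```scala
object ActiveIOsSketch {
  // Each inner Seq is expected to hold (applicationSource, standardStatus, installedOfferId).
  def hasActiveCibaseOffer(rows: Seq[Seq[String]]): String = {
    // exists short-circuits, so no explicit break is needed
    val found = rows.exists { r =>
      // Treat null fields as empty strings, as in the original function
      val applicationSource = Option(r(0)).getOrElse("")
      val standardStatus = Option(r(1)).getOrElse("")
      val installedOfferId = Option(r(2)).getOrElse("")

      installedOfferId.nonEmpty &&
        !installedOfferId.equalsIgnoreCase("null") &&
        standardStatus.equalsIgnoreCase("active") &&
        applicationSource.equalsIgnoreCase("CIBASE")
    }
    if (found) "Y" else "N"
  }
}
```

Under the same assumption it could be wrapped with udf(ActiveIOsSketch.hasActiveCibaseOffer _) and applied to col("array") exactly like buildFieldActiveIOsAtSite_UDF.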

Answer 1
2 Accepted 2022-09-01 13:05:36

Answer 2
0 2022-09-02 09:48:59