
Spark Scala: Add new column without looping through the same table many times

I want to add a new column called activeIOsAtSite that has a value of 'Y' or 'N'.

The value will be set according to the following condition:

For each SITE_SITE_ID value in the table below: if that SITE_SITE_ID has at least one row where

(APPLICATION_SOURCE == 'CIBASE' AND STANDARD_STATUS =='ACTIVE')

then the value should be 'Y', otherwise 'N'.

Is there a way to do that without iterating over the same table again and again (once per row)? The table I have is very big, and I need to do this as fast as possible.

Example of the desired result:

(image: example of the desired result)

I tried to do something like this, but I'm not sure whether it is correct:

finvInventoryAllDf
  .withColumn(
    "activeIOsAtSite",
    activeIOsAtSiteGenerator(finvInventoryAllDf, col("Site_siteId"))
  )

where activeIOsAtSiteGenerator() is a function in which I check the conditions above:

  def activeIOsAtSiteGenerator(dataFrame: DataFrame, site_siteId: Column): Column = {
    val count = dataFrame
      .where(col("Site_siteId") === site_siteId)
      .where("InstalledOffer_installedOfferId IS NOT NULL AND InstalledOffer_installedOfferId NOT IN ('','null','NULL') AND UPPER(InstalledOffer_standardStatus) IN ('ACTIVE') AND UPPER(InstalledOffer_applicationSource) IN('CIBASE')")
      .count()
    if (count > 0)
      lit("Y")
    else
      lit("N")
  }

You can first groupBy the unique ID, then use collect_set to check whether the column contains any of the combinations you mentioned.

var grouped = df
  .groupBy("SITE_SITE_ID").agg(collect_set(array("APPLICATION_SOURCE", "STANDARD_STATUS")).as("array"))
  .withColumn("indicator",
    expr("transform(array, x -> array_contains(x, 'CIBASE') and array_contains(x, 'ACTIVE'))")
  )

In case the order of the pair matters:

.withColumn("indicator",
    expr("transform(array, x -> lower(element_at(x, 1)) = 'cibase' and lower(element_at(x, 2)) = 'active')")
)

Current form of what we have:

+------------+------------------------------------+-------------+
|SITE_SITE_ID|array                               |indicator    |
+------------+------------------------------------+-------------+
|si_2        |[[SLOW, STASH]]                     |[false]      | <- make this N
|si_3        |[[MEDIUM, TREE]]                    |[false]      | <- make this N
|si_1        |[[FAST, DISABLED], [CIBASE, ACTIVE]]|[false, true]| <- make this Y (pair found)
+------------+------------------------------------+-------------+

Then we move on:

  grouped = grouped.withColumn("indicator",
    when(array_contains(col("indicator"), true), "Y").otherwise("N")
  )
  .drop("array")
+------------+---------+
|SITE_SITE_ID|indicator|
+------------+---------+
|si_2        |N        |
|si_3        |N        |
|si_1        |Y        |
+------------+---------+

collect_set returns an array of arrays; that is why we check each inner array for the combination, and then check again whether there is a single true in the outer array (meaning the combination has been found): if so, return Y, otherwise N. Finally, we drop the array column.

Sample of grouped:

+------------+---------+
|SITE_SITE_ID|indicator|
+------------+---------+
|si_2        |N        |
|si_3        |N        |
|si_1        |Y        |
+------------+---------+

Finally, we join our main table with grouped:

df.join(grouped, Seq("SITE_SITE_ID"))

Final result:

+------------+-----+------------------+---------------+---------+
|SITE_SITE_ID|IR_ID|APPLICATION_SOURCE|STANDARD_STATUS|indicator|
+------------+-----+------------------+---------------+---------+
|si_2        |ir2  |SLOW              |STASH          |N        |
|si_3        |ir3  |MEDIUM            |TREE           |N        |
|si_1        |ir1  |FAST              |DISABLED       |Y        |
|si_1        |ir4  |CIBASE            |ACTIVE         |Y        |
+------------+-----+------------------+---------------+---------+
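As an aside, the same per-site flag can also be computed without a separate groupBy and join, using a window aggregate over SITE_SITE_ID. This is only a sketch, assuming the example column names used in this answer; whether it is faster than the groupBy + join depends on the data, so it is worth benchmarking both:

```scala
// Sketch: window-based alternative (assumes columns SITE_SITE_ID,
// APPLICATION_SOURCE, STANDARD_STATUS as in the examples above).
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, upper, when}

val bySite = Window.partitionBy("SITE_SITE_ID")

val result = df.withColumn(
  "indicator",
  when(
    // 1 if this row is the (CIBASE, ACTIVE) combo, 0 otherwise;
    // max over the site partition is 1 iff any row of the site matches
    max(
      when(upper(col("APPLICATION_SOURCE")) === "CIBASE" &&
           upper(col("STANDARD_STATUS")) === "ACTIVE", 1).otherwise(0)
    ).over(bySite) === 1,
    "Y"
  ).otherwise("N")
)
```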

Good luck!

@vilalabinot's answer is one hundred percent correct, but I had to adapt it, using the same logic with a UDF instead:

So we still need to group by SITE_SITE_ID:

val grouped0 = finvInventoryAllDf
  .groupBy("SITE_SITE_ID")
  .agg(
    collect_set(
      array(
        "InstalledOffer_applicationSource",
        "InstalledOffer_standardStatus", "InstalledOffer_installedOfferId"
      )
    ).as("array")
  ) 

But to create the new field I use a UDF that directly gives me the Y or N value according to the desired conditions:

grouped0
.withColumn("activeIOsAtSite", buildFieldActiveIOsAtSite_UDF(col("array")))
.drop("array") 

The declaration of the UDF I'm using:

val buildFieldActiveIOsAtSite_UDF = udf(buildFieldActiveIOsAtSite _)

And finally, the function behind this UDF:

import scala.collection.mutable
import scala.util.control.Breaks._

def buildFieldActiveIOsAtSite(rows: mutable.WrappedArray[mutable.WrappedArray[String]]): String = {
  var yesOrNoCondition = "N"

  breakable {
    rows.foreach(r => {
      // guard against nulls coming out of the collected arrays
      val installedOffer_applicationSource = Option(r(0)).getOrElse("")
      val installedOffer_standardStatus = Option(r(1)).getOrElse("")
      val installedOffer_installedOfferId = Option(r(2)).getOrElse("")

      // 'Y' if the offer id is a real value and the (CIBASE, ACTIVE) pair matches
      val yesCondition = installedOffer_installedOfferId.nonEmpty &&
        !installedOffer_installedOfferId.equalsIgnoreCase("null") &&
        installedOffer_standardStatus.equalsIgnoreCase("active") &&
        installedOffer_applicationSource.equalsIgnoreCase("cibase")
      if (yesCondition) {
        yesOrNoCondition = "Y"
        break
      }
    })
  }
  yesOrNoCondition
}
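The breakable/break loop above can also be written without mutation using exists, which stops at the first match anyway. A pure-Scala sketch of the same predicate (using Seq instead of WrappedArray, so it can be tried outside Spark; the name activeFlag is illustrative):

```scala
// Same semantics as buildFieldActiveIOsAtSite, expressed with exists.
// rows: collected [applicationSource, standardStatus, installedOfferId] triples.
def activeFlag(rows: Seq[Seq[String]]): String = {
  val found = rows.exists { r =>
    val source  = Option(r(0)).getOrElse("")
    val status  = Option(r(1)).getOrElse("")
    val offerId = Option(r(2)).getOrElse("")
    offerId.nonEmpty &&
      !offerId.equalsIgnoreCase("null") &&
      status.equalsIgnoreCase("active") &&
      source.equalsIgnoreCase("cibase")
  }
  if (found) "Y" else "N"
}
```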

Now all that remains is a small join with the main dataframe:

val finvResultOutput1 = finvInventoryAllDf.join(grouped0, Seq("SITE_SITE_ID"))
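One caveat worth hedging: if SITE_SITE_ID can be null on some rows, those rows are dropped by the inner join, because null join keys never match. A defensive variant (a sketch under that assumption) uses a left join plus a default:

```scala
// Keeps rows whose SITE_SITE_ID is null (or otherwise unmatched),
// defaulting their flag to "N".
val finvResultOutput2 = finvInventoryAllDf
  .join(grouped0, Seq("SITE_SITE_ID"), "left")
  .na.fill("N", Seq("activeIOsAtSite"))
```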

