
Spark Scala - add new column to dataframe/dataset by conditionally checking <N> number of other columns

Here is a scenario from legacy code that needs to be converted to Spark Scala. Any pointers will be highly appreciated.

Scenario: I need to add a new column to a dataframe/dataset with "withColumn", deriving the new column's value by conditionally checking the values of 20-22 other columns. Any pointers on how to implement this in Spark Scala? Thanks much. I tried a UDF that takes a map of the 22 columns as key:value pairs and does if/else checks with mutable variables, but experts on this forum told me that is not recommended, so I am seeking guidance on the right way to achieve this.

Or is using dataset.mapPartitions with mutable variables inside that function the right way to do it?

val calculate = dataset.mapPartitions(partition => partition.map { x =>
  // set the values of the mutable variables value1 and value2 based on the column values
  var value1 = "NA"
  var value2 = "NA"
  if (x.fieldA == "xyx") {
    value1 = "ABC"
    value2 = "cbz"
  } else if (x.fieldA == "112" && x.fieldB == "xy1") {
    value1 = "zya"
    value2 = "ab"
  }
  Df(x.fldC, x.fldB, value1, value2)
})

case class Df(fldC: String, fldB: String, value1: String, value2: String)
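As a side note on the mutable variables: `if`/`else` is an expression in Scala, so the branching above can also be written without `var`s. A minimal sketch, using illustrative input/output types based on the field names in the pseudocode:

```scala
// Illustrative types; the real field names and types come from your dataset's schema.
case class In(fieldA: String, fieldB: String, fldB: String, fldC: String)
case class Out(fldC: String, fldB: String, value1: String, value2: String)

def derive(x: In): Out = {
  // The whole if/else chain evaluates to a (value1, value2) tuple -- no vars needed.
  val (value1, value2) =
    if (x.fieldA == "xyx") ("ABC", "cbz")
    else if (x.fieldA == "112" && x.fieldB == "xy1") ("zya", "ab")
    else ("NA", "NA")
  Out(x.fldC, x.fldB, value1, value2)
}

// On a typed Dataset this would be: dataset.map(derive)
// (mapPartitions adds nothing here, since the logic is purely per-row).
```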

Can you please let me know what other details I should provide, now that I have updated the question above?

I am new to this distributed/Spark Scala development, so I might be asking basic questions.

Variant 1

import spark.implicits._
import org.apache.spark.sql.functions._

val sourceDF = Seq(
  (1,2,3,4,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
  (11,12,13,14,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
  (1,2,3,4,25,26,27,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
  (1,2,3,4,5,6,7,38,39,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
  (1,2,3,4,5,6,7,8,9,410, "11", "12", "13", "14", "15", "16", "17", "18", "19", true)
).toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10",
  "col11", "col12", "col13", "col14", "col15", "col16", "col17", "col18", "col19", "col20")
  
val resDF = sourceDF
  .withColumn("col_result",
    when(
        'col1.equalTo(1) && 'col2.equalTo(2) && 'col3.equalTo(3) &&
        'col4.equalTo(4) && 'col5.equalTo(5) && 'col6.equalTo(6) &&
        'col7.equalTo(7) && 'col8.equalTo(8) && 'col9.equalTo(9) &&
        'col10.equalTo(10) && 'col11.equalTo("11") && 'col12.equalTo("12") &&
        'col13.equalTo("13") && 'col14.equalTo("14") && 'col15.equalTo("15") &&
        'col16.equalTo("16") && 'col17.equalTo("17") && 'col18.equalTo("18") &&
        'col19.equalTo("19") && 'col20.equalTo(true),"result").otherwise(null))
    
resDF.show(false)
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
//  |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result|
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |result    |
//  |11  |12  |13  |14  |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|null      |
//  |1   |2   |3   |4   |25  |26  |27  |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |null      |
//  |1   |2   |3   |4   |5   |6   |7   |38  |39  |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|null      |
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |410  |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |null      |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
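If all 20+ checks are simple equality tests as above, the long condition does not have to be written by hand: the per-column checks can be folded into a single `Column`. A minimal sketch, assuming a hypothetical `expected` map of column name to expected value (the column names and values here are illustrative):

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[1]").appName("cond").getOrCreate()
import spark.implicits._

val df = Seq((1, 2, "x"), (1, 9, "x")).toDF("col1", "col2", "col3")

// Hypothetical expected values per column -- replace with the real rules.
val expected: Map[String, Any] = Map("col1" -> 1, "col2" -> 2, "col3" -> "x")

// Fold the per-column equality checks into one Column expression.
val allMatch: Column = expected
  .map { case (name, value) => col(name) === lit(value) }
  .reduce(_ && _)

// when(...) without otherwise(...) yields null for non-matching rows.
val res = df.withColumn("col_result", when(allMatch, "result"))
```

This scales to 22 columns (or any number) without changing the code, only the map.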

Variant 2

val res1DF = sourceDF.withColumn("col_result_1",
  when(col("col8") === 38 || col("col20") === false, "good check" )
  .when(col("col10") === 410 && col("col17") === "17" && col("col20") === true, "next good check")
    .otherwise("we use when  many many")
)

res1DF.show(false)
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
//  |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result_1          |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |we use when  many many|
//  |11  |12  |13  |14  |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|good check            |
//  |1   |2   |3   |4   |25  |26  |27  |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |we use when  many many|
//  |1   |2   |3   |4   |5   |6   |7   |38  |39  |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|good check            |
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |410  |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |next good check       |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
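When the rules are not plain equality checks, the chained `when` calls of Variant 2 can themselves be generated from a list of (condition, result) pairs, where the first matching rule wins just as in a hand-written chain. A sketch, assuming hypothetical rules over a toy dataframe:

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[1]").appName("rules").getOrCreate()

val df = spark.range(1, 4).toDF("id") // rows with id 1, 2, 3

// Hypothetical (condition, result) pairs; first match wins.
val rules: Seq[(Column, String)] = Seq(
  (col("id") === 1, "one"),
  (col("id") === 2, "two")
)

// Fold the rules into when(...).when(...).otherwise(...).
val result: Column = rules.tail
  .foldLeft(when(rules.head._1, rules.head._2)) {
    case (acc, (cond, value)) => acc.when(cond, value)
  }
  .otherwise("other")

val res = df.withColumn("label", result)
```

This keeps the 20-odd business rules in one data structure instead of a wall of nested `when`s.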
