
Spark Scala - add a new column to a dataframe/dataset by conditionally checking N other columns

Here is a scenario from legacy code that needs to be converted to Spark Scala. Any pointers will be highly appreciated.

Scenario: I need to add a new column to a dataframe/dataset using "withColumn", deriving its value by conditionally checking the values of 20 to 22 other columns. I tried a UDF that takes a map of the 22 columns as key:value pairs and runs if/else checks with mutable variables, but experts on this forum told me that is not recommended. What is the right way to implement this in Spark Scala? Thanks much.

Or is using dataset.mapPartitions, with the mutable variables inside that function, the right way to do it?

// Output row type, declared before it is used.
case class df(fldc: String, fldb: String, value1: String, value2: String)

// Needs import spark.implicits._ in scope so that an Encoder for df is available.
val calculate = dataset.mapPartitions(partition => partition.map { x =>
  var value1 = "NA"
  var value2 = "NA"

  // Set the mutable variables value1 and value2 based on the column values.
  if (x.fieldA == "xyx") {
    value1 = "ABC"
    value2 = "cbz"
  } else if (x.fieldA == "112" && x.fieldB == "xy1") {
    value1 = "zya"
    value2 = "ab"
  }

  df(x.fldC, x.fldB, value1, value2)
})

I have updated the question above. Please let me know what other details I should provide.

I am new to distributed/Spark Scala development, so I might be asking basic questions.
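
If the typed dataset route is kept, one possible way to drop the mutable variables is to put the branching into a small pure function that uses pattern matching. The sketch below is plain Scala with no Spark dependency. It assumes the two derived values depend only on fieldA and fieldB, as in the snippet above; deriveValues is a hypothetical name, not something from the original code.

```scala
// Hypothetical helper: returns (value1, value2) for one row.
// Branch values are copied from the question; ("NA", "NA") is the default.
def deriveValues(fieldA: String, fieldB: String): (String, String) =
  (fieldA, fieldB) match {
    case ("xyx", _)     => ("ABC", "cbz") // fieldA alone decides this branch
    case ("112", "xy1") => ("zya", "ab")  // both fields must match
    case _              => ("NA", "NA")   // nothing matched
  }
```

Inside the map over the dataset, the two vars could then become a single immutable binding such as val (value1, value2) = deriveValues(x.fieldA, x.fieldB).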

import spark.implicits._

import org.apache.spark.sql.functions._

val sourceDF = Seq(
  (1,2,3,4,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
  (11,12,13,14,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
  (1,2,3,4,25,26,27,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
  (1,2,3,4,5,6,7,38,39,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
  (1,2,3,4,5,6,7,8,9,410, "11", "12", "13", "14", "15", "16", "17", "18", "19", true)
).toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10",
  "col11", "col12", "col13", "col14", "col15", "col16", "col17", "col18", "col19", "col20")
  
val resDF = sourceDF
  .withColumn("col_result",
    when(
        'col1.equalTo(1) && 'col2.equalTo(2) && 'col3.equalTo(3) &&
        'col4.equalTo(4) && 'col5.equalTo(5) && 'col6.equalTo(6) &&
        'col7.equalTo(7) && 'col8.equalTo(8) && 'col9.equalTo(9) &&
        'col10.equalTo(10) && 'col11.equalTo("11") && 'col12.equalTo("12") &&
        'col13.equalTo("13") && 'col14.equalTo("14") && 'col15.equalTo("15") &&
        'col16.equalTo("16") && 'col17.equalTo("17") && 'col18.equalTo("18") &&
        'col19.equalTo("19") && 'col20.equalTo(true),"result").otherwise(null))
    
resDF.show(false)
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
//  |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result|
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |result    |
//  |11  |12  |13  |14  |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|null      |
//  |1   |2   |3   |4   |25  |26  |27  |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |null      |
//  |1   |2   |3   |4   |5   |6   |7   |38  |39  |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|null      |
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |410  |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |null      |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
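
When every check is a plain equality, the long chain of && can also be assembled from a list of (column name, expected value) pairs. The following is a minimal sketch under that assumption, not the only way to do it: smallDF is a cut-down, hypothetical stand-in for sourceDF with three columns, and the local SparkSession is created only so that the snippet runs on its own (in spark-shell, spark already exists).

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("conditions-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical three-column stand-in for sourceDF.
val smallDF = Seq(
  (1, "11", true),
  (1, "12", true),
  (2, "11", false)
).toDF("col1", "col11", "col20")

// Expected value for each column that takes part in the check.
val expected: Seq[(String, Any)] = Seq("col1" -> 1, "col11" -> "11", "col20" -> true)

// One equality test per column, combined with a logical AND.
val allMatch: Column = expected
  .map { case (name, value) => col(name) === value }
  .reduce(_ && _)

// when without otherwise yields null for rows that do not match.
val checkedDF = smallDF.withColumn("col_result", when(allMatch, "result"))
```

Adding or removing a column from the check then means editing the expected list rather than the expression itself.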

Variant 2

val res1DF = sourceDF.withColumn("col_result_1",
  when(col("col8") === 38 || col("col20") === false, "good check" )
  .when(col("col10") === 410 && col("col17") === "17" && col("col20") === true, "next good check")
    .otherwise("we use when  many many")
)

res1DF.show(false)
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
//  |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result_1          |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |we use when  many many|
//  |11  |12  |13  |14  |5   |6   |7   |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|good check            |
//  |1   |2   |3   |4   |25  |26  |27  |8   |9   |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |we use when  many many|
//  |1   |2   |3   |4   |5   |6   |7   |38  |39  |10   |11   |12   |13   |14   |15   |16   |17   |18   |19   |false|good check            |
//  |1   |2   |3   |4   |5   |6   |7   |8   |9   |410  |11   |12   |13   |14   |15   |16   |17   |18   |19   |true |next good check       |
//  +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
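
The if / else if branches from the question can be expressed with the same when chain, with one chain per derived column. This is a sketch under assumptions: the input rows are made up, only the field names (fieldA, fieldB, fldC) and the branch values come from the question, and the local SparkSession exists only to make the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("when-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input rows using the field names from the question.
val input = Seq(
  ("xyx", "b1", "c1"),
  ("112", "xy1", "c2"),
  ("112", "zzz", "c3")
).toDF("fieldA", "fieldB", "fldC")

// Name each condition once so both derived columns reuse it.
val isFirst  = col("fieldA") === "xyx"
val isSecond = col("fieldA") === "112" && col("fieldB") === "xy1"

// Conditions are tried in order; otherwise supplies the "NA" default.
val derived = input
  .withColumn("value1", when(isFirst, "ABC").when(isSecond, "zya").otherwise("NA"))
  .withColumn("value2", when(isFirst, "cbz").when(isSecond, "ab").otherwise("NA"))
```

Because the conditions are ordinary Column values, they stay visible to Spark's optimizer, unlike logic hidden inside a UDF.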
