Here is the scenario from legacy code need to converted to spark scala. Any pointers will be highly appreciated.
Scenario: Need to add a new column to a dataframe/dataset using "withColumn" based on conditionally checking 20 - 22 other columns values and then derive this new column value. Any pointers on how to implement this in spark Scala? Thanks much. I have tried using UDF and passing a map of 22 column as key:value and If else checking with mutable variables but was informed by the experts from this forum that's not recommended so seeking guidance on what is the right way to achieve this?
OR using dataset.mappartitions and using the mutable variables inside this function is right way to do it?
val calculate = dataset.mapPartitions(partition => partition.map(x => {
var value1 = "NA"
var value1 = "NA"
set the values of the mutable variables value1 and value2 based on the column values
if ( x.fieldA ="xyx")
{
value1 = "ABC"
value2 = "cbz
}
eles if (x.fieldA ="112" & x.fieldB ="xy1")
{
value1 = "zya"
value2 = "ab"
}
df(
x.fldC
x.fldB
value1
value2
)
}
case class df(fldc:String,fldb:String,value1:String:value2:String)
Can you please let me know what other details I should provide as I have updated the question above?
I am new to this distributed/spark scala development so might be asking basic questions.
import spark.implicits._
import org.apache.spark.sql.functions._
val sourceDF = Seq(
(1,2,3,4,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
(11,12,13,14,5,6,7,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
(1,2,3,4,25,26,27,8,9,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", true),
(1,2,3,4,5,6,7,38,39,10, "11", "12", "13", "14", "15", "16", "17", "18", "19", false),
(1,2,3,4,5,6,7,8,9,410, "11", "12", "13", "14", "15", "16", "17", "18", "19", true)
).toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10",
"col11", "col12", "col13", "col14", "col15", "col16", "col17", "col18", "col19", "col20")
val resDF = sourceDF
.withColumn("col_result",
when(
'col1.equalTo(1) && 'col2.equalTo(2) && 'col3.equalTo(3) &&
'col4.equalTo(4) && 'col5.equalTo(5) && 'col6.equalTo(6) &&
'col7.equalTo(7) && 'col8.equalTo(8) && 'col9.equalTo(9) &&
'col10.equalTo(10) && 'col11.equalTo("11") && 'col12.equalTo("12") &&
'col13.equalTo("13") && 'col14.equalTo("14") && 'col15.equalTo("15") &&
'col16.equalTo("16") && 'col17.equalTo("17") && 'col18.equalTo("18") &&
'col19.equalTo("19") && 'col20.equalTo(true),"result").otherwise(null))
resDF.show(false)
// +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
// |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result|
// +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
// |1 |2 |3 |4 |5 |6 |7 |8 |9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |true |result |
// |11 |12 |13 |14 |5 |6 |7 |8 |9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |false|null |
// |1 |2 |3 |4 |25 |26 |27 |8 |9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |true |null |
// |1 |2 |3 |4 |5 |6 |7 |38 |39 |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |false|null |
// |1 |2 |3 |4 |5 |6 |7 |8 |9 |410 |11 |12 |13 |14 |15 |16 |17 |18 |19 |true |null |
// +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
Variant 2
val res1DF = sourceDF.withColumn("col_result_1",
when(col("col8") === 38 || col("col20") === false, "good check" )
.when(col("col10") === 410 && col("col17") === "17" && col("col20") === true, "next good check")
.otherwise("we use when many many")
)
res1DF.show(false)
// +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
// |col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col_result_1 |
// +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
// |1 |2 |3 |4 |5 |6 |7 |8 |9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |true |we use when many many|
// |11 |12 |13 |14 |5 |6 |7 |8 |9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |false|good check |
// |1 |2 |3 |4 |25 |26 |27 |8 |9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |true |we use when many many|
// |1 |2 |3 |4 |5 |6 |7 |38 |39 |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |false|good check |
// |1 |2 |3 |4 |5 |6 |7 |8 |9 |410 |11 |12 |13 |14 |15 |16 |17 |18 |19 |true |next good check |
// +----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------------------+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.