How to get the value in a Spark DataFrame based on a condition

I have a DataFrame like the one below. The column values are in the format 1_3_del_8_3, which is basically two values delimited by "_del_". So here we get two parts, 1_3 and 8_3, and the second part can be ignored. I am looking for the best approach to get the first part of the last (right-most) column whose first part is not 0_0.

Also, I have 50+ columns in the DF.

Sample DataFrame

+-----------+-----------+-----------+
|         c1|         c2|         c3|
+-----------+-----------+-----------+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3|
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3|
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3|
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3|
+-----------+-----------+-----------+

Expected Result

+-----------+-----------+-----------+-----+
|         c1|         c2|         c3|C4Out|
+-----------+-----------+-----------+-----+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3|  2_3|
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3|  2_5|
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3|  2_8|
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3|  7_3|
+-----------+-----------+-----------+-----+

My attempt:

val df1=Seq(
("1_3_del_8_3","2_3_del_6_3","0_0_del_8_3"),
("2_9_del_4_3","0_0_del_4_3","2_5_del_4_3"),
("2_8_del_4_3","0_0_del_4_3","0_0_del_4_3"),
("5_3_del_4_3","2_3_del_4_3","7_3_del_4_3")
).toDF("c1","c2","c3")

I then tried splitting the columns on the delimiter, but got stuck on how to proceed from there. I searched a lot but didn't find a similar question here.
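
For reference, the splitting step I got to looked roughly like this (split on the _del_ delimiter and keep the part before it):

import org.apache.spark.sql.functions._

// Split e.g. "1_3_del_8_3" on "_del_" and keep the first part, e.g. "1_3"
df1.withColumn("c1_first", split(col("c1"), "_del_").getItem(0)).show()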

You can use when() to express this conditional check:

when(Condition,value_if_condition_true).otherwise(value_if_condition_false)
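
For instance, a minimal sketch on a single column of your sample data (the output column name here is illustrative):

// Label each row by whether c1's first part is "0_0"
df.withColumn("c1_usable",
  when(substring(col("c1"), 1, 3) =!= "0_0", "yes").otherwise("no")
).show()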

For your use case we can start checking from the right side: if c3 satisfies the condition, the first part of c3 is selected; otherwise c2 is checked, then c1; if no column matches, the result is null.

scala> df.show()
+-----------+-----------+-----------+
|         c1|         c2|         c3|
+-----------+-----------+-----------+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3|
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3|
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3|
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3|
+-----------+-----------+-----------+

scala> df.withColumn("C4Out",
when(substring($"c3",1,3) =!= "0_0",substring($"c3",1,3))
.when(substring($"c2",1,3) =!= "0_0",substring($"c2",1,3))
.when(substring($"c1",1,3) =!= "0_0",substring($"c1",1,3))
.otherwise(null)).show()
+-----------+-----------+-----------+-----+
|         c1|         c2|         c3|C4Out|
+-----------+-----------+-----------+-----+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3|  2_3|
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3|  2_5|
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3|  2_8|
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3|  7_3|
+-----------+-----------+-----------+-----+
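
Chaining when clauses by hand will not scale well to your 50+ columns. A sketch of the same right-to-left fallback built dynamically with coalesce (which returns the first non-null expression) could look like this:

import org.apache.spark.sql.functions._

// First part of each column, or null when that part is "0_0"
val parts = df.columns.map(c => when(substring(col(c), 1, 3) =!= "0_0", substring(col(c), 1, 3)))

// Reverse so the right-most matching column wins, then pick the first non-null
df.withColumn("C4Out", coalesce(parts.reverse: _*)).show()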

Check the code below to get the same result dynamically, so it also works with 50+ columns.

Logic

Step 1: Concatenate the first parts of all columns, skipping any column whose first part is 0_0, separated by a space.

For example, take the first row, |1_3_del_8_3|2_3_del_6_3|0_0_del_8_3|: combining the first parts of all non-0_0 columns with a space gives 1_3 2_3.

Step 2: Split the combined string on the space: Array(1_3, 2_3).

Step 3: Take the last element of the array: 2_3 for the first row.

Follow the same steps for the remaining rows. Please check the code below.

scala> val df = Seq(("1_3_del_8_3","2_3_del_6_3","0_0_del_8_3"),("2_9_del_4_3","0_0_del_4_3","2_5_del_4_3"),("2_8_del_4_3","0_0_del_4_3","0_0_del_4_3"),("5_3_del_4_3","2_3_del_4_3","7_3_del_4_3")).toDF("c1","c2","c3") // Creating DF.
df: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 1 more field]

scala> val columns = df.columns // All column names in the DataFrame
columns: Array[String] = Array(c1, c2, c3)

scala> val exprs = columns.map(c => when(substring(col(c),1,3) =!= "0_0",substring(col(c),1,3))) // Creating expressions: first part of each column, or null when it is "0_0"
exprs: Array[org.apache.spark.sql.Column] = Array(CASE WHEN (NOT (substring(c1, 1, 3) = 0_0)) THEN substring(c1, 1, 3) END, CASE WHEN (NOT (substring(c2, 1, 3) = 0_0)) THEN substring(c2, 1, 3) END, CASE WHEN (NOT (substring(c3, 1, 3) = 0_0)) THEN substring(c3, 1, 3) END)

scala> val array = split(concat_ws(" ",exprs:_*)," ") // Combining the non-null parts with a space, then splitting back into an array
array: org.apache.spark.sql.Column = split(concat_ws( , CASE WHEN (NOT (substring(c1, 1, 3) = 0_0)) THEN substring(c1, 1, 3) END, CASE WHEN (NOT (substring(c2, 1, 3) = 0_0)) THEN substring(c2, 1, 3) END, CASE WHEN (NOT (substring(c3, 1, 3) = 0_0)) THEN substring(c3, 1, 3) END),  )

scala> val element = array((size(array)-1).cast("int")) // Taking the last element of the array
element: org.apache.spark.sql.Column = split(concat_ws( , CASE WHEN (NOT (substring(c1, 1, 3) = 0_0)) THEN substring(c1, 1, 3) END, CASE WHEN (NOT (substring(c2, 1, 3) = 0_0)) THEN substring(c2, 1, 3) END, CASE WHEN (NOT (substring(c3, 1, 3) = 0_0)) THEN substring(c3, 1, 3) END),  )[CAST((size(split(concat_ws( , CASE WHEN (NOT (substring(c1, 1, 3) = 0_0)) THEN substring(c1, 1, 3) END, CASE WHEN (NOT (substring(c2, 1, 3) = 0_0)) THEN substring(c2, 1, 3) END, CASE WHEN (NOT (substring(c3, 1, 3) = 0_0)) THEN substring(c3, 1, 3) END),  )) - 1) AS INT)]

scala> spark.time{ df.withColumn("c4",element).show} // showing final result.
+-----------+-----------+-----------+---+
|         c1|         c2|         c3| c4|
+-----------+-----------+-----------+---+
|1_3_del_8_3|2_3_del_6_3|0_0_del_8_3|2_3|
|2_9_del_4_3|0_0_del_4_3|2_5_del_4_3|2_5|
|2_8_del_4_3|0_0_del_4_3|0_0_del_4_3|2_8|
|5_3_del_4_3|2_3_del_4_3|7_3_del_4_3|7_3|
+-----------+-----------+-----------+---+

Time taken: 20 ms
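
As a side note, on Spark 2.4+ you can fetch the last array element directly with element_at, which accepts a negative index, instead of indexing with size(array)-1:

// element_at(..., -1) returns the last element of the array (Spark 2.4+)
df.withColumn("c4", element_at(array, -1)).show()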

