Spark dataframe filter/where is not working for multiple conditions


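// Assumes a spark-shell style session where `spark` is an existing SparkSession.
import org.apache.spark.sql.functions.col
import spark.implicits._  // enables .toDF on a Seq of tuples

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse"),
  (10, null),
  (11, "")
).toDF("number", "word")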


Using the data frame above, I'm trying to filter out the rows whose word column is null or empty.

Trial - 1

someDF.filter(col("word") =!= "" || col("word").isNotNull).show(false)
+------+-----+
|number|word |
+------+-----+
|8     |bat  |
|64    |mouse|
|-27   |horse|
|11    |     |
+------+-----+

I used an OR condition, but it still does not remove the row whose word is an empty string.


Trial - 2

someDF.filter(col("word") =!= "").filter(col("word").isNotNull).show(false)
+------+-----+
|number|word |
+------+-----+
|8     |bat  |
|64    |mouse|
|-27   |horse|
+------+-----+


In trial 2 I chained two filters, and that removed both the null and the empty values from the data frame.


Trial - 3


someDF.filter(col("word") =!= "" && col("word").isNotNull).show(false)
+------+-----+
|number|word |
+------+-----+
|8     |bat  |
|64    |mouse|
|-27   |horse|
+------+-----+


In trial 3 I used an AND operation, and that also removed the null and empty values.

Can anyone please explain why it does not work with the OR operation? Is something wrong in my code?

In general Spark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-circuiting” semantics.

Therefore, it is dangerous to rely on the side effects or order of evaluation of Boolean expressions, and the order of WHERE and HAVING clauses, since such expressions and clauses can be reordered during query optimization and planning. Specifically, if a UDF relies on short-circuiting semantics in SQL for null checking, there's no guarantee that the null check will happen before invoking the UDF.

Now let's look at your examples.

Trial 1: someDF.filter(col("word") =!= "" || col("word").isNotNull).show(false)

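It's a logical OR, so it is enough for one side to be true. For the empty-string row: "" =!= "" -> false, but "".isNotNull -> true.

So the predicate is true for the empty word, and that row is kept rather than filtered out. The null row is still dropped because null =!= "" evaluates to null and null.isNotNull is false; null OR false is null, and filter keeps only rows where the predicate is true.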
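To make the OR case concrete, here is a minimal plain-Scala sketch of SQL's three-valued logic. It does not use Spark: `Tri`, `or`, `and`, `notEmpty`, `isNotNull` and `keeps` are hypothetical helper names introduced only for this illustration, with `None` standing in for SQL NULL. It models the documented semantics, not Spark's actual implementation.

```scala
// A plain-Scala model of SQL three-valued logic (no Spark required).
// None stands for SQL NULL ("unknown"). All names here are illustrative only.
object ThreeValuedLogic {
  type Tri = Option[Boolean]

  // SQL OR: true if either side is true, false only if both are false, else NULL.
  def or(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // SQL AND: false if either side is false, true only if both are true, else NULL.
  def and(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // Models col("word") =!= "": comparing NULL with anything yields NULL.
  def notEmpty(word: Option[String]): Tri = word.map(_ != "")

  // Models col("word").isNotNull: always true or false, never NULL.
  def isNotNull(word: Option[String]): Tri = Some(word.isDefined)

  // A filter keeps a row only when the predicate is exactly true.
  def keeps(predicate: Tri): Boolean = predicate.contains(true)

  def main(args: Array[String]): Unit = {
    val words: Seq[Option[String]] = Seq(Some("bat"), Some(""), None)
    for (w <- words) {
      val viaOr  = keeps(or(notEmpty(w), isNotNull(w)))
      val viaAnd = keeps(and(notEmpty(w), isNotNull(w)))
      println(s"word=$w keptByOr=$viaOr keptByAnd=$viaAnd")
    }
  }
}
```

In this model the empty string is kept by the OR predicate and dropped by the AND predicate, while the null word is dropped by both, which matches the three outputs shown in the question.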

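Trials 2 and 3 are equivalent: chaining two filters is the same as combining the conditions with a logical AND. For the empty word, "" =!= "" -> false, which is enough to make the whole expression false, so the row is filtered out. For the null word, isNotNull -> false, so that row is filtered out as well. AND is the operator you want here: "remove rows where word is null or empty" is the same as "keep rows where word is not null and not empty".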
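The equivalence between "remove null OR empty", "keep not-null AND not-empty", and two chained filters can be checked on an ordinary Scala collection. This is only a sketch under stated assumptions: it uses plain two-valued Booleans with `Option[String]` standing in for the nullable column, and `FilterEquivalence`, `isNull` and `isEmptyString` are hypothetical names, not Spark APIs.

```scala
// Plain-Scala check of the De Morgan rewrite (no Spark required).
object FilterEquivalence {
  // Same five rows as someDF; None plays the role of the null word.
  val rows: Seq[(Int, Option[String])] = Seq(
    (8, Some("bat")),
    (64, Some("mouse")),
    (-27, Some("horse")),
    (10, None),
    (11, Some(""))
  )

  def isNull(w: Option[String]): Boolean = w.isEmpty
  def isEmptyString(w: Option[String]): Boolean = w.contains("")

  // "Remove rows that are null OR empty", written as a negated OR.
  val removeNullOrEmpty: Seq[(Int, Option[String])] =
    rows.filter { case (_, w) => !(isNull(w) || isEmptyString(w)) }

  // De Morgan: NOT (a OR b) is the same as (NOT a) AND (NOT b).
  val keepNotNullAndNotEmpty: Seq[(Int, Option[String])] =
    rows.filter { case (_, w) => !isNull(w) && !isEmptyString(w) }

  // Two chained filters behave like a single AND of both conditions.
  val chained: Seq[(Int, Option[String])] =
    rows
      .filter { case (_, w) => !isEmptyString(w) }
      .filter { case (_, w) => !isNull(w) }

  def main(args: Array[String]): Unit = {
    println(removeNullOrEmpty.map(_._1))
    println(keepNotNullAndNotEmpty == removeNullOrEmpty)
    println(chained == removeNullOrEmpty)
  }
}
```

All three versions keep the same rows (numbers 8, 64 and -27). The OR belongs inside the negation; once the negation is pushed onto each condition, the connective becomes AND.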
