Spark dataframe filter issue
Coming from a SQL background here. I'm using
df1 = spark.read.jdbc
to load data from Azure SQL into a dataframe. I am trying to filter the data to exclude rows meeting the following criteria:
df2 = df1.filter("ItemID <> '75' AND Code1 <> 'SL'")
The dataframe ends up being empty, but when I run the equivalent SQL it is correct. When I change it to
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")
it produces the rows I want to filter out.
What is the best way to remove the rows meeting the criteria, so they can be pushed to a SQL server? Thank you
In the SQL world, <> checks whether the values of two operands are equal; if they are not equal, the condition becomes true. The equivalent in Spark SQL is !=. Thus your SQL condition inside filter becomes:
# A != B -> TRUE if expression A is not equivalent to expression B; otherwise FALSE
df2 = df1.filter("ItemID != '75' AND Code1 != 'SL'")
= has the same meaning in Spark SQL as in ANSI SQL:
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")
Use the & operator with != in PySpark (<> was removed in Python 3).
Example:
from pyspark.sql.functions import col

df = spark.createDataFrame([(75, 'SL'), (90, 'SL1')], ['ItemID', 'Code1'])
df.filter((col("ItemID") != '75') & (col("Code1") != 'SL')).show()
#or using negation
df.filter(~(col("ItemID") == '75') & ~(col("Code1") == 'SL')).show()
#+------+-----+
#|ItemID|Code1|
#+------+-----+
#| 90| SL1|
#+------+-----+
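If you just want to sanity-check the boolean logic without a Spark session, the same exclusion can be mirrored in plain Python on the sample rows used above. This is only an illustration of the conditions, not the PySpark API:

```python
# Plain-Python mirror of (ItemID != 75) & (Code1 != 'SL').
# The row data matches the createDataFrame call in the answer above.
rows = [(75, 'SL'), (90, 'SL1')]

kept = [(item_id, code1) for item_id, code1 in rows
        if item_id != 75 and code1 != 'SL']

print(kept)  # the (75, 'SL') row is dropped, leaving [(90, 'SL1')]
```

This matches the `show()` output above: only the (90, 'SL1') row survives the filter. Note that in Spark, `!=` on a column also drops rows where the column is NULL, which plain Python equality does not model.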