Pyspark dataframe filter OR condition
I am trying to filter my pyspark dataframe based on an OR condition like so:
filtered_df = file_df.filter(file_df.dst_name == "ntp.obspm.fr").filter(file_df.fw == "4940" | file_df.fw == "4960")
I want to return only rows where file_df.fw == "4940" OR file_df.fw == "4960". However, when I try this I get this error:
Py4JError: An error occurred while calling o157.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist
What am I doing wrong?
Without the OR condition it works when I try to filter on only one condition (file_df.fw == "4940").
The error message is caused by the different priorities of the operators. The | (OR) has a higher priority than the comparison operator ==. Spark therefore tries to apply the OR to "4940" and file_df.fw, and not, as you want, to (file_df.fw == "4940") and (file_df.fw == "4960"). You can change the priorities by using brackets. Have a look at the following example:
columns = ['dst_name', 'fw']
file_df = spark.createDataFrame([('ntp.obspm.fr', '3000'),
                                 ('ntp.obspm.fr', '4940'),
                                 ('ntp.obspm.fr', '4960'),
                                 ('ntp.obspm.de', '4940')],
                                columns)

# here I have added the brackets
filtered_df = file_df.filter(file_df.dst_name == "ntp.obspm.fr").filter((file_df.fw == "4940") | (file_df.fw == "4960"))
filtered_df.show()
Output:
+------------+----+
| dst_name| fw|
+------------+----+
|ntp.obspm.fr|4940|
|ntp.obspm.fr|4960|
+------------+----+
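As an aside, the precedence problem is plain-Python behavior rather than anything Spark-specific. The following sketch uses ordinary integers (no Spark required; the values 4940 and 4960 are just borrowed from the question) to show how the unbracketed expression is parsed differently:

```python
a = 4940

# Without brackets: `|` binds tighter than `==`, so this is parsed as
# the chained comparison  a == (4940 | a) == 4960,  i.e.
# (a == (4940 | a)) and ((4940 | a) == 4960)
unbracketed = a == 4940 | a == 4960   # -> False, even though a == 4940

# With brackets: two separate comparisons combined with OR,
# which is the form the PySpark filter needs
bracketed = (a == 4940) | (a == 4960)  # -> True

print(unbracketed, bracketed)
```

In the PySpark case the unbracketed version fails earlier and louder: Python evaluates "4940" | file_df.fw first, which ends up calling the JVM-side or method with a plain string, producing the Py4JException from the question.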