
How to filter pyspark dataframes

I have seen many questions related to filtering pyspark dataframes, but despite my best efforts I haven't been able to get any of the non-SQL solutions to work. My data looks like this:

+----------+-------------+-------+--------------------+--------------+---+
|purch_date|  purch_class|tot_amt|       serv-provider|purch_location| id|
+----------+-------------+-------+--------------------+--------------+---+
|03/11/2017|Uncategorized| -17.53|             HOVER  |              |  0|
|02/11/2017|    Groceries| -70.05|1774 MAC'S CONVEN...|     BRAMPTON |  1|
|31/10/2017|Gasoline/Fuel|    -20|              ESSO  |              |  2|
|31/10/2017|       Travel|     -9|TORONTO PARKING A...|      TORONTO |  3|
|30/10/2017|    Groceries|  -1.84|         LONGO'S # 2|              |  4|
+----------+-------------+-------+--------------------+--------------+---+

This did not work:

df1 = spark.read.csv("/some/path/to/file", sep=',')\
            .filter((col('purch_location')=='BRAMPTON')

And this did not work:

df1 = spark.read.csv("/some/path/to/file", sep=',')\
            .filter(purch_location == 'BRAMPTON')

This (a SQL expression) works but takes a VERY long time; I imagine there's a faster non-SQL approach:

df1 = spark.read.csv("/some/path/to/file", sep=',')\
            .filter("purch_location == 'BRAMPTON'")
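For comparison, a fully SQL version of the same query would look something like this (a sketch; the temp-view name purchases is made up for illustration):

df1 = spark.read.csv("/some/path/to/file", sep=',')
# Register the DataFrame as a temporary view so it can be queried with SQL
df1.createOrReplaceTempView("purchases")
df2 = spark.sql("SELECT * FROM purchases WHERE purch_location = 'BRAMPTON'")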

UPDATE: I should mention I am able to use methods like the following (which run faster than the SQL expression):

df1 = spark.read.csv("/some/path/to/file", sep=',')
df2 = df1.filter(df1.purch_location == "BRAMPTON")

But I want to understand why the "pipe" / line-continuation syntax is incorrect.

You can use df["purch_location"]:

df = spark.read.csv("/some/path/to/file", sep=',')
df = df.filter(df["purch_location"] == "BRAMPTON")
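All three ways of referencing a column are equivalent in PySpark, so this is purely a style choice (a quick sketch reusing the same df):

from pyspark.sql.functions import col

# These all produce the same filtered DataFrame
df.filter(df["purch_location"] == "BRAMPTON")
df.filter(df.purch_location == "BRAMPTON")
df.filter(col("purch_location") == "BRAMPTON")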

If you insist on using the backslash, you can do:

from pyspark.sql.functions import col

df = spark.read.csv('/some/path/to/file', sep=',') \
     .filter(col('purch_location') == 'BRAMPTON')

Your first attempt failed because the parentheses are not balanced.
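For completeness, the minimal fix to that first attempt is just the missing closing parenthesis, plus the col import:

from pyspark.sql.functions import col

df1 = spark.read.csv("/some/path/to/file", sep=',')\
            .filter((col('purch_location') == 'BRAMPTON'))

The extra pair of parentheses is harmless here; it only matters when you combine conditions with & or |, which bind more tightly than ==.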

Also, it seems there are some trailing spaces after the string BRAMPTON, so you might want to trim the column first:

from pyspark.sql.functions import col, trim

df = spark.read.csv('/some/path/to/file', sep=',') \
     .filter(trim(col('purch_location')) == 'BRAMPTON')
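If you want to confirm the trailing whitespace is really there, you can inspect the raw values first (a quick check using length; same df as above):

from pyspark.sql.functions import col, length

# Trailing spaces inflate the character count of each value
df.select('purch_location', length(col('purch_location')).alias('len')) \
  .distinct() \
  .show(truncate=False)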
