How to filter pyspark dataframes
I have seen many questions about filtering PySpark dataframes, but despite my best efforts I haven't been able to get any of the non-SQL solutions to work.
+----------+-------------+-------+--------------------+--------------+---+
|purch_date|  purch_class|tot_amt|       serv-provider|purch_location| id|
+----------+-------------+-------+--------------------+--------------+---+
|03/11/2017|Uncategorized| -17.53|               HOVER|              |  0|
|02/11/2017|    Groceries| -70.05|1774 MAC'S CONVEN...|BRAMPTON      |  1|
|31/10/2017|Gasoline/Fuel|    -20|                ESSO|              |  2|
|31/10/2017|       Travel|     -9|TORONTO PARKING A...|TORONTO       |  3|
|30/10/2017|    Groceries|  -1.84|         LONGO'S # 2|              |  4|
+----------+-------------+-------+--------------------+--------------+---+
This did not work:
df1 = spark.read.csv("/some/path/to/file", sep=',')\
.filter((col('purch_location')=='BRAMPTON')
And this did not work:
df1 = spark.read.csv("/some/path/to/file", sep=',')\
.filter(purch_location == 'BRAMPTON')
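The second attempt fails before Spark is even involved: `purch_location` is not a Python variable in scope, so the interpreter raises a `NameError` as soon as it evaluates the expression. A minimal plain-Python reproduction (no Spark required):

```python
# Referencing a column by a bare name fails at the Python level:
# purch_location is not defined as a variable, so evaluation stops
# with a NameError before .filter() is ever called.
try:
    purch_location == 'BRAMPTON'
except NameError as e:
    print(type(e).__name__)  # NameError
```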
This (SQL expression) works but takes a VERY long time; I imagine there's a faster non-SQL approach:
df1 = spark.read.csv("/some/path/to/file", sep=',')\
.filter(purch_location == 'BRAMPTON')
UPDATE: I should mention that I am able to use methods like the following (which runs faster than the SQL expression):
df1 = spark.read.csv("/some/path/to/file", sep=',')
df2 = df1.filter(df1.purch_location == "BRAMPTON")
But I want to understand why the "pipe" / continuation syntax is incorrect.
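For what it's worth, the backslash itself is ordinary Python line continuation and is not the problem; chaining methods across continued lines is valid, as this plain-string stand-in (no Spark needed) shows:

```python
# Line continuation with \ plus method chaining is valid Python;
# the failed attempts break for other reasons (unbalanced
# parentheses, an undefined name), not because of the backslash.
cleaned = "  BRAMPTON  " \
    .strip() \
    .lower()
print(cleaned)  # brampton
```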
You can use df["purch_location"]:
df = spark.read.csv("/some/path/to/file", sep=',')
df = df.filter(df["purch_location"] == "BRAMPTON")
If you insist on using the backslash, you can do:
from pyspark.sql.functions import col
df = spark.read.csv('/some/path/to/file', sep=',') \
.filter(col('purch_location') == 'BRAMPTON')
Your first attempt failed because the brackets are not balanced.
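You can confirm the imbalance by asking Python to parse the offending line; `ast.parse` rejects it with a `SyntaxError` (a small standalone check, with the snippet inlined as a string):

```python
import ast

# The first attempt opens two parentheses after .filter but closes
# only one of them, so the expression cannot even be parsed.
snippet = "df.filter((col('purch_location')=='BRAMPTON')"
try:
    ast.parse(snippet)
except SyntaxError:
    print("SyntaxError: unbalanced parentheses")
```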
Also, it seems there are some spaces after the string BRAMPTON, so you might want to trim the column first:
from pyspark.sql.functions import col, trim
df = spark.read.csv('/some/path/to/file', sep=',') \
.filter(trim(col('purch_location')) == 'BRAMPTON')
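The reason trimming matters: string equality in Spark, as in Python, is exact, so trailing spaces make the comparison fail silently. A plain-Python illustration of what `trim` fixes:

```python
# A value stored with trailing spaces never equals the bare string;
# stripping the whitespace first (which is what Spark's trim() does
# to the column) restores the match.
raw = "BRAMPTON      "
print(raw == "BRAMPTON")          # False
print(raw.strip() == "BRAMPTON")  # True
```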