PySpark DataFrames: how to filter on multiple conditions with compact code?
If I have a list of column names and I want to keep the rows where the value in any of those columns is greater than zero, is there something like this that I can do?
columns = ['colA','colB','colC','colD','colE','colF']
new_df = df.filter(any([df[c]>0 for c in columns]))
This returns:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
I guess I could sum those columns and filter on the single resulting column, since I don't have negative numbers. But if I did, the sum trick wouldn't work. And in any case, if I had to filter those columns on a condition that a sum can't express, how could I do it? Any ideas?
You can use the or_ operator instead:
from operator import or_
from functools import reduce
newdf = df.where(reduce(or_, (df[c] > 0 for c in df.columns)))
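As a sketch of why the original any() call fails while reduce(or_, ...) works: Python's built-in any() calls bool() on each element, and a PySpark Column refuses that conversion, whereas `|` builds a new combined expression. The FakeColumn class below is a hypothetical stand-in written only for illustration; it is not the real pyspark.sql.Column, and it only mimics the two behaviours that matter here.

```python
from functools import reduce
from operator import or_


class FakeColumn:
    """Hypothetical stand-in for a PySpark Column; NOT the real class."""

    def __init__(self, expr):
        self.expr = expr

    def __or__(self, other):
        # Build a new expression instead of evaluating anything.
        return FakeColumn("({} OR {})".format(self.expr, other.expr))

    def __bool__(self):
        # Mirrors the refusal reported in the question's traceback.
        raise ValueError("Cannot convert column into bool")


preds = [FakeColumn("colA > 0"), FakeColumn("colB > 0"), FakeColumn("colC > 0")]

try:
    any(preds)  # any() calls bool() on the first element and fails
    any_failed = False
except ValueError:
    any_failed = True

# reduce applies | pairwise from the left: (p0 | p1) | p2
combined = reduce(or_, preds)

print(any_failed)     # True
print(combined.expr)  # ((colA > 0 OR colB > 0) OR colC > 0)
```

The same left-to-right chaining is what happens with real Column objects: the result is a single boolean expression that Spark evaluates per row.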
EDIT: A more Pythonic solution:
from pyspark.sql.functions import lit

def any_(*preds):
    # Start from a literal False so that OR-ing in each predicate is a no-op until one is true.
    cond = lit(False)
    for pred in preds:
        cond = cond | pred
    return cond

newdf = df.where(any_(*[df[c] > 0 for c in df.columns]))
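The question also asks about other conditions over a chosen list of columns. One possible generalization, sketched below, is a pair of small helpers that OR or AND any iterable of predicates. The names any_of and all_of are hypothetical (they are not part of PySpark), and plain Python booleans stand in for Column predicates so the snippet runs without a Spark session.

```python
from functools import reduce
from operator import and_, or_


def any_of(preds):
    """OR together an iterable of predicates (needs at least one)."""
    return reduce(or_, preds)


def all_of(preds):
    """AND together an iterable of predicates (needs at least one)."""
    return reduce(and_, preds)


# Plain booleans stand in for Column predicates in this sketch.
row = {"colA": -1, "colB": 0, "colC": 5}
columns = ["colA", "colB", "colC"]

print(any_of(row[c] > 0 for c in columns))  # True  (colC is positive)
print(all_of(row[c] > 0 for c in columns))  # False (colA and colB are not)
```

With the DataFrame from the question, the same helpers should be usable as df.where(any_of(df[c] > 0 for c in columns)) or df.where(all_of(df[c] > 0 for c in columns)), which restricts the test to the listed columns rather than df.columns. Unlike the lit(False) version above, reduce raises a TypeError on an empty list, so guard for that case if the column list can be empty.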
EDIT 2: Full example:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
      /_/
Using Python version 3.5.1 (default, Dec 7 2015 11:16:01)
SparkSession available as 'spark'.
In [1]: from pyspark.sql.functions import lit
In [2]: %paste
def any_(*preds):
    cond = lit(False)
    for pred in preds:
        cond = cond | pred
    return cond
## -- End pasted text --
In [3]: df = sc.parallelize([(1, 2, 3), (-1, -2, -3), (1, -1, 0)]).toDF()
In [4]: df.where(any_(*[df[c] > 0 for c in df.columns])).show()
# +---+---+---+
# | _1| _2| _3|
# +---+---+---+
# |  1|  2|  3|
# |  1| -1|  0|
# +---+---+---+
In [5]: df[any_(*[df[c] > 0 for c in df.columns])].show()
# +---+---+---+
# | _1| _2| _3|
# +---+---+---+
# |  1|  2|  3|
# |  1| -1|  0|
# +---+---+---+
In [6]: df.show()
# +---+---+---+
# | _1| _2| _3|
# +---+---+---+
# |  1|  2|  3|
# | -1| -2| -3|
# |  1| -1|  0|
# +---+---+---+