
PySpark Dataframes: how to filter on multiple conditions with compact code?

If I have a list of column names and I want to filter rows where any of those columns has a value greater than zero, is there something similar to this that I can do?

columns = ['colA','colB','colC','colD','colE','colF']
new_df = df.filter(any([df[c]>0 for c in columns]))

This returns:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions

I guess I could sum those columns and then filter on that single column (since I don't have negative numbers), but if I did have negative numbers, the sum trick wouldn't work. And in any case, if I had to filter those columns on a condition other than the sum, how could I do what I want to do? Any idea?
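For completeness, a minimal sketch of that sum-based workaround; it relies on the assumption that every listed column is numeric and non-negative, so the row sum is greater than zero exactly when at least one column is:

from functools import reduce
from operator import add
from pyspark.sql.functions import col

columns = ['colA', 'colB', 'colC', 'colD', 'colE', 'colF']
# valid only because all values are assumed non-negative
new_df = df.filter(reduce(add, (col(c) for c in columns)) > 0)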

You can use the or_ operator instead:

from operator import or_
from functools import reduce

# reduce chains the predicates into (df[c1] > 0) | (df[c2] > 0) | ...
newdf = df.where(reduce(or_, (df[c] > 0 for c in df.columns)))
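Applied to the specific list of columns from the question (assuming df actually contains columns with those names), the same pattern reads:

columns = ['colA', 'colB', 'colC', 'colD', 'colE', 'colF']
new_df = df.filter(reduce(or_, (df[c] > 0 for c in columns)))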

EDIT: A more Pythonic solution:

from pyspark.sql.functions import lit

def any_(*preds):
    # start from a literal False column and OR each predicate onto it
    cond = lit(False)
    for pred in preds:
        cond = cond | pred
    return cond

newdf = df.where(any_(*[df[c] > 0 for c in df.columns]))

EDIT 2: Full example:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
      /_/

Using Python version 3.5.1 (default, Dec  7 2015 11:16:01)
SparkSession available as 'spark'.

In [1]: from pyspark.sql.functions import lit

In [2]: %paste
def any_(*preds):
    cond = lit(False)
    for pred in preds:
        cond = cond | pred
    return cond

## -- End pasted text --

In [3]: df = sc.parallelize([(1, 2, 3), (-1, -2, -3), (1, -1, 0)]).toDF()

In [4]: df.where(any_(*[df[c] > 0 for c in df.columns])).show()
# +---+---+---+
# | _1| _2| _3|
# +---+---+---+
# |  1|  2|  3|
# |  1| -1|  0|
# +---+---+---+

In [5]: df[any_(*[df[c] > 0 for c in df.columns])].show()
# +---+---+---+
# | _1| _2| _3|
# +---+---+---+
# |  1|  2|  3|
# |  1| -1|  0|
# +---+---+---+

In [6]: df.show()
# +---+---+---+
# | _1| _2| _3|
# +---+---+---+
# |  1|  2|  3|
# | -1| -2| -3|
# |  1| -1|  0|
# +---+---+---+
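As a side note (not part of the original answer), for numeric columns the same "any column > 0" filter can also be written with pyspark.sql.functions.greatest, which takes at least two columns and skips null values. A minimal sketch on the example DataFrame above:

from pyspark.sql.functions import greatest

# keep rows whose largest value across all columns is positive,
# i.e. at least one column is > 0
newdf = df.where(greatest(*[df[c] for c in df.columns]) > 0)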
