
Multiple condition filter on dataframe

Can anyone explain why I am getting different results for these two expressions? I am trying to filter between two dates:

df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

Result: 37M

vs

df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

Result: 25M

How are they different? It seems to me that they should produce the same result.

TL;DR: To pass multiple conditions to filter or where, use Column objects and logical operators (&, |, ~). See "Pyspark: multiple conditions in when clause".

from pyspark.sql.functions import col
df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))

You can also use a single SQL string:

df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")
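When there are several conditions, the key point is that they are joined inside one SQL string rather than with Python's and. A minimal sketch follows; sql_and is a hypothetical helper introduced here for illustration and is not part of PySpark:

```python
def sql_and(*conditions):
    """Join SQL condition strings with AND, parenthesising each one.

    Hypothetical helper for illustration; not a PySpark function.
    """
    if not conditions:
        raise ValueError("at least one condition is required")
    return " AND ".join(f"({c})" for c in conditions)


# Build a single SQL expression from the two date bounds in the question.
expr = sql_and("act_date >= '2016-10-01'", "act_date <= '2017-04-01'")
print(expr)
```

Assuming a DataFrame df like the one in the question, the resulting string could then be passed as df.filter(expr).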

In practice it makes more sense to use between:

df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")

The first approach is not even remotely valid. In Python, the and operator returns:

  • The last operand if all operands are "truthy".
  • The first "falsy" operand otherwise.

As a result,

"act_date <='2017-04-01'" and "act_date >='2016-10-01'"

evaluates to the following (any non-empty string is truthy):

"act_date >='2016-10-01'"
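This can be checked in plain Python, without Spark. A minimal sketch (the variable names are illustrative and do not come from the original post):

```python
# Two non-empty strings: both are truthy.
upper = "act_date <='2017-04-01'"
lower = "act_date >='2016-10-01'"

# `and` returns its last operand when every operand is truthy,
# so only the second condition survives.
combined = upper and lower
print(combined)

# With a falsy operand (an empty string), `and` returns that operand instead.
short_circuited = "" and lower
print(repr(short_circuited))
```

So filter only ever receives the lower-bound string; the upper bound is discarded before Spark sees it.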

In the first case

df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is every value on or after 2016-10-01, which also includes all the values after 2017-04-01.

Whereas in the second case

df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is the values between 2016-10-01 and 2017-04-01.
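The difference in counts can be imitated with a plain Python list standing in for the act_date column. This is only a sketch: the sample dates are made up, and it relies on ISO-formatted date strings (YYYY-MM-DD) sorting in chronological order when compared as text.

```python
# Hypothetical sample data standing in for the act_date column.
dates = ["2016-09-15", "2016-10-01", "2017-01-20", "2017-04-01", "2017-06-30"]

# First form: the `and` of two strings collapses to the lower-bound
# condition only, so nothing limits the dates from above.
first_form = [d for d in dates if d >= "2016-10-01"]

# Second form: the chained filters apply both bounds.
second_form = [d for d in dates if d <= "2017-04-01" and d >= "2016-10-01"]

print(len(first_form), len(second_form))
```

The first list is the larger of the two, in the same way that the first query in the question returns the larger count.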
