
Logical operators for Boolean indexing in Pandas

I'm working with a Boolean index in Pandas.

The question is why the statement:

a[(a['some_column']==some_number) & (a['some_other_column']==some_other_number)]

works fine, whereas

a[(a['some_column']==some_number) and (a['some_other_column']==some_other_number)]

exits with an error?

Example:

a = pd.DataFrame({'x':[1,1],'y':[10,20]})

In: a[(a['x']==1)&(a['y']==10)]
Out:    x   y
     0  1  10

In: a[(a['x']==1) and (a['y']==10)]
Out: ValueError: The truth value of an array with more than one element is ambiguous.     Use a.any() or a.all()

When you say

(a['x']==1) and (a['y']==10)

You are implicitly asking Python to convert (a['x']==1) and (a['y']==10) to Boolean values.

NumPy arrays (of length greater than 1) and Pandas objects such as Series do not have a Boolean value. In other words, they raise

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

when used as a Boolean value. That's because it's unclear when it should be True or False. Some users might assume they are True if they have non-zero length, like a Python list. Others might desire for it to be True only if all its elements are True. Others might want it to be True if any of its elements are True.

Because there are so many conflicting expectations, the designers of NumPy and Pandas refuse to guess, and instead raise a ValueError.

Instead, you must be explicit, by checking the empty attribute or calling the all() or any() method to indicate which behavior you desire.
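As a minimal sketch of the point above (the Series s is made up for illustration), the ambiguous conversion raises while the explicit alternatives each give a single answer:

```python
import pandas as pd

s = pd.Series([True, False, True])  # illustrative data, not from the question

# Asking for one truth value of a whole Series is ambiguous, so pandas raises.
try:
    bool(s)
    raised = False
except ValueError:
    raised = True

# Being explicit removes the ambiguity.
has_any = bool(s.any())   # at least one element is True
all_true = bool(s.all())  # every element is True
is_empty = s.empty        # note: `empty` is a property, not a method
```

Here raised and has_any come out True, while all_true and is_empty come out False.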

In this case, however, it looks like you do not want Boolean evaluation; you want element-wise logical AND. That is what the & binary operator performs:

(a['x']==1) & (a['y']==10)

returns a boolean array.


By the way, as alexpmil notes, the parentheses are mandatory since & has a higher operator precedence than ==.

Without the parentheses, a['x']==1 & a['y']==10 would be evaluated as a['x'] == (1 & a['y']) == 10, which would in turn be equivalent to the chained comparison (a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10). That is an expression of the form Series and Series. The use of and with two Series would again trigger the same ValueError as above. That's why the parentheses are mandatory.
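The difference can be checked directly on the DataFrame from the question. This is only a sketch; the variable names mask and error are mine:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 1], 'y': [10, 20]})

# With parentheses: element-wise AND of two boolean Series.
mask = (a['x'] == 1) & (a['y'] == 10)

# Without parentheses: Python builds a chained comparison, which needs the
# truth value of a Series and therefore raises.
try:
    a['x'] == 1 & a['y'] == 10
    error = None
except ValueError as exc:
    error = exc
```

mask selects only the first row, and error holds the ValueError discussed above.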

TLDR: Logical operators in Pandas are &, | and ~, and parentheses (...) are important!

Python's and, or and not logical operators are designed to work with scalars. So Pandas had to do one better and override the bitwise operators to achieve a vectorized (element-wise) version of this functionality.

So the following in Python (exp1 and exp2 are expressions which evaluate to a boolean result)...

exp1 and exp2              # Logical AND
exp1 or exp2               # Logical OR
not exp1                   # Logical NOT

...will translate to...

exp1 & exp2                # Element-wise logical AND
exp1 | exp2                # Element-wise logical OR
~exp1                      # Element-wise logical NOT

for pandas.

If in the process of performing a logical operation you get a ValueError, then you need to use parentheses for grouping:

(exp1) op (exp2)

For example,

(df['col1'] == x) & (df['col2'] == y) 

And so on.
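To make the translation concrete, here is a small sketch with two made-up boolean masks (m1 and m2 are illustrative names), comparing the element-wise operators against Python's scalar operators applied pair by pair:

```python
import pandas as pd

m1 = pd.Series([True, True, False, False])
m2 = pd.Series([True, False, True, False])

# Element-wise versions, using the overloaded bitwise operators.
and_result = (m1 & m2).tolist()
or_result = (m1 | m2).tolist()
not_result = (~m1).tolist()

# The scalar operators give the same answers when applied one pair at a time.
scalar_and = [bool(x and y) for x, y in zip(m1, m2)]
scalar_or = [bool(x or y) for x, y in zip(m1, m2)]
scalar_not = [not x for x in m1]
```

Each element-wise result matches its scalar counterpart, which is the sense in which & , | and ~ stand in for and , or and not.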


Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: & for logical AND, | for logical OR, and ~ for logical NOT.

Consider the following setup:

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df

   A  B  C
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1

Logical AND

For df above, say you'd like to return all rows where A < 5 and B > 5. This is done by computing masks for each condition separately, and ANDing them.

Overloaded Bitwise & Operator
Before continuing, please take note of this particular excerpt of the docs, which states

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

So, with this in mind, element-wise logical AND can be implemented with the bitwise operator &:

df['A'] < 5

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'] > 5

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

(df['A'] < 5) & (df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

And the subsequent filtering step is simply,

df[(df['A'] < 5) & (df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

The parentheses are used to override the default precedence order of bitwise operators, which have higher precedence than the conditional operators < and >. See the section on Operator Precedence in the Python docs.

If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as

df['A'] < 5 & df['B'] > 5

It is parsed as

df['A'] < (5 & df['B']) > 5

Which becomes,

df['A'] < something_you_dont_want > 5

Which becomes (see the Python docs on chained operator comparison),

(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)

Which becomes,

# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2

Which throws

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So, don't make that mistake! 1
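If you want to see the parse for yourself without evaluating anything, the standard-library ast module can show it. This is a sketch that assumes Python 3.9 or later (for ast.unparse); a and b are placeholder names standing in for the two columns:

```python
import ast

# Parse (but do not evaluate) the expression with and without parentheses.
bad = ast.parse("a < 5 & b > 5", mode="eval").body
good = ast.parse("(a < 5) & (b > 5)", mode="eval").body

bad_kind = type(bad).__name__                  # a chained comparison node
bad_middle = ast.unparse(bad.comparators[0])   # the operand & was applied to
good_kind = type(good).__name__                # a binary operation node
```

bad_kind is 'Compare' with '5 & b' as its middle operand, whereas good_kind is 'BinOp', i.e. a single & applied to two comparisons.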

Avoiding Parentheses Grouping
The fix is actually quite simple. Most operators have a corresponding bound method for DataFrames. If the individual masks are built up using functions instead of conditional operators, you will no longer need to group by parens to specify evaluation order:

df['A'].lt(5)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'].gt(5)

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

df['A'].lt(5) & df['B'].gt(5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

See the section on Flexible Comparisons. To summarise, we have

╒════╤════════════╤════════════╕
│    │ Operator   │ Function   │
╞════╪════════════╪════════════╡
│  0 │ >          │ gt         │
├────┼────────────┼────────────┤
│  1 │ >=         │ ge         │
├────┼────────────┼────────────┤
│  2 │ <          │ lt         │
├────┼────────────┼────────────┤
│  3 │ <=         │ le         │
├────┼────────────┼────────────┤
│  4 │ ==         │ eq         │
├────┼────────────┼────────────┤
│  5 │ !=         │ ne         │
╘════╧════════════╧════════════╛
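As a quick sanity check of the table, the sketch below compares each operator with its method on column A from the setup above (the values are copied by hand; pairs and agree are illustrative names):

```python
import operator

import pandas as pd

s = pd.Series([5, 3, 3, 4, 8], name='A')  # column A from the setup above

# Each comparison operator paired with the name of its method counterpart.
pairs = [
    (operator.gt, 'gt'), (operator.ge, 'ge'),
    (operator.lt, 'lt'), (operator.le, 'le'),
    (operator.eq, 'eq'), (operator.ne, 'ne'),
]

# True for a pair when operator and method produce identical Series.
agree = {name: op(s, 5).equals(getattr(s, name)(5)) for op, name in pairs}
```

Every entry in agree is True, so the method form can be swapped in wherever the parentheses are a nuisance.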

Another option for avoiding parentheses is to use DataFrame.query (or eval):

df.query('A < 5 and B > 5')

   A  B  C
1  3  7  9
3  4  7  6

I have extensively documented query and eval in Dynamic Expression Evaluation in pandas using pd.eval().
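A short sketch of how the query form relates to the mask form. The DataFrame is rebuilt by hand from the printed setup, and threshold is a hypothetical local variable used to show the @ prefix for referencing Python names inside the query string:

```python
import pandas as pd

# Same values as the df printed in the setup above.
df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

threshold = 5  # hypothetical variable, referenced in the query with @

by_mask = df[(df['A'] < threshold) & (df['B'] > 5)]
by_query = df.query('A < @threshold and B > 5')
```

Both selections return rows 1 and 3. Inside the query string the word and is fine, because pandas parses the string itself instead of handing it to Python's boolean evaluation.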

operator.and_
Allows you to perform this operation in a functional manner. Internally calls Series.__and__, which corresponds to the bitwise operator.

import operator 

operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5) 

0    False
1     True
2    False
3     True
4    False
dtype: bool

df[operator.and_(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

You won't usually need this, but it is useful to know.

Generalizing: np.logical_and (and logical_and.reduce)
Another alternative is using np.logical_and, which also does not need parentheses grouping:

np.logical_and(df['A'] < 5, df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
Name: A, dtype: bool

df[np.logical_and(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

np.logical_and is a ufunc (Universal Functions), and most ufuncs have a reduce method. This means it is easier to generalise with logical_and if you have multiple masks to AND. For example, to AND masks m1 and m2 and m3 with &, you would have to do

m1 & m2 & m3

However, an easier option is

np.logical_and.reduce([m1, m2, m3])

This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and ANDing all of them):

import operator

cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]

m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m 
# array([False,  True, False,  True, False])

df[m]
   A  B  C
1  3  7  9
3  4  7  6
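One thing worth noticing is that np.logical_and.reduce hands back a plain NumPy array. If you would rather keep a Series with its index, one possible alternative (my suggestion, not part of the approach above) is functools.reduce with operator.and_. The third condition on C is invented for illustration:

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd

# Same values as the df printed in the setup above.
df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

# Three masks; the condition on C is made up for this example.
masks = [df['A'] < 5, df['B'] > 5, df['C'] > 3]

as_array = np.logical_and.reduce(masks)    # NumPy array, index dropped
as_series = reduce(operator.and_, masks)   # pandas Series, index kept
```

Both hold the same booleans and either can be used to index df; the choice mostly matters if you want to keep working with the mask as a labelled Series.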

1 - I know I'm harping on this point, but please bear with me. This is a very, very common beginner's mistake, and must be explained very thoroughly.


Logical OR

For the df above, say you'd like to return all rows where A == 3 or B == 7.

Overloaded Bitwise |

df['A'] == 3

0    False
1     True
2     True
3    False
4    False
Name: A, dtype: bool

df['B'] == 7

0    False
1     True
2    False
3     True
4    False
Name: B, dtype: bool

(df['A'] == 3) | (df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[(df['A'] == 3) | (df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

If you haven't yet, please also read the section on Logical AND above; all caveats apply here.

Alternatively, this operation can be specified with

df[df['A'].eq(3) | df['B'].eq(7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

operator.or_
Calls Series.__or__ under the hood.

operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[operator.or_(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

np.logical_or
For two conditions, use logical_or:

np.logical_or(df['A'] == 3, df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df[np.logical_or(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

For multiple masks, use logical_or.reduce:

np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False,  True,  True,  True, False])

df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

Logical NOT

Given a mask, such as

mask = pd.Series([True, True, False])

If you need to invert every boolean value (so that the end result is [False, False, True]), then you can use any of the methods below.

Bitwise ~

~mask

0    False
1    False
2     True
dtype: bool

Again, expressions need to be parenthesised.

~(df['A'] == 3)

0     True
1    False
2    False
3     True
4     True
Name: A, dtype: bool

This internally calls

mask.__invert__()

0    False
1    False
2     True
dtype: bool

But don't use it directly.

operator.inv
Internally calls __invert__ on the Series.

operator.inv(mask)

0    False
1    False
2     True
dtype: bool

np.logical_not
This is the numpy variant.

np.logical_not(mask)

0    False
1    False
2     True
dtype: bool

Note that, for boolean masks, np.logical_and can be replaced with np.bitwise_and, logical_or with bitwise_or, and logical_not with invert.
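A small check of that interchangeability, restricted to boolean masks (m1 and m2 are made-up masks; the comparison uses np.array_equal so it does not depend on whether a Series or an array comes back):

```python
import numpy as np
import pandas as pd

m1 = pd.Series([True, True, False, False])
m2 = pd.Series([True, False, True, False])

# On boolean data the logical and bitwise ufuncs give the same answers.
same_and = np.array_equal(np.logical_and(m1, m2), np.bitwise_and(m1, m2))
same_or = np.array_equal(np.logical_or(m1, m2), np.bitwise_or(m1, m2))
same_not = np.array_equal(np.logical_not(m1), np.invert(m1))
```

All three flags are True here. The equivalence should not be assumed for integer data, where the bitwise functions operate on the bits of each value.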

Logical operators for boolean indexing in Pandas

It's important to realize that you cannot use any of the Python logical operators (and, or or not) on pandas.Series or pandas.DataFrames (similarly you cannot use them on numpy.arrays with more than one element). The reason you cannot use them is that they implicitly call bool on their operands, which throws an exception because these data structures decided that the boolean of an array is ambiguous:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I did cover this more extensively in my answer to the "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" Q+A.

NumPy's logical functions

However, NumPy provides element-wise operating equivalents to these operators as functions that can be used on numpy.array, pandas.Series, pandas.DataFrame, or any other (conforming) numpy.array subclass.

So, essentially, one should use (assuming df1 and df2 are Pandas DataFrames):

np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)
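For a concrete picture, here is a sketch with two tiny single-column DataFrames (the data and the column name p are invented). The results are converted with np.asarray before inspection so the check does not depend on the exact return type:

```python
import numpy as np
import pandas as pd

# Two illustrative boolean DataFrames with matching labels.
df1 = pd.DataFrame({'p': [True, True, False, False]})
df2 = pd.DataFrame({'p': [True, False, True, False]})

both = np.asarray(np.logical_and(df1, df2)).ravel().tolist()
either = np.asarray(np.logical_or(df1, df2)).ravel().tolist()
negated = np.asarray(np.logical_not(df1)).ravel().tolist()
exactly_one = np.asarray(np.logical_xor(df1, df2)).ravel().tolist()
```

The four lists are the familiar truth tables for AND, OR, NOT and XOR, computed element by element.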

Bitwise functions and bitwise operators for Booleans

However, in case you have boolean NumPy arrays, Pandas Series, or Pandas DataFrames, you could also use the element-wise bitwise functions np.bitwise_and, np.bitwise_or, np.bitwise_xor and np.invert, or the corresponding operators &, |, ^ and ~ (for booleans they are, or at least should be, indistinguishable from the logical functions).

Typically the operators are used. However, when combined with comparison operators one has to remember to wrap the comparison in parentheses because the bitwise operators have a higher precedence than the comparison operators:

(df1 < 10) | (df2 > 10)  # instead of the wrong df1 < 10 | df2 > 10

This may be irritating because the Python logical operators have a lower precedence than the comparison operators, so you normally write a < 10 and b > 10 (where a and b are for example simple integers) and don't need the parentheses.

Differences between logical and bitwise operations (on non-booleans)

It is really important to stress that bitwise and logical operations are only equivalent for Boolean NumPy arrays (and boolean Series and DataFrames). If these don't contain Booleans then the operations will give different results. I'll include examples using NumPy arrays, but the results will be similar for the pandas data structures:

>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])

>>> np.logical_and(a1, a2)
array([False, False, False,  True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)

And since NumPy (and similarly Pandas) does different things for Boolean (Boolean or "mask" index arrays) and integer (Index arrays) indices, the results of indexing will also be different:

>>> a3 = np.array([1, 2, 3, 4])

>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])
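The same divergence shows up with pandas objects. In this sketch the integer Series s1 and s2 are made up so that every element is non-zero but the two values in each position share no bits:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series([2, 1])

logical = np.logical_and(s1, s2).tolist()  # treats any non-zero value as True
bitwise = (s1 & s2).tolist()               # ANDs the bits: 0b01 & 0b10
```

logical is [True, True] while bitwise is [0, 0], so on integer data & is not a safe stand-in for a logical AND.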

Summary table

Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
       and       |  np.logical_and        | np.bitwise_and         |        &
-------------------------------------------------------------------------------------
       or        |  np.logical_or         | np.bitwise_or          |        |
-------------------------------------------------------------------------------------
                 |  np.logical_xor        | np.bitwise_xor         |        ^
-------------------------------------------------------------------------------------
       not       |  np.logical_not        | np.invert              |        ~

In this table, the logical operators do not work for NumPy arrays, Pandas Series, and pandas DataFrames. The others work on these data structures (and plain Python objects) and work element-wise. However, be careful with the bitwise invert on plain Python bools because the bool will be interpreted as an integer in this context (for example ~False returns -1 and ~True returns -2).
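The caveat about plain Python bools can be seen side by side with the NumPy behaviour. This is only a sketch; the warnings filter is there because, as far as I know, newer Python versions emit a DeprecationWarning for ~ on a plain bool:

```python
import warnings

import numpy as np

flag = True  # a plain Python bool

with warnings.catch_warnings():
    # Newer Python versions may warn that ~ on a plain bool is deprecated.
    warnings.simplefilter("ignore")
    inverted_plain = ~flag          # integer inversion of 1
    inverted_false = ~(not flag)    # integer inversion of 0

inverted_numpy = ~np.bool_(flag)    # boolean inversion
negated = not flag                  # the logical NOT for plain bools
```

inverted_plain is -2 and inverted_false is -1, whereas inverted_numpy and negated are both False. For plain Python values, not is the operator to reach for.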
