Logical operators for Boolean indexing in Pandas
I'm working with a Boolean index in Pandas.
The question is why the statement:
a[(a['some_column']==some_number) & (a['some_other_column']==some_other_number)]
works fine whereas
a[(a['some_column']==some_number) and (a['some_other_column']==some_other_number)]
exits with an error?
Example:
a = pd.DataFrame({'x':[1,1],'y':[10,20]})
In: a[(a['x']==1)&(a['y']==10)]
Out: x y
0 1 10
In: a[(a['x']==1) and (a['y']==10)]
Out: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
When you say
(a['x']==1) and (a['y']==10)
you are implicitly asking Python to convert (a['x']==1) and (a['y']==10) to Boolean values.
NumPy arrays (of length greater than 1) and Pandas objects such as Series do not have a Boolean value. In other words, they raise
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
when used as a Boolean value. That's because it's unclear when it should be True or False. Some users might assume they are True if they have non-zero length, like a Python list. Others might want it to be True only if all its elements are True. Others might want it to be True if any of its elements are True.
Because there are so many conflicting expectations, the designers of NumPy and Pandas refuse to guess, and instead raise a ValueError.
Instead, you must be explicit, by checking the empty attribute or calling the all() or any() method to indicate which behavior you desire.
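As a minimal sketch of what being explicit looks like, the snippet below uses a small made-up Series s (not from the question). It shows that bool() raises, while the empty attribute and the any() and all() methods each answer one specific question.

```python
import pandas as pd

s = pd.Series([True, False, True])  # hypothetical example data

# Asking for a single truth value is ambiguous, so pandas raises.
try:
    bool(s)
    raised = False
except ValueError:
    raised = True

# Being explicit removes the ambiguity.
is_empty = s.empty   # attribute: True only if the Series has no elements
any_true = s.any()   # True if at least one element is True
all_true = s.all()   # True only if every element is True

print(raised, is_empty, any_true, all_true)
```

Note that none of these is what you want for filtering rows; they collapse the whole Series to one value.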
In this case, however, it looks like you do not want Boolean evaluation; you want element-wise logical AND. That is what the & binary operator performs:
(a['x']==1) & (a['y']==10)
returns a boolean array.
By the way, as alexpmil notes, the parentheses are mandatory since & has a higher operator precedence than ==.
Without the parentheses, a['x']==1 & a['y']==10 would be evaluated as a['x'] == (1 & a['y']) == 10, which would in turn be equivalent to the chained comparison (a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10). That is an expression of the form Series and Series. The use of and with two Series would again trigger the same ValueError as above. That's why the parentheses are mandatory.
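Here is a small runnable sketch of that failure mode, reusing the DataFrame a from the question. The variable names error and result are only for illustration.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 1], 'y': [10, 20]})

# Without parentheses, & binds tighter than ==, so Python builds a
# chained comparison and ends up calling bool() on a Series.
try:
    a[a['x'] == 1 & a['y'] == 10]
    error = None
except ValueError as exc:
    error = exc

# With parentheses, each comparison yields a boolean Series first,
# and & then combines them element-wise.
result = a[(a['x'] == 1) & (a['y'] == 10)]

print(type(error).__name__)
print(result)
```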
The logical operators in Pandas are &, | and ~, and parentheses (...) are important!
Python's and, or and not logical operators are designed to work with scalars. So Pandas had to do one better and override the bitwise operators to achieve a vectorized (element-wise) version of this functionality.
So the following in Python (exp1 and exp2 are expressions which evaluate to a boolean result)...
exp1 and exp2 # Logical AND
exp1 or exp2 # Logical OR
not exp1 # Logical NOT
...will translate to...
exp1 & exp2 # Element-wise logical AND
exp1 | exp2 # Element-wise logical OR
~exp1 # Element-wise logical NOT
for pandas.
If you get a ValueError while performing a logical operation, then you need to use parentheses for grouping:
(exp1) op (exp2)
For example,
(df['col1'] == x) & (df['col2'] == y)
And so on.
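To make the translation concrete, here is a short sketch on a made-up DataFrame. The column values are assumptions chosen only so the masks are easy to check by eye.

```python
import pandas as pd

# Hypothetical data, just for illustration
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40]})

both = (df['col1'] > 1) & (df['col2'] < 40)      # element-wise AND
either = (df['col1'] == 1) | (df['col2'] == 40)  # element-wise OR
negated = ~(df['col1'] > 2)                      # element-wise NOT

print(both.tolist())     # [False, True, True, False]
print(either.tolist())   # [True, False, False, True]
print(negated.tolist())  # [True, True, False, False]
```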
Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: & for logical AND, | for logical OR, and ~ for logical NOT.
Consider the following setup:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df
A B C
0 5 0 3
1 3 7 9
2 3 5 2
3 4 7 6
4 8 8 1
Logical AND
For df above, say you'd like to return all rows where A < 5 and B > 5. This is done by computing masks for each condition separately, and ANDing them.
Overloaded Bitwise & Operator
Before continuing, please take note of this particular excerpt of the docs, which states:
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).
So, with this in mind, element-wise logical AND can be implemented with the bitwise operator &:
df['A'] < 5
0 False
1 True
2 True
3 True
4 False
Name: A, dtype: bool
df['B'] > 5
0 False
1 True
2 False
3 True
4 True
Name: B, dtype: bool
(df['A'] < 5) & (df['B'] > 5)
0 False
1 True
2 False
3 True
4 False
dtype: bool
And the subsequent filtering step is simply,
df[(df['A'] < 5) & (df['B'] > 5)]
A B C
1 3 7 9
3 4 7 6
The parentheses are used to override the default precedence order of bitwise operators, which have higher precedence than the comparison operators < and >. See the section on Operator Precedence in the Python docs.
If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as
df['A'] < 5 & df['B'] > 5
It is parsed as
df['A'] < (5 & df['B']) > 5
Which becomes,
df['A'] < something_you_dont_want > 5
Which becomes (see the Python docs on chained operator comparison),
(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)
Which becomes,
# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2
Which throws
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So, don't make that mistake! [1]
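If you want to see the parse for yourself without running any pandas code, one option is the standard library's ast module. This is only a sketch; A and B are placeholder names standing in for df['A'] and df['B'].

```python
import ast

# Parse the unparenthesised expression without evaluating it.
tree = ast.parse("A < 5 & B > 5", mode="eval")
compare = tree.body

# The whole thing is ONE chained comparison: A < (5 & B) > 5
print(type(compare).__name__)                      # Compare
print([type(op).__name__ for op in compare.ops])   # ['Lt', 'Gt']
print(ast.unparse(compare.comparators[0]))         # 5 & B
```

The middle operand is the bitwise expression 5 & B, which is exactly the "something you don't want" from the walkthrough above.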
Avoiding Parentheses Grouping
The fix is actually quite simple. Most operators have a corresponding bound method on Series and DataFrames. If the individual masks are built up using functions instead of conditional operators, you will no longer need to group by parens to specify evaluation order:
df['A'].lt(5)
0 False
1 True
2 True
3 True
4 False
Name: A, dtype: bool
df['B'].gt(5)
0 False
1 True
2 False
3 True
4 True
Name: B, dtype: bool
df['A'].lt(5) & df['B'].gt(5)
0 False
1 True
2 False
3 True
4 False
dtype: bool
See the section on Flexible Comparisons. To summarise, we have
╒════╤════════════╤════════════╕
│ │ Operator │ Function │
╞════╪════════════╪════════════╡
│ 0 │ > │ gt │
├────┼────────────┼────────────┤
│ 1 │ >= │ ge │
├────┼────────────┼────────────┤
│ 2 │ < │ lt │
├────┼────────────┼────────────┤
│ 3 │ <= │ le │
├────┼────────────┼────────────┤
│ 4 │ == │ eq │
├────┼────────────┼────────────┤
│ 5 │ != │ ne │
╘════╧════════════╧════════════╛
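As a quick sanity check of the table, the sketch below compares each operator with its method on column A of the example data. The DataFrame is rebuilt by hand here so the snippet stands alone, and the pairs list is just a helper for this illustration.

```python
import operator
import pandas as pd

df = pd.DataFrame({'A': [5, 3, 3, 4, 8], 'B': [0, 7, 5, 7, 8]})

# (operator function, Series method name) pairs from the table
pairs = [
    (operator.gt, 'gt'), (operator.ge, 'ge'),
    (operator.lt, 'lt'), (operator.le, 'le'),
    (operator.eq, 'eq'), (operator.ne, 'ne'),
]

# Each method should produce the same mask as its operator.
all_match = all(
    op(df['A'], 5).equals(getattr(df['A'], name)(5))
    for op, name in pairs
)

# Method calls bind before &, so no extra parentheses are needed.
mask = df['A'].lt(5) & df['B'].gt(5)
print(all_match, mask.tolist())
```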
Another option for avoiding parentheses is to use DataFrame.query (or eval):
df.query('A < 5 and B > 5')
A B C
1 3 7 9
3 4 7 6
I have extensively documented query and eval in Dynamic Expression Evaluation in pandas using pd.eval().
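A small sketch of both calls follows. The variable limit is a made-up local, referenced inside the query string with the @ prefix, and the DataFrame is the example data typed out by hand.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

limit = 5  # hypothetical local variable, referenced with @ in the string

# query filters the rows directly
filtered = df.query('A < @limit and B > @limit')

# eval only computes the boolean mask; filtering is a separate step
mask = df.eval('A < 5 and B > 5')

print(filtered)
print(mask.tolist())
```

Inside the query string the word and is fine, because pandas parses the expression itself instead of handing a Series to Python's and.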
operator.and_
Allows you to perform this operation in a functional manner. Internally calls Series.__and__, which corresponds to the bitwise operator.
import operator
operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5)
0 False
1 True
2 False
3 True
4 False
dtype: bool
df[operator.and_(df['A'] < 5, df['B'] > 5)]
A B C
1 3 7 9
3 4 7 6
You won't usually need this, but it is useful to know.
Generalizing: np.logical_and (and logical_and.reduce)
Another alternative is using np.logical_and, which also does not need parentheses grouping:
np.logical_and(df['A'] < 5, df['B'] > 5)
0 False
1 True
2 False
3 True
4 False
Name: A, dtype: bool
df[np.logical_and(df['A'] < 5, df['B'] > 5)]
A B C
1 3 7 9
3 4 7 6
np.logical_and is a ufunc (Universal Functions), and most ufuncs have a reduce method. This means it is easier to generalise with logical_and if you have multiple masks to AND. For example, to AND masks m1 and m2 and m3 with &, you would have to do
m1 & m2 & m3
However, an easier option is
np.logical_and.reduce([m1, m2, m3])
This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and ANDing all of them):
import operator
cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]
m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m
# array([False, True, False, True, False])
df[m]
A B C
1 3 7 9
3 4 7 6
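One thing to be aware of: as the output above shows, np.logical_and.reduce on a list of Series hands back a plain NumPy array. If you would rather keep a pandas Series (with its index), a possible alternative is functools.reduce with operator.and_. The sketch below assumes a made-up conditions list of (column, comparison, value) tuples.

```python
import functools
import operator

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

# Hypothetical filter specification: (column, comparison, value)
conditions = [('A', operator.lt, 5), ('B', operator.gt, 5)]
masks = [op(df[col], value) for col, op, value in conditions]

as_array = np.logical_and.reduce(masks)             # NumPy boolean array
as_series = functools.reduce(operator.and_, masks)  # pandas Series, index kept

print(type(as_array).__name__, type(as_series).__name__)
print(df[as_series])
```

Both masks select the same rows here; which one you prefer mostly depends on whether you need the index later.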
1 - I know I'm harping on this point, but please bear with me. This is a very, very common beginner's mistake, and must be explained very thoroughly.
Logical OR
For the df above, say you'd like to return all rows where A == 3 or B == 7.
Overloaded Bitwise |
df['A'] == 3
0 False
1 True
2 True
3 False
4 False
Name: A, dtype: bool
df['B'] == 7
0 False
1 True
2 False
3 True
4 False
Name: B, dtype: bool
(df['A'] == 3) | (df['B'] == 7)
0 False
1 True
2 True
3 True
4 False
dtype: bool
df[(df['A'] == 3) | (df['B'] == 7)]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
If you haven't yet, please also read the section on Logical AND above; all caveats apply here.
Alternatively, this operation can be specified with
df[df['A'].eq(3) | df['B'].eq(7)]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
operator.or_
Calls Series.__or__ under the hood.
operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)
0 False
1 True
2 True
3 True
4 False
dtype: bool
df[operator.or_(df['A'] == 3, df['B'] == 7)]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
np.logical_or
For two conditions, use logical_or:
np.logical_or(df['A'] == 3, df['B'] == 7)
0 False
1 True
2 True
3 True
4 False
Name: A, dtype: bool
df[np.logical_or(df['A'] == 3, df['B'] == 7)]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
For multiple masks, use logical_or.reduce:
np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False, True, True, True, False])
df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]
A B C
1 3 7 9
2 3 5 2
3 4 7 6
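The reduce form pays off when the list of masks is generated. As a rough sketch (the choice of columns and the value 3 are arbitrary), this selects every row where any of the three columns equals 3:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

# One mask per column, then OR them all together in one call.
masks = [df[col] == 3 for col in ['A', 'B', 'C']]
any_three = np.logical_or.reduce(masks)

print(any_three.tolist())  # [True, True, True, False, False]
print(df[any_three])
```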
Logical NOT
Given a mask, such as
mask = pd.Series([True, True, False])
If you need to invert every boolean value (so that the end result is [False, False, True]), then you can use any of the methods below.
Bitwise ~
~mask
0 False
1 False
2 True
dtype: bool
Again, expressions need to be parenthesised.
~(df['A'] == 3)
0 True
1 False
2 False
3 True
4 True
Name: A, dtype: bool
This internally calls
mask.__invert__()
0 False
1 False
2 True
dtype: bool
But don't use it directly.
operator.inv
Internally calls __invert__ on the Series.
operator.inv(mask)
0 False
1 False
2 True
dtype: bool
np.logical_not
This is the numpy variant.
np.logical_not(mask)
0 False
1 False
2 True
dtype: bool
Note that, for boolean masks, np.logical_and can be replaced with np.bitwise_and, logical_or with bitwise_or, and logical_not with invert.
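In practice the most common use of ~ is to select the complement of a filter. A minimal sketch on the example data (typed out by hand so it runs on its own) also shows that Python's not fails on a Series in the same way and and or do:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 3, 3, 4, 8],
                   'B': [0, 7, 5, 7, 8],
                   'C': [3, 9, 2, 6, 1]})

mask = df['A'] == 3

# Python's not asks for a single truth value, which a Series refuses to give.
try:
    not mask
    not_raised = False
except ValueError:
    not_raised = True

kept = df[mask]      # rows where A == 3
dropped = df[~mask]  # every other row

print(not_raised)
print(dropped)
```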
Logical operators for boolean indexing in Pandas
It's important to realize that you cannot use any of the Python logical operators (and, or or not) on pandas.Series or pandas.DataFrames (similarly you cannot use them on numpy.arrays with more than one element). The reason why you cannot use those is because they implicitly call bool on their operands, which throws an Exception because these data structures decided that the boolean of an array is ambiguous:
>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I did cover this more extensively in my answer to the "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" Q+A.
However, NumPy provides element-wise operating equivalents to these operators as functions that can be used on numpy.array, pandas.Series, pandas.DataFrame, or any other (conforming) numpy.array subclass:
and has np.logical_and
or has np.logical_or
not has np.logical_not
numpy.logical_xor has no Python equivalent, but it is a logical "exclusive or" operation
So, essentially, one should use (assuming df1 and df2 are Pandas DataFrames):
np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)
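For a concrete picture, here is a sketch with two tiny boolean DataFrames. The frames df1 and df2 and the column name p are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical boolean frames covering all four input combinations
df1 = pd.DataFrame({'p': [True, True, False, False]})
df2 = pd.DataFrame({'p': [True, False, True, False]})

both = np.logical_and(df1, df2)
either = np.logical_or(df1, df2)
flipped = np.logical_not(df1)
exactly_one = np.logical_xor(df1, df2)

# np.asarray gives a plain array view that is easy to print and compare
print(np.asarray(exactly_one).ravel().tolist())  # [False, True, True, False]
```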
However, in case you have a boolean NumPy array, Pandas Series, or Pandas DataFrame, you could also use the element-wise bitwise functions (for booleans they are, or at least should be, indistinguishable from the logical functions):
np.bitwise_and or the & operator
np.bitwise_or or the | operator
np.invert (or the alias np.bitwise_not) or the ~ operator
np.bitwise_xor or the ^ operator
Typically the operators are used. However, when combined with comparison operators, one has to remember to wrap the comparison in parentheses because the bitwise operators have a higher precedence than the comparison operators:
(df1 < 10) | (df2 > 10) # instead of the wrong df1 < 10 | df2 > 10
This may be irritating because the Python logical operators have a lower precedence than the comparison operators, so you normally write a < 10 and b > 10 (where a and b are for example simple integers) and don't need the parentheses.
It is really important to stress that bit and logical operations are only equivalent for Boolean NumPy arrays (and boolean Series & DataFrames). If these don't contain Booleans then the operations will give different results. I'll include examples using NumPy arrays, but the results will be similar for the pandas data structures:
>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])
>>> np.logical_and(a1, a2)
array([False, False, False, True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)
And since NumPy (and similarly Pandas) does different things for Boolean (Boolean or "mask" index arrays) and integer (Index arrays) indices, the results of indexing will also be different:
>>> a3 = np.array([1, 2, 3, 4])
>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])
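The same difference shows up with integer pandas Series. In this sketch the values are picked (as an assumption, purely for illustration) so that two non-zero numbers share no bits, which is where bitwise and logical AND disagree most visibly.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1, 2, 3])
s2 = pd.Series([0, 2, 1, 3])

bitwise = s1 & s2                 # integer result: the bits both values share
logical = np.logical_and(s1, s2)  # boolean result: are both values non-zero?

print(bitwise.tolist())               # [0, 0, 0, 3]
print([bool(v) for v in logical])     # [False, True, True, True]
```

1 & 2 is 0 because the two numbers have no bits in common, even though both are truthy.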
Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
and | np.logical_and | np.bitwise_and | &
-------------------------------------------------------------------------------------
or | np.logical_or | np.bitwise_or | |
-------------------------------------------------------------------------------------
| np.logical_xor | np.bitwise_xor | ^
-------------------------------------------------------------------------------------
not | np.logical_not | np.invert | ~
The logical operators do not work for NumPy arrays, Pandas Series, and pandas DataFrames. The others work on these data structures (and plain Python objects) and work element-wise. However, be careful with the bitwise invert on plain Python bools because the bool will be interpreted as an integer in this context (for example ~False returns -1 and ~True returns -2).
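A short sketch of that pitfall follows. The variable names are only for illustration, and the warnings filter is there because, as far as I know, newer Python versions emit a DeprecationWarning for ~ on a bool.

```python
import warnings

import numpy as np

flag_true, flag_false = True, False

with warnings.catch_warnings():
    # Silence the possible DeprecationWarning for ~ on a plain bool.
    warnings.simplefilter("ignore", DeprecationWarning)
    py_true = ~flag_true    # treated as the integer 1, so the result is -2
    py_false = ~flag_false  # treated as the integer 0, so the result is -1

np_true = np.invert(flag_true)  # NumPy treats the input as a boolean
negated = not flag_true         # the plain-Python way to negate a bool

print(py_true, py_false, bool(np_true), negated)
```

Both -2 and -1 are truthy, so ~ on a plain bool never behaves like a logical NOT; use not for scalars and ~ for arrays, Series and DataFrames.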