How to create a new column in a df based on multiple conditions?
I have a df with 3 columns, v1, v2, and v3, where
v1=[a,b,c,a]
v2=[d,d,f,n]
v3=[a,k,i,j]
What I would like to do is create new columns based on conditions in columns v1 to v3.
I can do a single condition:
df['v1_a']=np.where(df['v1']=='a',1,0)
This gives a new column named 'v1_a' containing 1/0.
However, if I want to create a new column based on multiple conditions, this does not work:
df['v2_flag']=np.where(df['v2']=='f' or df['v2']=='h',1,0)
How can I accomplish this?
If you combine the conditions with or, you'll get the following ValueError. The error does not come from np.where() itself: or needs a single truth value for each operand, and an array with more than one element does not have one:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
So in your case I suggest using np.logical_or.
df['v2_flag']=np.where(np.logical_or(df['v2']=='f',df['v2']=='h'),1,0)
See the following example too:
>>> a=np.array([2,2,2,5,7,8,1,4,2,3,4,5,6])
>>> np.where(np.logical_or(a==5,a==2),a,0)
array([2, 2, 2, 5, 0, 0, 0, 0, 2, 0, 0, 5, 0])
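Applied to the frame from the question, the same idea might look like the sketch below. The DataFrame is rebuilt from the v1, v2, v3 values listed in the question; the intermediate name mask is only there for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame built from the values given in the question.
df = pd.DataFrame({
    "v1": ["a", "b", "c", "a"],
    "v2": ["d", "d", "f", "n"],
    "v3": ["a", "k", "i", "j"],
})

# np.logical_or combines the two boolean Series element by element.
mask = np.logical_or(df["v2"] == "f", df["v2"] == "h")
df["v2_flag"] = np.where(mask, 1, 0)

print(df["v2_flag"].tolist())  # [0, 0, 1, 0]
```

Only the third row has v2 equal to 'f', and no row has 'h', so a single 1 appears in the flag column.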
In Python, and and or can only give a single result, and modules can't override them for other purposes such as the large row-by-row comparison you're trying to do.
You need to use the symbolic & (and) and | (or), which are normally used for bit-wise operations. pandas has re-purposed these to perform a row-by-row comparison, which makes sense as an analogue of a bit-wise operation. That is more of a happy coincidence though: they were chosen mainly because modules can override them.
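To illustrate why | can be re-purposed while or cannot, here is a minimal sketch with a made-up class. Flags is hypothetical and is not how pandas is implemented; it only shows that Python calls __or__ for |, whereas or simply returns one of its operands.

```python
class Flags:
    """Hypothetical stand-in for an array-like container."""

    def __init__(self, values):
        self.values = list(values)

    def __or__(self, other):
        # Called for `a | b`; free to return an element-wise result.
        return Flags(x or y for x, y in zip(self.values, other.values))


a = Flags([True, False, False])
b = Flags([False, False, True])

print((a | b).values)  # [True, False, True]
print((a or b) is a)   # True: `or` just hands back one of its operands
```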
Because | has higher precedence than ==, you'll need parentheses around each term, or else Python would calculate the | before the ==, which isn't what you want. You can use something like this:
df['v2_flag']=np.where((df['v2']=='f')|(df['v2']=='h'),1,0)
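The precedence issue can be seen with plain integers, without pandas. This is only a small sketch of the parsing rule, using arbitrary numbers:

```python
# | binds tighter than ==, so the unparenthesised form is grouped
# as 1 == (1 | 2) == 2, i.e. the chained comparison 1 == 3 == 2.
without_parens = 1 == 1 | 2 == 2
with_parens = (1 == 1) | (2 == 2)

print(without_parens)  # False
print(with_parens)     # True
```

With Series operands the unparenthesised form is grouped the same way, so it never computes the comparison you intended.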
df['v2']=='f' or df['v2']=='h' raises the ValueError before it gets to np.where. The or causes Python to evaluate df['v2']=='f' and df['v2']=='h' in a boolean context. But a pandas Series, like a NumPy array, refuses to be reduced to a single boolean value; it raises a ValueError instead.
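A short sketch that reproduces the failure without np.where involved at all. The Series s just reuses the v2 values from the question, and the raised flag is only there to record the outcome:

```python
import pandas as pd

s = pd.Series(["d", "d", "f", "n"])  # same values as column v2

try:
    # `or` asks for the truth value of its left operand,
    # which a multi-element Series refuses to give.
    result = (s == "f") or (s == "h")
except ValueError as exc:
    raised = True
    print(type(exc).__name__)  # ValueError
else:
    raised = False
```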
To fix your code, you could use
df['v2_flag'] = np.where( (df['v2']=='f') | (df['v2']=='h'), 1, 0)
The | performs an element-wise bitwise or over the two boolean-valued Series.
Other ways to define df['v2_flag'] include
df['v2_flag'] = ((df['v2']=='f') | (df['v2']=='h')).astype(int)
or
df['v2_flag'] = df['v2'].isin(['f', 'h']).astype(int)
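As a quick check, the three variants can be compared side by side. This sketch assumes a one-column frame holding the v2 values from the question; the names via_where, via_astype and via_isin are just labels for the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"v2": ["d", "d", "f", "n"]})

# Variant 1: np.where over the combined boolean mask.
via_where = np.where((df["v2"] == "f") | (df["v2"] == "h"), 1, 0)
# Variant 2: cast the boolean mask straight to int.
via_astype = ((df["v2"] == "f") | (df["v2"] == "h")).astype(int)
# Variant 3: membership test, which scales better to many values.
via_isin = df["v2"].isin(["f", "h"]).astype(int)

print(via_where.tolist())   # [0, 0, 1, 0]
print(via_astype.tolist())  # [0, 0, 1, 0]
print(via_isin.tolist())    # [0, 0, 1, 0]
```

isin tends to read best once the list of accepted values grows beyond two or three, since it avoids chaining | terms.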