How do I assign values based on multiple conditions for existing columns?
I would like to create a new column with a numerical value based on the following conditions:
a. if gender is male & pet1 == pet2, points = 5
b. if gender is female & (pet1 is 'cat' or pet1 is 'dog'), points = 5
c. all other combinations, points = 0
     gender      pet1      pet2
0      male       dog       dog
1      male       cat       cat
2      male       dog       cat
3    female       cat  squirrel
4    female       dog       dog
5    female  squirrel       cat
6  squirrel       dog       cat
I would like the end result to be as follows:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
How do I accomplish this?
You can do this using np.where. The conditions use the bitwise operators & and | in place of and and or, with parentheses around each condition because of operator precedence. Where the combined condition is true, 5 is returned, and 0 otherwise:
In [29]:
df['points'] = np.where(
    ((df['gender'] == 'male') & (df['pet1'] == df['pet2']))
    | ((df['gender'] == 'female') & df['pet1'].isin(['cat', 'dog'])),
    5, 0)
df
Out[29]:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
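For anyone who wants to run the np.where approach end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and gives the two masks names before combining them; male_match and female_cat_dog are illustrative names, not part of the answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'gender': ['male', 'male', 'male', 'female', 'female', 'female', 'squirrel'],
    'pet1': ['dog', 'cat', 'dog', 'cat', 'dog', 'squirrel', 'dog'],
    'pet2': ['dog', 'cat', 'cat', 'squirrel', 'dog', 'cat', 'cat'],
})

# Each comparison is wrapped in parentheses because & and | bind
# more tightly than == in Python.
male_match = (df['gender'] == 'male') & (df['pet1'] == df['pet2'])
female_cat_dog = (df['gender'] == 'female') & df['pet1'].isin(['cat', 'dog'])

# 5 where either mask is True, 0 everywhere else.
df['points'] = np.where(male_match | female_cat_dog, 5, 0)
print(df)
```

Naming the masks is purely a readability choice; the result is the same as the one-expression version.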
numpy.select
2020 answer
This is a perfect case for np.select, where we can create a column based on multiple conditions. It also stays readable as the number of conditions grows:
conditions = [
    df['gender'].eq('male') & df['pet1'].eq(df['pet2']),
    df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog'])
]
choices = [5, 5]
df['points'] = np.select(conditions, choices, default=0)
print(df)
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
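One detail matters once the choices differ: for each row, np.select uses the choice belonging to the first condition in the list that is true. The sketch below illustrates this with hypothetical point values (5 and 3) and a deliberately overlapping second condition; neither comes from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'gender': ['male', 'male', 'female'],
    'pet1': ['dog', 'dog', 'squirrel'],
    'pet2': ['dog', 'cat', 'cat'],
})

conditions = [
    # Checked first: wins whenever both conditions are true.
    df['gender'].eq('male') & df['pet1'].eq(df['pet2']),
    # Overlaps with the first condition on row 0.
    df['pet1'].isin(['cat', 'dog']),
]
choices = [5, 3]  # hypothetical values, one per condition

df['points'] = np.select(conditions, choices, default=0)
print(df['points'].tolist())  # [5, 3, 0]
```

Row 0 satisfies both conditions and gets 5, row 1 satisfies only the second and gets 3, and row 2 satisfies neither and falls back to the default.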
The apply method described by @RuggeroTurra takes a lot longer for 500k rows. I ended up using something like
df['result'] = ((df.a == 0) & (df.b != 1)).astype(int) * 2 + \
               ((df.a != 0) & (df.b != 1)).astype(int) * 3 + \
               ((df.a == 0) & (df.b == 1)).astype(int) * 4 + \
               ((df.a != 0) & (df.b == 1)).astype(int) * 5
where the apply method took 25 seconds and the method above took about 18 ms.
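The arithmetic above works because the four masks are mutually exclusive and together cover every row, so exactly one term is non-zero per row. A small sketch with made-up values for the columns a and b (the data is hypothetical) makes that visible:

```python
import pandas as pd

# Hypothetical data: one row for each of the four a/b combinations.
df = pd.DataFrame({'a': [0, 1, 0, 1], 'b': [0, 0, 1, 1]})

# Each boolean mask becomes 0/1 via astype(int) and is scaled by its value;
# only one mask is True per row, so the sum picks out a single value.
df['result'] = (
    ((df.a == 0) & (df.b != 1)).astype(int) * 2
    + ((df.a != 0) & (df.b != 1)).astype(int) * 3
    + ((df.a == 0) & (df.b == 1)).astype(int) * 4
    + ((df.a != 0) & (df.b == 1)).astype(int) * 5
)
print(df['result'].tolist())  # [2, 3, 4, 5]
```

If the masks overlapped, the values would add up instead of selecting one, so this trick only suits disjoint conditions.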
You can also use the apply function. For example:
def myfunc(gender, pet1, pet2):
    if gender == 'male' and pet1 == pet2:
        myvalue = 5
    elif gender == 'female' and (pet1 == 'cat' or pet1 == 'dog'):
        myvalue = 5
    else:
        myvalue = 0
    return myvalue
And then use the apply function, setting axis=1:
df['points'] = df.apply(lambda x: myfunc(x['gender'], x['pet1'], x['pet2']), axis=1)
We get:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
One option is case_when from pyjanitor; under the hood it uses pd.Series.mask.
The basic idea is a pairing of condition and expected value; you can pass as many pairings as required, followed by a default value and a target column name:
# pip install pyjanitor
import pandas as pd
import janitor

df.case_when(
    # condition, value
    df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0,  # default
    column_name='points')
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
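To give a feel for the pd.Series.mask idea without installing pyjanitor, here is a rough plain-pandas sketch. It is an assumption about the general shape of the technique, not pyjanitor's actual implementation: start from the default value and overwrite it wherever a condition holds, applying the conditions in reverse order so that the first one listed wins.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'gender': ['male', 'male', 'male', 'female', 'female', 'female', 'squirrel'],
    'pet1': ['dog', 'cat', 'dog', 'cat', 'dog', 'squirrel', 'dog'],
    'pet2': ['dog', 'cat', 'cat', 'squirrel', 'dog', 'cat', 'cat'],
})

cond_male = df.gender.eq('male') & df.pet1.eq(df.pet2)
cond_female = df.gender.eq('female') & df.pet1.isin(['cat', 'dog'])

# Begin with the default, then replace values where each mask is True.
points = pd.Series(0, index=df.index)
points = points.mask(cond_female, 5)  # later condition applied first
points = points.mask(cond_male, 5)    # earlier condition applied last, so it wins

df = df.assign(points=points)
print(df)
```

Here both conditions map to 5, so the order has no visible effect; it would matter only if the values differed and the masks overlapped.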
You could use strings for the conditions, as long as they can be evaluated by pd.eval on the parent dataframe. Note that, speed-wise, this can be slower for small datasets:
df.case_when(
    "gender == 'male' and pet1 == pet2", 5,
    "gender == 'female' and pet1 == ['cat', 'dog']", 5,
    0,
    column_name='points')
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
Anonymous functions are also possible, which can be handy in chained operations:
df.case_when(
    lambda df: df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    lambda df: df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0,  # default
    column_name='points')
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
Writing the conditions as a string expression and evaluating it using eval() is another method to evaluate the condition and assign values to the column using numpy.where().
# evaluate the condition
condition = df.eval("gender=='male' and pet1==pet2 or gender=='female' and pet1==['cat','dog']")
# assign values
df['points'] = np.where(condition, 5, 0)
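The expression above relies on and binding more tightly than or. A variant with explicit parentheses, and with in spelling out the membership test, may be easier to read. The sketch below rebuilds the sample frame so it runs on its own and compares the mask against the bitwise version; treat it as one possible phrasing rather than the answer's exact code.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'gender': ['male', 'male', 'male', 'female', 'female', 'female', 'squirrel'],
    'pet1': ['dog', 'cat', 'dog', 'cat', 'dog', 'squirrel', 'dog'],
    'pet2': ['dog', 'cat', 'cat', 'squirrel', 'dog', 'cat', 'cat'],
})

# Explicit parentheses make the grouping of and/or visible.
condition = df.eval(
    "(gender == 'male' and pet1 == pet2) or "
    "(gender == 'female' and pet1 in ['cat', 'dog'])"
)
df['points'] = np.where(condition, 5, 0)

# Same mask written with bitwise operators, for comparison.
expected = ((df['gender'] == 'male') & (df['pet1'] == df['pet2'])) | (
    (df['gender'] == 'female') & df['pet1'].isin(['cat', 'dog']))
print((condition == expected).all())
```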
If you have a large dataframe (100k+ rows) and a lot of comparisons to evaluate, this method is probably the fastest pandas method to construct a boolean mask. [1]
Another advantage of this method over chained & and/or | operators (used in the other vectorized answers here) is better readability (arguably).
[1]: For a dataframe with 105k rows, if you evaluate 4 conditions, each chaining two comparisons, eval() creates a boolean mask substantially faster than chaining bitwise operators.
df = pd.DataFrame([{'gender': 'male', 'pet1': 'dog', 'pet2': 'dog'}, {'gender': 'male', 'pet1': 'cat', 'pet2': 'cat'}, {'gender': 'male', 'pet1': 'dog', 'pet2': 'cat'},{'gender': 'female', 'pet1': 'cat', 'pet2': 'squirrel'},{'gender': 'female', 'pet1': 'dog', 'pet2': 'dog'},{'gender': 'female', 'pet1': 'squirrel', 'pet2': 'cat'},{'gender': 'squirrel', 'pet1': 'dog', 'pet2': 'cat'}]*15_000)
%timeit np.where(df.eval("gender == 'male' and pet1 == pet2 or gender == 'female' and pet1 == ['cat','dog'] or gender == 'female' and pet2 == ['squirrel','dog'] or pet1 == 'cat' and pet2 == 'cat'"), 5, 0)
# 37.9 ms ± 847 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ) | ( (df['gender'] == 'female') & (df['pet2'].isin(['squirrel','dog'] ) ) ) | ( (df['pet1'] == 'cat') & (df['pet2'] == 'cat') ), 5, 0)
# 53.5 ms ± 1.38 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit np.select([df['gender'].eq('male') & df['pet1'].eq(df['pet2']), df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog']), df['gender'].eq('female') & df['pet2'].isin(['squirrel', 'dog']), df['pet1'].eq('cat') & df['pet2'].eq('cat')], [5,5,5,5], default=0)
# 48.9 ms ± 5.06 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)