How do I assign values based on multiple conditions on existing columns?

Question

I would like to create a new column with a numerical value based on the following conditions:

a. if gender is male & pet1==pet2, points = 5

b. if gender is female & (pet1 is 'cat' or pet1 is 'dog'), points = 5

c. all other combinations, points = 0

    gender    pet1      pet2
0   male      dog       dog
1   male      cat       cat
2   male      dog       cat
3   female    cat       squirrel
4   female    dog       dog
5   female    squirrel  cat
6   squirrel  dog       cat

I would like the end result to be as follows:

    gender    pet1      pet2      points
0   male      dog       dog       5
1   male      cat       cat       5
2   male      dog       cat       0
3   female    cat       squirrel  5
4   female    dog       dog       5
5   female    squirrel  cat       0
6   squirrel  dog       cat       0

How do I accomplish this?

Answer 1

You can do this using np.where. The conditions use the bitwise operators & and | for "and" and "or", with parentheses around each comparison because of operator precedence (& and | bind more tightly than ==). Where the combined condition is true, 5 is returned, and 0 otherwise:

In [29]:
df['points'] = np.where(
    ((df['gender'] == 'male') & (df['pet1'] == df['pet2']))
    | ((df['gender'] == 'female') & df['pet1'].isin(['cat', 'dog'])),
    5, 0)
df

Out[29]:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
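
For readers who want to run this end to end, here is a minimal self-contained sketch of the same np.where approach. The DataFrame construction and the intermediate names male_match and female_cat_dog are my own additions for illustration; splitting the mask into named pieces is just one way to keep the parentheses manageable.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Hypothetical helper names: each mask is a boolean Series.
male_match = (df["gender"] == "male") & (df["pet1"] == df["pet2"])
female_cat_dog = (df["gender"] == "female") & df["pet1"].isin(["cat", "dog"])

# 5 where either mask is True, 0 everywhere else.
df["points"] = np.where(male_match | female_cat_dog, 5, 0)
print(df)
```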

Answer 2

`numpy.select`

2020 answer

This is a perfect case for np.select, which creates a column based on multiple conditions and stays readable as the number of conditions grows:

conditions = [
    df['gender'].eq('male') & df['pet1'].eq(df['pet2']),
    df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog'])
]

choices = [5,5]

df['points'] = np.select(conditions, choices, default=0)

print(df)
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
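
Because both choices are 5 here, the order of the conditions does not matter. When the choices differ, np.select takes the value for the first condition that is True in each row. The sketch below illustrates that with made-up conditions and choices (10 and 5 are hypothetical and not part of the question):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Hypothetical, overlapping conditions to show the priority rule.
conditions = [
    df["pet1"].eq(df["pet2"]),        # checked first
    df["pet1"].isin(["cat", "dog"]),  # only used where the first is False
]
choices = [10, 5]

# Rows matching neither condition get the default.
df["points"] = np.select(conditions, choices, default=0)
print(df)
```

Rows 0, 1 and 4 satisfy both conditions but receive 10, since the first matching condition wins.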

Answer 3

Using apply:

def f(x):
    # x is one row of the dataframe (a Series), because axis=1 is used below
    if x['gender'] == 'male' and x['pet1'] == x['pet2']:
        return 5
    elif x['gender'] == 'female' and (x['pet1'] == 'cat' or x['pet1'] == 'dog'):
        return 5
    else:
        return 0

df['points'] = df.apply(f, axis=1)

Answer 4

The apply method described by @RuggeroTurra takes a lot longer for 500k rows. I ended up using something like:

df['result'] = ((df.a == 0) & (df.b != 1)).astype(int) * 2 + \
               ((df.a != 0) & (df.b != 1)).astype(int) * 3 + \
               ((df.a == 0) & (df.b == 1)).astype(int) * 4 + \
               ((df.a != 0) & (df.b == 1)).astype(int) * 5

The apply method took 25 seconds, while the method above took about 18 ms.
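
The snippet above uses columns a and b from the answerer's own data. Applied to the question's columns, the same idea (cast each boolean mask to int and multiply by the value) might look like the sketch below. The mask names are hypothetical. Adding the terms is safe here only because the two conditions are mutually exclusive (gender cannot be both male and female); with overlapping conditions the values would sum.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Hypothetical mask names; each is a boolean Series.
mask_male = (df["gender"] == "male") & (df["pet1"] == df["pet2"])
mask_female = (df["gender"] == "female") & df["pet1"].isin(["cat", "dog"])

# True/False become 1/0, so each term contributes its value or nothing.
df["points"] = mask_male.astype(int) * 5 + mask_female.astype(int) * 5
print(df)
```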

Answer 5

You can also use the apply function. For example:

def myfunc(gender, pet1, pet2):
    if gender=='male' and pet1==pet2:
        myvalue=5
    elif gender=='female' and (pet1=='cat' or pet1=='dog'):
        myvalue=5
    else:
        myvalue=0
    return myvalue

Then call apply with axis=1 so that the function receives one row at a time:

df['points'] = df.apply(lambda x: myfunc(x['gender'], x['pet1'], x['pet2']), axis=1)

We get:

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

Answer 6

One option is case_when from pyjanitor; under the hood it uses pd.Series.mask.

The basic idea is a pairing of condition and expected value; you can pass as many pairings as required, followed by a default value and a target column name:

# pip install pyjanitor
import pandas as pd
import janitor
df.case_when(
    # condition, value
    df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

You could use strings for the conditions, as long as they can be evaluated by pd.eval on the parent dataframe. Note that, speed-wise, this can be slower for small datasets:

df.case_when(
   "gender == 'male' and pet1 == pet2", 5,
   "gender == 'female' and pet1 == ['cat', 'dog']", 5,
   0,
   column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

Anonymous functions are also possible, which can be handy in chained operations:

df.case_when(
    lambda df: df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    lambda df: df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
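
If installing pyjanitor is not an option, a similar result can be sketched in plain pandas with Series.mask, which this answer names as the underlying mechanism. This is my own approximation and not pyjanitor's actual implementation: start from a Series holding the default value, then overwrite it wherever each condition holds. The variable names are hypothetical.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

cond_male = df["gender"].eq("male") & df["pet1"].eq(df["pet2"])
cond_female = df["gender"].eq("female") & df["pet1"].isin(["cat", "dog"])

# Start with the default (0) and replace values where a condition is True.
points = pd.Series(0, index=df.index)
points = points.mask(cond_male, 5).mask(cond_female, 5)

# assign returns a new dataframe, so this also works inside a method chain.
result = df.assign(points=points)
print(result)
```

With chained mask calls, a later condition overwrites an earlier one where both hold. That makes no difference here because both values are 5 and the conditions cannot overlap.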

Answer 7

Another approach is to write the conditions as a string expression, evaluate it with eval(), and assign values to the column using numpy.where().

# evaluate the condition 
condition = df.eval("gender=='male' and pet1==pet2 or gender=='female' and pet1==['cat','dog']")
# assign values
df['points'] = np.where(condition, 5, 0)
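
As a quick sanity check, the sketch below builds the mask both ways on the question's data and confirms that they agree. Inside DataFrame.eval, and/or are evaluated elementwise (with and binding more tightly than or, as in Python), and == against a list behaves like isin. The sample DataFrame and the variable names are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Mask from a string expression.
mask_eval = df.eval(
    "gender == 'male' and pet1 == pet2 "
    "or gender == 'female' and pet1 == ['cat', 'dog']"
)

# The same mask written with bitwise operators, for comparison.
mask_bitwise = (
    ((df["gender"] == "male") & (df["pet1"] == df["pet2"]))
    | ((df["gender"] == "female") & df["pet1"].isin(["cat", "dog"]))
)

df["points"] = np.where(mask_eval, 5, 0)
print(df)
```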

If you have a large dataframe (100k+ rows) and a lot of comparisons to evaluate, this method is probably the fastest pandas method to construct a boolean mask.¹

Another advantage of this method over chained & and/or | operators (used in the other vectorized answers here) is, arguably, better readability.

¹ For a dataframe with 105k rows, if you evaluate 4 conditions that each chain two comparisons, eval() creates a boolean mask faster than chaining bitwise operators (about 38 ms versus roughly 49 to 54 ms in the timings below).

df = pd.DataFrame([{'gender': 'male', 'pet1': 'dog', 'pet2': 'dog'}, {'gender': 'male', 'pet1': 'cat', 'pet2': 'cat'}, {'gender': 'male', 'pet1': 'dog', 'pet2': 'cat'},{'gender': 'female', 'pet1': 'cat', 'pet2': 'squirrel'},{'gender': 'female', 'pet1': 'dog', 'pet2': 'dog'},{'gender': 'female', 'pet1': 'squirrel', 'pet2': 'cat'},{'gender': 'squirrel', 'pet1': 'dog', 'pet2': 'cat'}]*15_000)

%timeit np.where(df.eval("gender == 'male' and pet1 == pet2 or gender == 'female' and pet1 == ['cat','dog'] or gender == 'female' and pet2 == ['squirrel','dog'] or pet1 == 'cat' and pet2 == 'cat'"), 5, 0)
# 37.9 ms ± 847 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ) | ( (df['gender'] == 'female') & (df['pet2'].isin(['squirrel','dog'] ) ) ) | ( (df['pet1'] == 'cat') & (df['pet2'] == 'cat') ), 5, 0)
# 53.5 ms ± 1.38 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit np.select([df['gender'].eq('male') & df['pet1'].eq(df['pet2']), df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog']), df['gender'].eq('female') & df['pet2'].isin(['squirrel', 'dog']), df['pet1'].eq('cat') & df['pet2'].eq('cat')], [5,5,5,5], default=0)
# 48.9 ms ± 5.06 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

7 answers

Answer 1
41 accepted 2015-06-03 23:04:43

Answer 2
31 2020-02-16 01:52:44

Answer 3
19 2015-06-03 22:54:46

Answer 4
5 2018-05-14 21:51:13

Answer 5
4 2020-05-11 12:50:23

Answer 6
0 2022-03-24 10:46:15

Answer 7
0 2022-09-15 09:05:02
