How do I assign values based on multiple conditions for existing columns?

I would like to create a new column with a numerical value based on the following conditions:

a. if gender is male & pet1==pet2, points = 5

b. if gender is female & (pet1 is 'cat' or pet1 is 'dog'), points = 5

c. all other combinations, points = 0

    gender    pet1      pet2
0   male      dog       dog
1   male      cat       cat
2   male      dog       cat
3   female    cat       squirrel
4   female    dog       dog
5   female    squirrel  cat
6   squirrel  dog       cat

I would like the end result to be as follows:

    gender    pet1      pet2      points
0   male      dog       dog       5
1   male      cat       cat       5
2   male      dog       cat       0
3   female    cat       squirrel  5
4   female    dog       dog       5
5   female    squirrel  cat       0
6   squirrel  dog       cat       0

How do I accomplish this?

You can do this using np.where. The conditions use the bitwise operators & and | for "and" and "or", with parentheses around each comparison because of operator precedence. Where the combined condition is true, 5 is returned, and 0 otherwise:

In [29]:
df['points'] = np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ), 5, 0)
df

Out[29]:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
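The parentheses matter because & binds more tightly than == in Python, so an unparenthesized comparison would be grouped differently from what is intended. As a minimal sketch (the variable names male_same_pet and female_cat_or_dog are illustrative, not from the answer), the same logic can be written with one named mask per rule, which keeps the grouping explicit:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# One boolean mask per rule; each comparison is wrapped in parentheses
# so that it is evaluated before the bitwise & combines the results.
male_same_pet = (df["gender"] == "male") & (df["pet1"] == df["pet2"])
female_cat_or_dog = (df["gender"] == "female") & df["pet1"].isin(["cat", "dog"])

# 5 where either rule holds, 0 everywhere else
df["points"] = np.where(male_same_pet | female_cat_or_dog, 5, 0)
print(df)
```

Naming the masks also makes it easy to inspect each rule on its own when the result is not what you expect.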

numpy.select

2020 answer

This is a perfect case for np.select, which creates a column from multiple conditions and stays readable as the number of conditions grows:

conditions = [
    df['gender'].eq('male') & df['pet1'].eq(df['pet2']),
    df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog'])
]

choices = [5,5]

df['points'] = np.select(conditions, choices, default=0)

print(df)
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
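np.select becomes more useful once the choices differ. When several conditions are true for the same row, the first matching one in the list wins. The sketch below illustrates this with made-up point values (10 and 5) and deliberately overlapping conditions; these rules are hypothetical and are not the ones from the question:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Hypothetical, overlapping rules: a row with pet1 == pet2 == 'dog'
# satisfies both conditions, and the earlier one takes priority.
conditions = [
    df["pet1"].eq(df["pet2"]),          # checked first
    df["pet1"].isin(["cat", "dog"]),    # only used where the first is False
]
choices = [10, 5]

df["points"] = np.select(conditions, choices, default=0)
print(df)
```

Ordering the conditions from most to least specific is therefore part of the logic, much like the order of branches in an if/elif chain.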

Using apply:

def f(x):
    # x is one row of the DataFrame (a Series indexed by column name)
    if x['gender'] == 'male' and x['pet1'] == x['pet2']:
        return 5
    elif x['gender'] == 'female' and x['pet1'] in ('cat', 'dog'):
        return 5
    else:
        return 0

df['points'] = df.apply(f, axis=1)

The apply method described by @RuggeroTurra takes a lot longer for 500k rows. I ended up using something like

df['result'] = ((df.a == 0) & (df.b != 1)).astype(int) * 2 + \
               ((df.a != 0) & (df.b != 1)).astype(int) * 3 + \
               ((df.a == 0) & (df.b == 1)).astype(int) * 4 + \
               ((df.a != 0) & (df.b == 1)).astype(int) * 5 

where the apply method took 25 seconds and the method above took about 18 ms.
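The arithmetic above works because each boolean mask becomes 0 or 1 after astype(int), and the four masks are mutually exclusive, so exactly one term is non-zero in every row. A small runnable sketch follows; the sample values for the columns a and b are made up to cover all four cases:

```python
import pandas as pd

# Hypothetical sample data: one row for each (a, b) case the masks distinguish
df = pd.DataFrame({"a": [0, 1, 0, 1], "b": [0, 0, 1, 1]})

# Each mask is cast to 0/1 and scaled by the value it should contribute.
# Because the masks never overlap, the sum picks exactly one value per row.
df["result"] = (
    ((df.a == 0) & (df.b != 1)).astype(int) * 2
    + ((df.a != 0) & (df.b != 1)).astype(int) * 3
    + ((df.a == 0) & (df.b == 1)).astype(int) * 4
    + ((df.a != 0) & (df.b == 1)).astype(int) * 5
)
print(df)
```

If two masks could be true for the same row, their values would be added together, so this trick is only safe when the conditions cannot overlap.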

You can also use the apply function. For example:

def myfunc(gender, pet1, pet2):
    if gender=='male' and pet1==pet2:
        myvalue=5
    elif gender=='female' and (pet1=='cat' or pet1=='dog'):
        myvalue=5
    else:
        myvalue=0
    return myvalue

Then call apply with axis=1 so that the function is evaluated once per row:

df['points'] = df.apply(lambda x: myfunc(x['gender'], x['pet1'], x['pet2']), axis=1)

We get:

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

One option is case_when from pyjanitor; under the hood it uses pd.Series.mask.

The basic idea is a pairing of condition and expected value; you can pass as many pairings as required, followed by a default value and a target column name:

# pip install pyjanitor
import pandas as pd
import janitor
df.case_when(
    # condition, value
    df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
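To give a feel for what building on pd.Series.mask can look like, here is a rough sketch in plain pandas. It is an illustration of the idea only and is not pyjanitor's actual implementation; the variable names are assumptions:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

male_same_pet = df["gender"].eq("male") & df["pet1"].eq(df["pet2"])
female_cat_or_dog = df["gender"].eq("female") & df["pet1"].isin(["cat", "dog"])

# Start from the default value, then overwrite rows where a condition holds.
# Series.mask(cond, other) replaces values where cond is True.
# Conditions are applied in reverse priority order, so the highest-priority
# condition is written last and wins if two conditions overlap.
points = pd.Series(0, index=df.index)
points = points.mask(female_cat_or_dog, 5)
points = points.mask(male_same_pet, 5)

result = df.assign(points=points)
print(result)
```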

You could use strings for the conditions, as long as they can be evaluated by pd.eval on the parent dataframe. Note that, speed-wise, this can be slower for small datasets:

df.case_when(
   "gender == 'male' and pet1 == pet2", 5,
   "gender == 'female' and pet1 == ['cat', 'dog']", 5,
   0,
   column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0

Anonymous functions are also possible, which can be handy in chained operations:

df.case_when(
    lambda df: df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    lambda df: df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')

     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
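The reason callables help in a chain is that the intermediate dataframe has no name to refer to, so the lambda receives it as its argument. The sketch below shows the same idea in plain pandas, using DataFrame.loc and DataFrame.assign instead of pyjanitor; the filtering step and the name raw are added purely for illustration:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
raw = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

result = (
    raw
    # Illustrative filter: the filtered frame exists only inside the chain
    .loc[lambda d: d["gender"].isin(["male", "female"])]
    # The lambda receives that filtered frame as d
    .assign(points=lambda d: np.where(
        (d["gender"].eq("male") & d["pet1"].eq(d["pet2"]))
        | (d["gender"].eq("female") & d["pet1"].isin(["cat", "dog"])),
        5, 0))
)
print(result)
```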

Another method is to write the conditions as a string expression, evaluate it with eval(), and assign values to the column using numpy.where().

# evaluate the condition 
condition = df.eval("gender=='male' and pet1==pet2 or gender=='female' and pet1==['cat','dog']")
# assign values
df['points'] = np.where(condition, 5, 0)

If you have a large dataframe (100k+ rows) and a lot of comparisons to evaluate, this method is probably the fastest pandas method to construct a boolean mask. [1]

Another advantage of this method over chained & and/or | operators (used in the other vectorized answers here) is, arguably, better readability.
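A quick way to gain confidence in a string expression is to compare its mask with the one built from chained operators. The sketch below does that on the question's sample data; it relies on eval treating a comparison against a list (pet1 == ['cat','dog']) as a membership test, and the names mask_eval and mask_bitwise are illustrative:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Mask from the string expression
mask_eval = df.eval(
    "gender == 'male' and pet1 == pet2 "
    "or gender == 'female' and pet1 == ['cat', 'dog']"
)

# Equivalent mask from chained bitwise operators
mask_bitwise = (
    ((df["gender"] == "male") & (df["pet1"] == df["pet2"]))
    | ((df["gender"] == "female") & df["pet1"].isin(["cat", "dog"]))
)

# Both masks should select the same rows
same = bool((mask_eval == mask_bitwise).all())
print(same)

df["points"] = np.where(mask_eval, 5, 0)
print(df)
```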


[1] For a dataframe with 105k rows, if you evaluate 4 conditions that each chain two comparisons, eval() creates a boolean mask substantially faster than chaining bitwise operators.

df = pd.DataFrame([{'gender': 'male', 'pet1': 'dog', 'pet2': 'dog'}, {'gender': 'male', 'pet1': 'cat', 'pet2': 'cat'}, {'gender': 'male', 'pet1': 'dog', 'pet2': 'cat'},{'gender': 'female', 'pet1': 'cat', 'pet2': 'squirrel'},{'gender': 'female', 'pet1': 'dog', 'pet2': 'dog'},{'gender': 'female', 'pet1': 'squirrel', 'pet2': 'cat'},{'gender': 'squirrel', 'pet1': 'dog', 'pet2': 'cat'}]*15_000)

%timeit np.where(df.eval("gender == 'male' and pet1 == pet2 or gender == 'female' and pet1 == ['cat','dog'] or gender == 'female' and pet2 == ['squirrel','dog'] or pet1 == 'cat' and pet2 == 'cat'"), 5, 0)
# 37.9 ms ± 847 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ) | ( (df['gender'] == 'female') & (df['pet2'].isin(['squirrel','dog'] ) ) ) | ( (df['pet1'] == 'cat') & (df['pet2'] == 'cat') ), 5, 0)
# 53.5 ms ± 1.38 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit np.select([df['gender'].eq('male') & df['pet1'].eq(df['pet2']), df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog']), df['gender'].eq('female') & df['pet2'].isin(['squirrel', 'dog']), df['pet1'].eq('cat') & df['pet2'].eq('cat')], [5,5,5,5], default=0)
# 48.9 ms ± 5.06 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
