Applying a series of conditions to a dataframe. Pandas

Question

I usually use np.where when applying multiple conditions to a dataframe, and I am comfortable with it. I would like to improve my code, where the same condition is repeated in every set of conditions passed to np.where, but I do not know how to do this in the simplest (clear and concise) way, using either (1) .loc or (2) IF "condition" DO "apply other conditions".

Example:

I need to select only the rows where "Date" satisfies a condition (e.g. > 20200201) and, only for these rows, calculate a new column by applying another set of conditions (e.g. condition 1: A > 20 and B > 20; condition 2: A == 30 and B == 10; condition 3: A == 20 and B >= 10, etc.).

My question: what is the best way to make the first selection (Date > 20200201) only once, so that Date > 20200201 is not repeated in every line, and avoid this:

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [1, 3, 2, 2, 3, 1, 3, 2],
                   "Date": [20200109, 20200204, 20200307, 20200216, 20200107, 20200108, 20200214, 20200314],
                   "A": [20, 10, 40, 40, 10, 20, 40, 30],
                   "B": [20, 30, 40, 50, 20, 30, 20, 10]})

df['new']=np.nan
df['new']=np.where((df['Date']>20200201) & (df['A']>20) & (df['B']>20), 'value', df['new'])
df['new']=np.where((df['Date']>20200201) & (df['A']==30) & (df['B']==10), 'value', df['new'])
df['new']=np.where((df['Date']>20200201) & (df['A']==20) & (df['B']>=10), 'value', df['new'])

Answer 1

Looks like you can use np.select:

s1 = df.Date <= 20200201
s2 = (df['A'] > 20) & df['B'].gt(20)
s3 = df['A'].eq(30) & df['B'].eq(10)
s4 = df['A'].eq(20) & df['B'].ge(10)

df['new'] = np.select( (s1,s2|s3|s4), (np.nan, 'value'), np.nan)

Output:

   ID      Date   A   B    new
0   1  20200109  20  20    nan
1   3  20200204  10  30    nan
2   2  20200307  40  40  value
3   2  20200216  40  50  value
4   3  20200107  10  20    nan
5   1  20200108  20  30    nan
6   3  20200214  40  20    nan
7   2  20200314  30  10  value
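The output above prints nan in lowercase, which suggests the missing entries ended up as the text 'nan' rather than a real NaN. If real missing values are wanted, one possible variant (a sketch, not part of the original answer) is to write the date filter once as a boolean mask and combine it with the other conditions. The names `recent` and `cond` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 3, 2, 2, 3, 1, 3, 2],
    "Date": [20200109, 20200204, 20200307, 20200216,
             20200107, 20200108, 20200214, 20200314],
    "A": [20, 10, 40, 40, 10, 20, 40, 30],
    "B": [20, 30, 40, 50, 20, 30, 20, 10],
})

# The date filter is written once and reused (hypothetical name).
recent = df["Date"] > 20200201

# The three alternative conditions from the question, OR-ed together.
cond = (
    ((df["A"] > 20) & (df["B"] > 20))
    | ((df["A"] == 30) & (df["B"] == 10))
    | ((df["A"] == 20) & (df["B"] >= 10))
)

# Start from a column filled with "value" and keep it only where both
# masks hold; Series.where puts a missing value everywhere else.
df["new"] = pd.Series("value", index=df.index).where(recent & cond)
print(df)
```

With this variant `df["new"].isna()` can be used to find the rows that were not matched, which does not work when the column holds the string 'nan'.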

Answer 2

It is probably not the fastest solution, but its advantages are readability and easy maintenance in the future.

Find the rows in question using query and take the indices of these rows:
```
ind = df.query('Date > 20200201 and (A > 20 and B > 20 or '
               'A == 30 and B == 10 or A == 20 and B >= 10)').index
```
Save the new value in the new column, in the indicated rows:
```
df.loc[ind, 'new'] = 'value'
df
```

Other values in this column remain NaN.

If something in the above condition changes in the future, it is quite easy and intuitive to correct it.
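As a rough sketch of how such changes could be kept small, the query string can be assembled from a list of alternatives, so that adding or editing a condition touches one list entry. This is only one possible arrangement; the names `date_filter`, `alternatives` and `any_of` are hypothetical and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 3, 2, 2, 3, 1, 3, 2],
    "Date": [20200109, 20200204, 20200307, 20200216,
             20200107, 20200108, 20200214, 20200314],
    "A": [20, 10, 40, 40, 10, 20, 40, 30],
    "B": [20, 30, 40, 50, 20, 30, 20, 10],
})

# The shared filter appears once; each alternative is one list entry.
date_filter = "Date > 20200201"
alternatives = [
    "A > 20 and B > 20",
    "A == 30 and B == 10",
    "A == 20 and B >= 10",
]

# Parenthesize every alternative so the and/or precedence stays explicit.
any_of = " or ".join(f"({alt})" for alt in alternatives)
expr = f"{date_filter} and ({any_of})"

# Same two steps as in the answer: find the index, then assign via .loc.
ind = df.query(expr).index
df.loc[ind, "new"] = "value"
print(df)
```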

So unless your data volume is very large and the execution time becomes prohibitive, this solution is worth considering.
