Applying a series of conditions to a dataframe in pandas
I have been using np.where to apply multiple conditions to a dataframe and I am comfortable with it. I would like to improve code where the same condition is repeated in each set of conditions passed to np.where, but I do not know the simplest (clear and concise) way to do it, either with (1) .loc or (2) IF "condition" DO "apply other conditions".
Example:
I need to select only the rows where "Date" meets a condition (e.g. > 20200201) and, only for these rows, calculate a new column by applying another set of different conditions (e.g. condition 1: A > 20 and B > 20; condition 2: A == 30 and B == 10; condition 3: A == 20 and B >= 10, etc.).
My question: what is the best way to make the first selection (Date > 20200201) once, so that Date > 20200201 is not repeated in every line, and avoid this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [1,3,2,2,3,1,3,2],
"Date": [20200109, 20200204, 20200307, 20200216, 20200107, 20200108, 20200214, 20200314],
"A": [20,10,40,40,10,20, 40,30],
"B": [20,30,40,50,20, 30, 20, 10]})
df['new']=np.nan
df['new']=np.where((df['Date']>20200201) & (df['A']>20) & (df['B']>20), 'value', df['new'])
df['new']=np.where((df['Date']>20200201) & (df['A']==30) & (df['B']==10), 'value', df['new'])
df['new']=np.where((df['Date']>20200201) & (df['A']==20) & (df['B']>=10), 'value', df['new'])
Looks like you can use np.select:
s1 = df.Date <= 20200201
s2 = (df['A'] > 20) & df['B'].gt(20)
s3 = df['A'].eq(30) & df['B'].eq(10)
s4 = df['A'].eq(20) & df['B'].ge(10)
df['new'] = np.select( (s1,s2|s3|s4), (np.nan, 'value'), np.nan)
Output:
ID Date A B new
0 1 20200109 20 20 nan
1 3 20200204 10 30 nan
2 2 20200307 40 40 value
3 2 20200216 40 50 value
4 3 20200107 10 20 nan
5 1 20200108 20 30 nan
6 3 20200214 40 20 nan
7 2 20200314 30 10 value
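The same idea can also be written with .loc, the first option the question mentions. The following is a minimal sketch rather than a definitive implementation: the names date_ok, rule1, rule2 and rule3 are made up for illustration, and it assumes that real missing values are wanted in the untouched rows instead of the string 'nan' that the output above appears to show.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 3, 2, 2, 3, 1, 3, 2],
    "Date": [20200109, 20200204, 20200307, 20200216,
             20200107, 20200108, 20200214, 20200314],
    "A": [20, 10, 40, 40, 10, 20, 40, 30],
    "B": [20, 30, 40, 50, 20, 30, 20, 10],
})

# The shared condition is written exactly once (illustrative name).
date_ok = df["Date"] > 20200201

# The per-rule conditions, without the date check (illustrative names).
rule1 = (df["A"] > 20) & (df["B"] > 20)
rule2 = (df["A"] == 30) & (df["B"] == 10)
rule3 = (df["A"] == 20) & (df["B"] >= 10)

# Start from an object column of real NaN values, then fill only the
# rows that pass the date check and at least one of the rules.
df["new"] = pd.Series(np.nan, index=df.index, dtype="object")
df.loc[date_ok & (rule1 | rule2 | rule3), "new"] = "value"

print(df)
```

Because the date mask is combined with the rules in a single place, changing the date threshold or adding a rule touches one line only.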
It is probably not the fastest solution, but its advantages are readability and easy maintenance in the future.
Find the rows in question using query, and take the indices of these rows:
ind = df.query('Date > 20200201 and (A > 20 and B > 20 or '
               'A == 30 and B == 10 or A == 20 and B >= 10)').index
Save the new value in the new column, in the indicated rows:
df.loc[ind, 'new'] = 'value'; df
Other values in this column remain NaN.
If something in the above condition changes in the future, it is quite easy and intuitive to correct it.
So unless your data volume is very big and the execution time is prohibitively long, this solution is worth considering.
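If the date threshold is likely to change, the query approach can take it from a Python variable through the @ prefix that DataFrame.query supports. This is only a sketch under assumptions: min_date and rules are hypothetical names, and the "new" column is assumed not to exist yet, so that .loc creates it and leaves the other rows missing.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 3, 2, 2, 3, 1, 3, 2],
    "Date": [20200109, 20200204, 20200307, 20200216,
             20200107, 20200108, 20200214, 20200314],
    "A": [20, 10, 40, 40, 10, 20, 40, 30],
    "B": [20, 30, 40, 50, 20, 30, 20, 10],
})

# Hypothetical names: the threshold and the rule set live in one place each.
min_date = 20200201
rules = "(A > 20 and B > 20) or (A == 30 and B == 10) or (A == 20 and B >= 10)"

# "@min_date" refers to the Python variable defined above; the date check
# appears once and is combined with the whole rule set.
ind = df.query(f"Date > @min_date and ({rules})").index

# Assigning through .loc creates the "new" column; rows outside `ind`
# are left as missing values.
df.loc[ind, "new"] = "value"

print(df)
```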