
Applying a series of conditions to a DataFrame in pandas

I usually use np.where when applying multiple conditions to a DataFrame, and I am comfortable with it. I would like to improve code in which the same condition is repeated in every set of conditions passed to np.where, but I do not know the simplest (clear and concise) way to do it, either with (1) .loc or (2) an "if condition, then apply the other conditions" construct.

Example:

I need to select only the rows where "Date" satisfies a condition (e.g. > 20200201) and, for those rows only, calculate a new column by applying a set of further conditions (e.g. condition 1: A > 20 and B > 20; condition 2: A == 30 and B == 10; condition 3: A == 20 and B >= 10, etc.).

My question: what is the best way to make the first selection (Date > 20200201) once, so that Date > 20200201 is not repeated on every line as it is here:

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [1, 3, 2, 2, 3, 1, 3, 2],
                   "Date": [20200109, 20200204, 20200307, 20200216, 20200107, 20200108, 20200214, 20200314],
                   "A": [20, 10, 40, 40, 10, 20, 40, 30],
                   "B": [20, 30, 40, 50, 20, 30, 20, 10]})

df['new'] = np.nan
df['new'] = np.where((df['Date'] > 20200201) & (df['A'] > 20) & (df['B'] > 20), 'value', df['new'])
df['new'] = np.where((df['Date'] > 20200201) & (df['A'] == 30) & (df['B'] == 10), 'value', df['new'])
df['new'] = np.where((df['Date'] > 20200201) & (df['A'] == 20) & (df['B'] >= 10), 'value', df['new'])

Looks like you can use np.select:

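# s1 marks the rows to exclude: the date is on or before the cutoff
s1 = df['Date'] <= 20200201
# s2..s4 are the three value conditions from the question
s2 = (df['A'] > 20) & df['B'].gt(20)
s3 = df['A'].eq(30) & df['B'].eq(10)
s4 = df['A'].eq(20) & df['B'].ge(10)

# np.select uses the first condition that matches, so s1 is tested first
df['new'] = np.select((s1, s2 | s3 | s4), (np.nan, 'value'), np.nan)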

Output:

   ID      Date   A   B    new
0   1  20200109  20  20    nan
1   3  20200204  10  30    nan
2   2  20200307  40  40  value
3   2  20200216  40  50  value
4   3  20200107  10  20    nan
5   1  20200108  20  30    nan
6   3  20200214  40  20    nan
7   2  20200314  30  10  value
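
In the output above, `nan` appears to be the string 'nan' rather than a missing value, because np.nan and 'value' end up in one string array. A minimal sketch of one way around that, assuming each condition should get its own label: run np.select with string choices only, then blank out the unwanted rows with Series.where. The labels "c1", "c2" and "c3" and the name in_range are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 3, 2, 2, 3, 1, 3, 2],
    "Date": [20200109, 20200204, 20200307, 20200216,
             20200107, 20200108, 20200214, 20200314],
    "A": [20, 10, 40, 40, 10, 20, 40, 30],
    "B": [20, 30, 40, 50, 20, 30, 20, 10],
})

# The date filter is written exactly once.
in_range = df["Date"] > 20200201

conditions = [
    (df["A"] > 20) & (df["B"] > 20),
    (df["A"] == 30) & (df["B"] == 10),
    (df["A"] == 20) & (df["B"] >= 10),
]
labels = ["c1", "c2", "c3"]  # hypothetical labels, one per condition

# Only strings go into np.select, so no string/float mixing happens here.
picked = np.select(conditions, labels, default="")

# Keep a label only where the date passes and some condition matched;
# every other row becomes a real missing value.
df["new"] = pd.Series(picked, index=df.index).where(in_range & (picked != ""))
```

Because the rows outside the date range are dropped in the final step, the order of the conditions only matters among the three value conditions themselves.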

This is probably not the fastest solution, but it has the advantage of being readable and easy to maintain.

  1. Use query to find the rows in question and take their index:

     ind = df.query('Date > 20200201 and (A > 20 and B > 20 or '
                    'A == 30 and B == 10 or A == 20 and B >= 10)').index

  2. Save the new value in the new column, in the indicated rows:

     df.loc[ind, 'new'] = 'value'

Other values in this column remain NaN.

If the condition above changes in the future, correcting it is easy and intuitive.

So unless your data volume is very large and the execution time becomes prohibitive, this solution is worth considering.
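
If the cutoff date is the part most likely to change, the two steps above could be wrapped in a small helper. This is only a sketch: the function name flag_rows and its parameters cutoff and label are hypothetical. It relies on query's `@name` syntax, which refers to a Python variable in the calling scope, and it adds explicit parentheses around each condition for readability.

```python
import pandas as pd


def flag_rows(df, cutoff, label="value"):
    """Return a copy of df with `label` in column 'new' for matching rows."""
    # @cutoff is resolved from this function's local variables by query().
    expr = (
        "Date > @cutoff and ("
        "(A > 20 and B > 20) or "
        "(A == 30 and B == 10) or "
        "(A == 20 and B >= 10))"
    )
    ind = df.query(expr).index
    out = df.copy()               # leave the caller's frame untouched
    out.loc[ind, "new"] = label   # rows not in ind stay NaN
    return out


df = pd.DataFrame({
    "ID": [1, 3, 2, 2, 3, 1, 3, 2],
    "Date": [20200109, 20200204, 20200307, 20200216,
             20200107, 20200108, 20200214, 20200314],
    "A": [20, 10, 40, 40, 10, 20, 40, 30],
    "B": [20, 30, 40, 50, 20, 30, 20, 10],
})

result = flag_rows(df, cutoff=20200201)
```

Changing the date filter then means changing one argument, and the condition string itself stays untouched.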
