简体   繁体   中英

Python pandas and numpy: assign numerical values to new variable based on multiple conditions for existing variables

This is trivial in Excel, why is this so hard in Python?

The goal is to compute the state variable based on several conditions, including the previous value of the state variable as well. Value is the known integer, Min(3) and Max(3) are simply the minimum and maximum value of a 3 period rolling window shifted forward by one period. This is how far I got.

Index Value Max(3) Min(3) 
0     10    nan    nan    
1     20    nan    nan        
2     15    nan    nan         
3     25    20     10    
4     15    25     15     
5     10    25     15     
6     15    20     10         

What is the best way to calculate the state variable based on the following conditions:

  • a) If Value > Max(3) then 1
  • b) If Value < Min(3) then 4
  • c) If Value <= Max(3) & Value >= Min (3) & previous State = 1 or 2 then 2
  • d) If Value <= Max(3) & Value >= Min (3) & previous State = 4 or 3 then 3

Should look like this in final DataFrame:

Index Value Max(3) Min(3) State
0     10    nan    nan    nan
1     20    nan    nan    nan    
2     15    nan    nan    nan     
3     25    20     10     1
4     15    25     15     2
5     10    25     15     4
6     15    20     10     3    

I have mostly tried this using np.where() funtion but always run into problems once I approach c) and d) conditions.

You can use this:

df.loc[df.Value.gt(df['Max(3)']), 'State'] = 1
df.loc[df.Value.lt(df['Min(3)']), 'State'] = 4
df.loc[df.Value.between(df['Min(3)'], df['Max(3)']) & (df.State.shift(1).isin((3, 4))), 'State'] = 3
df.loc[df.Value.between(df['Min(3)'], df['Max(3)']) & (df.State.shift(1).isin((1,2))), 'State'] = 2

output:
在此处输入图像描述

Explaination:
the first statement checks where df.Value is greater than df['Max(3)'] and creates a new column 'State' filled with NaNs, and only 1s on the location of the condition

the seconds line sets 4s where df.Value is smaller than df.['Min(3)']

The last two statements check if df.Value is within Max(3) and Min(3) and compares the df.State last value ( .shift ). Note: you can also use .between here instead of .isin if the numbers are subsequential.

np.select can handle multiple conditions quite well with good readability

df['Value'] = df['Value'].astype(float)

conditions =
[
    df['Value']>df['Max(3)'],
    df['Value']<df['Min(3)'],
    (df['Value']<=df['Max(3)']) & (df['Value']>= df['Min(3)']) & (df['State'].shift().isin((1,2))),
    (df['Value']<=df['Max(3)']) & (df['Value']>= df['Min(3)']) & (df['State'].shift().isin((3,4)))
]

choicelist = [1, 4, 2, 3]

df['State'] = np.select(conditions, choicelist, np.nan)
print(df)
   Index  Value  Max(3)  Min(3)  State
0      0   10.0     NaN     NaN    NaN
1      1   20.0     NaN     NaN    NaN
2      2   15.0     NaN     NaN    NaN
3      3   25.0    20.0    10.0    1.0
4      4   15.0    25.0    15.0    2.0
5      5   10.0    25.0    15.0    4.0
6      6   15.0    20.0    10.0    3.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM