
Conditional creation of a dataframe column based on numeric values

I have a pandas dataframe timeseries (of about 1000 rows and the four columns below) that looks like this:

Date          Values  Avg    +1 Stdev
01/01/2010    1.01    1.00   1.05
02/01/2010    1.02    1.00   1.05
03/01/2010    1.04    1.00   1.05
04/01/2010    -0.97   1.00   1.05
05/01/2010    1.12    1.00   1.05
06/01/2010    1.08    1.00   1.05
....

What I'm trying to do is create a fifth column (called 'Trigger Date') that returns the date (from the index column) whenever the value in column 2 ('Values') breaches the threshold set in column 4 ('+1 Stdev'), and returns no value otherwise. As an additional constraint, the fifth column should NOT return a date if the previous row's value had already breached the threshold in column 4.

In other words, the pseudocode for the problem would be:

If df['Values'] > df['+1 Stdev']
AND
If df['Values'] (for the row above) < df['+1 Stdev']
THEN
Return df['Date'] in new column df['Trigger Date']
ELSE
Leave row in df['Trigger Date'] blank

Any help on how to tackle this would be greatly appreciated.

EDIT: Additional question: is there any way to add a third constraint, where no trigger date is returned if one has already occurred in the past XX days (e.g. in the past 30 days)? The expected output would then look like:

         Date  Values  Avg  +1 Stdev Trigger Date
0  01/01/2010    1.01  1.0      1.05          NaN
1  02/01/2010    1.02  1.0      1.05          NaN
2  03/01/2010    1.04  1.0      1.05          NaN
3  04/01/2010   -0.97  1.0      1.05          NaN
4  05/01/2010    1.12  1.0      1.05   05/01/2010
5  06/01/2010    1.08  1.0      1.05          NaN
6  07/01/2010    1.03  1.0      1.05          NaN
7  08/01/2010    1.07  1.0      1.05          NaN <- above threshold, but trigger occurred within last 30 days so don't return date
...
50 20/02/2010    1.12  1.0      1.05   20/02/2010 <- more than 30 days later, no trigger dates in between, so return date

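Use numpy.where together with Series.shift, which lets each row be compared against the value in the row above:

import numpy as np

# m1: the current value is above the threshold
m1 = df['Values'] > df['+1 Stdev']
# m2: the previous row's value was below the threshold
m2 = df['Values'].shift() < df['+1 Stdev']

df['Trigger Date'] = np.where(m1 & m2, df['Date'], np.nan)
print(df)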
         Date  Values  Avg  +1 Stdev Trigger Date
0  01/01/2010    1.01  1.0      1.05          NaN
1  02/01/2010    1.02  1.0      1.05          NaN
2  03/01/2010    1.04  1.0      1.05          NaN
3  04/01/2010   -0.97  1.0      1.05          NaN
4  05/01/2010    1.12  1.0      1.05   05/01/2010
5  06/01/2010    1.08  1.0      1.05          NaN
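
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the six sample rows from the question and uses `Series.where` in place of `np.where`, which keeps the result as a pandas Series aligned on the index. The helper name `add_trigger_date` is hypothetical, not something from the original answer.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'Date': ['01/01/2010', '02/01/2010', '03/01/2010',
             '04/01/2010', '05/01/2010', '06/01/2010'],
    'Values': [1.01, 1.02, 1.04, -0.97, 1.12, 1.08],
    'Avg': [1.00] * 6,
    '+1 Stdev': [1.05] * 6,
})


def add_trigger_date(frame):
    """Hypothetical helper: mark rows where 'Values' crosses above '+1 Stdev'."""
    out = frame.copy()
    # Current value is above the threshold.
    above = out['Values'] > out['+1 Stdev']
    # Previous row's value was below the threshold (first row compares as False).
    prev_below = out['Values'].shift() < out['+1 Stdev']
    # Keep the date where both conditions hold, NaN elsewhere.
    out['Trigger Date'] = out['Date'].where(above & prev_below)
    return out


result = add_trigger_date(df)
print(result)
```

Only row 4 (05/01/2010) satisfies both conditions in this sample, which matches the output shown above.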

EDIT:

import numpy as np
import pandas as pd

# parse the day-first date strings into datetimes
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

m1 = df['Values'] > df['+1 Stdev']
m2 = df['Values'].shift() < df['+1 Stdev']
# start of each row's 30-day look-back window
a = df['Date'] - pd.Timedelta(days=30)
# for each window, flag rows whose next date falls inside it
L = [df['Date'].shift(-1).isin(pd.date_range(x, y, freq='D')) for x, y in zip(a, df['Date'])]
m3 = np.logical_or.reduce(L)

mask = (m1 & m2) | ~m3

df.loc[mask, 'Trigger Date'] = df['Date']
print(df)
        Date  Values  Avg  +1 Stdev Trigger Date
0 2010-01-01    1.01  1.0      1.05          NaT
1 2010-01-02    1.02  1.0      1.05          NaT
2 2010-01-03    1.04  1.0      1.05          NaT
3 2010-01-04   -0.97  1.0      1.05          NaT
4 2010-01-05    1.12  1.0      1.05   2010-01-05
5 2010-01-06    1.08  1.0      1.05          NaT
6 2010-02-20    1.12  1.0      1.05   2010-02-20
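
One caveat about the EDIT snippet, as far as I can tell: every shifted date falls inside its own 30-day window, so `~m3` is True only for the last row (whose shifted date is NaT), and the mask never checks which earlier rows actually triggered. A cooldown of this kind depends on the previously accepted trigger, so an explicit loop is one straightforward alternative. The sketch below is not the original answer's method. The helper name `add_trigger_with_cooldown`, the strict "more than 30 days" comparison, and the extra 19/02/2010 row (added so that the 20/02/2010 value is an upward crossing) are all assumptions for illustration.

```python
import pandas as pd


def add_trigger_with_cooldown(frame, cooldown_days=30):
    """Hypothetical helper: upward crossings, at most one per cooldown window."""
    out = frame.copy()
    out['Date'] = pd.to_datetime(out['Date'], format='%d/%m/%Y')

    # Candidate rows: above the threshold now, below it in the previous row.
    above = out['Values'] > out['+1 Stdev']
    prev_below = out['Values'].shift() < out['+1 Stdev']
    candidates = above & prev_below

    cooldown = pd.Timedelta(days=cooldown_days)
    keep = []
    last_trigger = None
    for date, is_candidate in zip(out['Date'], candidates):
        # Assumption: a new trigger needs strictly more than `cooldown_days`
        # since the last accepted trigger.
        ok = bool(is_candidate) and (
            last_trigger is None or date - last_trigger > cooldown
        )
        if ok:
            last_trigger = date
        keep.append(ok)

    out['Trigger Date'] = out['Date'].where(pd.Series(keep, index=out.index))
    return out


# Rows 0-7 come from the question; the 19/02/2010 row is assumed.
df = pd.DataFrame({
    'Date': ['01/01/2010', '02/01/2010', '03/01/2010', '04/01/2010',
             '05/01/2010', '06/01/2010', '07/01/2010', '08/01/2010',
             '19/02/2010', '20/02/2010'],
    'Values': [1.01, 1.02, 1.04, -0.97, 1.12, 1.08, 1.03, 1.07, 1.00, 1.12],
    'Avg': [1.00] * 10,
    '+1 Stdev': [1.05] * 10,
})

result = add_trigger_with_cooldown(df)
print(result)
```

With this data, 08/01/2010 is a crossing but is suppressed because it falls 3 days after the 05/01/2010 trigger, while 20/02/2010 is 46 days later and is kept. The loop is O(n), which should be fine for a frame of about 1000 rows.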
