
How to create a rolling window in pandas with another condition

I have a DataFrame with two columns:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))


    A   B
0   11  10
1   61  30
2   24  54
3   47  52
4   72  42
... ... ...
95  61  2
96  67  41
97  95  30
98  29  66
99  49  22
100 rows × 2 columns

Now I want to create a third column holding a rolling-window max of column 'A', with the constraint that the max must be lower than the corresponding value in column 'B'. In other words, with a window size of 4, I want the value among those 4 in column 'A' that is closest to the value in column 'B' while still being smaller than it.

For example, in row 3 (A = 47, B = 52) the value I am looking for is not 61 but 47, because 47 is the highest of the 4 values that is not higher than 52.

Pseudocode (not valid pandas, only to show the intent):

df['C'] = df['A'].rolling(window=4).max() where < df['B']

You can use concat + shift to create a wide DataFrame with the previous values, which makes complicated rolling calculations a bit easier.

Sample Data

import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))

Code

N = 4
# Column i holds A shifted down by i rows. The trailing .iloc[N-1:] drops the
# first N-1 rows, matching the default min_periods behavior of `.rolling`.
df1 = pd.concat([df['A'].shift(i).rename(i) for i in range(N)], axis=1).iloc[N-1:]

# Mask values that are not smaller than B, then take the row-wise max of the rest.
df['C'] = df1.where(df1.lt(df.B, axis=0)).max(axis=1)

print(df.head(15))

     A   B     C
0   51  92   NaN  # Missing b/c min_periods
1   14  71   NaN  # Missing b/c min_periods
2   60  20   NaN  # Missing b/c min_periods
3   82  86  82.0
4   74  74  60.0
5   87  99  87.0
6   23   2   NaN  # Missing b/c 82, 74, 87, 23 all > 2
7   21  52  23.0  # Max of 21, 23, 87, 74 which is < 52
8    1  87  23.0
9   29  37  29.0
10   1  63  29.0
11  59  20   1.0
12  32  75  59.0
13  57  21   1.0
14  88  48  32.0
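
As a sanity check on the concat + shift approach, here is a minimal sketch that wraps it in a function and compares it against an explicit loop over complete windows. The names rolling_max_below and brute_force are hypothetical helpers introduced only for this illustration; they are not part of pandas or of the answer above.

```python
import numpy as np
import pandas as pd


def rolling_max_below(df, n):
    """Hypothetical helper: largest of the last n values of A that is < B in the current row."""
    # One column per lag: column i holds A shifted down by i rows.
    wide = pd.concat([df['A'].shift(i).rename(i) for i in range(n)], axis=1).iloc[n - 1:]
    # Mask values that are not strictly below B, then take the row-wise max.
    return wide.where(wide.lt(df['B'], axis=0)).max(axis=1)


def brute_force(df, n):
    """Reference implementation: explicit loop over complete windows only."""
    out = pd.Series(np.nan, index=df.index)
    for pos in range(n - 1, len(df)):
        window = df['A'].iloc[pos - n + 1: pos + 1]
        below = window[window < df['B'].iloc[pos]]
        if not below.empty:
            out.iloc[pos] = below.max()
    return out


np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))

# Reindex so the first n-1 rows are present as NaN, like the loop version.
fast = rolling_max_below(df, 4).reindex(df.index)
slow = brute_force(df, 4)
print(fast.equals(slow))
```

The loop is much slower on large frames, but it states the requirement directly, which makes it a convenient reference when changing the window size or the comparison.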

You can pass a custom function to .apply on the rolling window. In this case, a default argument is used to pass in the B column.

df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))

def rollup(a, B=df.B):
    ix = a.index.max()
    b = B[ix]
    return a[a<b].max()

df['C'] = df.A.rolling(4).apply(rollup)

df
# returns:
     A   B     C
0    8  17   NaN
1   23  84   NaN
2   75  84   NaN
3   86  24  23.0
4   52  83  75.0
..  ..  ..   ...
95  38  22   NaN
96  53  48  38.0
97  45   4   NaN
98   3  92  53.0
99  91  86  53.0

NaN appears in two cases: when no value of A in the window is less than B, and in the first 3 rows, where fewer than 4 values are available to fill the window.
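
One thing to keep in mind with the default-argument trick is that B=df.B is bound when the function is defined, so it keeps pointing at that column even if df is later replaced. A possible variation, sketched here under that assumption, is a small factory that binds the column explicitly. make_rollup is a hypothetical name, and the seed is arbitrary and only there for reproducibility.

```python
import numpy as np
import pandas as pd


def make_rollup(b):
    """Hypothetical factory: returns a window function bound to the given B column."""
    def rollup(window):
        # With raw=False the window is a Series, so its last label is the current row.
        limit = b.loc[window.index[-1]]
        below = window[window < limit]
        return below.max() if len(below) else np.nan
    return rollup


np.random.seed(0)  # arbitrary seed, only so the example is reproducible
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))

# raw=False (the default) is spelled out because rollup relies on the window's index.
df['C'] = df['A'].rolling(4).apply(make_rollup(df['B']), raw=False)
print(df.head(8))
```

Passing raw=True would hand the function a plain NumPy array without an index, so the lookup of the current row's B would no longer work in this form.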

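You can use where to replace values that don't fulfill the condition with np.nan and then use rolling(window=4, min_periods=1). Note that this compares each value of A with the B in its own row before the window is applied, not with the B of the row at the end of the window, so it can give different results from filtering inside each window. With min_periods=1 it also returns values for the first rows instead of NaN:

In [37]: df['C'] = df['A'].where(df['A'] < df['B'], np.nan).rolling(window=4, min_periods=1).max()

In [38]: df
Out[38]: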
    A   B    C
0   0   1  0.0
1   1   2  1.0
2   2   3  2.0
3  10   4  2.0
4   4   5  4.0
5   5   6  5.0
6  10   7  5.0
7  10   8  5.0
8  10   9  5.0
9  10  10  NaN
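
To make the difference between masking before the window and filtering inside the window concrete, here is a small sketch on a hypothetical two-row frame chosen so that the two readings disagree. The data and the variable names masked and windowed are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row example chosen so the two readings disagree.
df = pd.DataFrame({'A': [5, 1], 'B': [3, 10]})

# Mask first: each A is compared with the B in its own row.
masked = df['A'].where(df['A'] < df['B']).rolling(window=2, min_periods=1).max()

# Window first: every A in the window is compared with the current row's B.
windowed = df['A'].rolling(window=2, min_periods=1).apply(
    lambda w: w[w < df['B'].loc[w.index[-1]]].max(), raw=False
)

print(masked.tolist())
print(windowed.tolist())
```

In row 1 the masked version yields 1.0, because the 5 from row 0 was already discarded against its own B of 3, while the windowed version yields 5.0, because 5 is below the current B of 10. Which of the two is wanted depends on how the question's condition is read.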
