How to replace higher-than-threshold values by the median of previous 3 rows (30 minutes interval) within a group in Pandas?

I have the following pandas DataFrame:

col1 col2                   col3    col4 
A    2021-03-28 02:40:00    1.50    0.0
A    2021-03-28 02:40:00    1.80    0.0
A    2021-03-28 02:50:00    0.50    0.0
A    2021-03-28 03:00:00    10.00   0.0
A    2021-03-28 03:10:00    0.00    0.0
A    2021-03-28 03:20:00    0.00    0.0
A    2021-03-28 03:30:00    0.14    0.0

All col3 values that are greater than 5 should be replaced with the median of the past 30 minutes, which corresponds to the previous 3 rows.

Expected result:

col1 col2                   col3    col4 
A    2021-03-28 02:40:00    1.50    0.0
A    2021-03-28 02:40:00    1.80    0.0
A    2021-03-28 02:50:00    0.50    0.0
A    2021-03-28 03:00:00    1.50    0.0
A    2021-03-28 03:10:00    0.00    0.0
A    2021-03-28 03:20:00    0.00    0.0
A    2021-03-28 03:30:00    0.14    0.0

Thus, the value 10 in col3 is replaced by 1.5, which is the median of the previous 3 rows: np.median([1.5, 1.8, 0.5]).
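To make the example reproducible, the sample above could be built roughly as follows. This is only a sketch: the names `previous_three` and `fallback` are introduced here for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["A"] * 7,
    "col2": pd.to_datetime([
        "2021-03-28 02:40:00", "2021-03-28 02:40:00", "2021-03-28 02:50:00",
        "2021-03-28 03:00:00", "2021-03-28 03:10:00", "2021-03-28 03:20:00",
        "2021-03-28 03:30:00",
    ]),
    "col3": [1.50, 1.80, 0.50, 10.00, 0.00, 0.00, 0.14],
    "col4": [0.0] * 7,
})

# The only value above the threshold of 5 sits in row 3;
# the three rows before it are rows 0, 1 and 2.
previous_three = df["col3"].iloc[0:3]
fallback = float(np.median(previous_three))
print(fallback)
```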

How can I automate this for the whole DataFrame?

We can break it into 2 parts:

  1. Calculate the median of the past 30 minutes, which corresponds to the previous 3 rows, within each col1 group.
  2. Select the rows whose value exceeds the threshold and replace them with the median from part 1.

1st Part : You can use .rolling() to get a rolling window of the past 30 minutes and then use .apply() to apply np.median to each window. Then use .shift() so that each row receives the result computed for the previous row.

Here we use a time-based rolling window of 30 minutes (the offset alias 30T, spelled 30min in recent pandas versions) instead of a fixed number of rows as the window size. The advantage is that your data does not have to arrive at consistent, fixed intervals of 5, 10 or 15 minutes. As long as you want a 30-minute window, pandas works out how many rows fall inside it.
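The difference between a time-based and a count-based window shows up as soon as the spacing is irregular. The following is a minimal sketch on made-up readings (the timestamps and values are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical, irregularly spaced readings: three close together, then a gap.
times = pd.to_datetime([
    "2021-03-28 00:00:00",
    "2021-03-28 00:05:00",
    "2021-03-28 00:10:00",
    "2021-03-28 00:40:00",
])
values = pd.Series([1.0, 2.0, 3.0, 100.0], index=times)

# Time-based window: the 30 minutes ending at each row
# (by default the start is excluded and the end is included).
time_based = values.rolling("30min").median()

# Count-based window: always the last 3 rows, however far apart they are.
count_based = values.rolling(3, min_periods=1).median()

print(pd.DataFrame({"time_based": time_based, "count_based": count_based}))
```

At 00:40 the time-based window contains only the last reading, because the earlier ones are 30 minutes old or more, while the count-based window still reaches back to 00:05.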

Since the time sequence in col2 applies within a specific value of col1 (probably some kind of grouping), we also use .groupby() on col1 so that each group's time sequence is processed separately.
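One detail worth checking is that the shift also stays inside each group, so the first row of a group never borrows a median from the group before it. Below is a hedged sketch of one way to do this; the helper `previous_window_median` and the second group "B" are assumptions made for illustration, and each group is assumed to be sorted by col2.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["A", "A", "A", "B", "B"],
    "col2": pd.to_datetime([
        "2021-03-28 02:40:00", "2021-03-28 02:50:00", "2021-03-28 03:00:00",
        "2021-03-28 02:40:00", "2021-03-28 02:50:00",
    ]),
    "col3": [1.0, 2.0, 3.0, 10.0, 20.0],
})

def previous_window_median(group, window="30min"):
    """Rolling median within one group, shifted down by one row."""
    s = group.set_index("col2")["col3"]
    shifted = s.rolling(window).median().shift()
    # Put the original row labels back so the result aligns with df.
    return pd.Series(shifted.to_numpy(), index=group.index)

per_group = pd.concat(
    [previous_window_median(g) for _, g in df.groupby("col1")]
).reindex(df.index)

print(per_group)
```

The first row of each group comes out as NaN, since there is no earlier row in that group to take a median from.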

2nd Part : We use .mask() with the condition df['col3'] > threshold; where the condition holds, the value is replaced with the one calculated in the 1st part.
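As a small illustration of how .mask() behaves on its own, here is a sketch with made-up numbers (the series `values` and `replacement` are hypothetical):

```python
import pandas as pd

values = pd.Series([1.0, 10.0, 2.0])
replacement = pd.Series([0.5, 0.7, 0.9])
threshold = 5

# Where the condition is True, take the entry from `replacement`
# at the same index label; everywhere else keep the original value.
masked = values.mask(values > threshold, replacement)

print(masked.tolist())
```

Only the middle entry exceeds the threshold, so only that one is swapped for the corresponding entry of `replacement`.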

Here's the code:

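import numpy as np
import pandas as pd

df['col2'] = pd.to_datetime(df['col2'])

threshold = 5
# Assumes df is sorted by col1, then col2, and has a default RangeIndex,
# so the positions of the rolling result line up with the rows of df.
df['col3'] = (df['col3'].mask(
                 df['col3'] > threshold,
                 df.groupby('col1')
                   .rolling('30min', on='col2')['col3']  # '30T' is the older, deprecated alias
                   .apply(np.median, raw=True)
                   .groupby(level=0).shift()             # previous row, within each col1 group
                   .reset_index()['col3'])
             )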

Result:

print(df)


  col1                col2  col3  col4
0    A 2021-03-28 02:40:00  1.50   0.0
1    A 2021-03-28 02:40:00  1.80   0.0
2    A 2021-03-28 02:50:00  0.50   0.0
3    A 2021-03-28 03:00:00  1.50   0.0
4    A 2021-03-28 03:10:00  0.00   0.0
5    A 2021-03-28 03:20:00  0.00   0.0
6    A 2021-03-28 03:30:00  0.14   0.0
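
Alternatively, with a fixed window of 4 rows (the current row plus the previous 3) and without grouping by col1:

df['col3'].rolling(4, min_periods=0).apply(lambda x: np.median(x[-4:-1]) if x[-1] > 5 else x[-1], raw=True)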
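The fixed-window one-liner above could also be applied per col1 group. The sketch below is one possible way, not the original author's code: the helper `replace_if_high`, the second group "B", and the choice to leave a high value untouched when it has no earlier rows are all assumptions made here.

```python
import numpy as np
import pandas as pd

def replace_if_high(window, threshold=5):
    """Return the median of the earlier rows if the current value is too high."""
    current, previous = window[-1], window[:-1]
    # Assumption: a high value with no earlier rows in its group is kept as-is.
    if current > threshold and previous.size > 0:
        return np.median(previous)
    return current

df = pd.DataFrame({
    "col1": ["A"] * 4 + ["B"] * 2,
    "col3": [1.5, 1.8, 0.5, 10.0, 9.0, 1.0],
})

# Window of 4 rows: the current row plus up to 3 previous rows of the same group.
cleaned = df.groupby("col1")["col3"].transform(
    lambda s: s.rolling(4, min_periods=1).apply(replace_if_high, raw=True)
)

print(cleaned.tolist())
```

Note that this variant counts rows rather than minutes, so it matches the 30-minute requirement only when the rows are evenly spaced at 10 minutes.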
