I have the following pandas DataFrame:
col1 col2 col3 col4
A 2021-03-28 02:40:00 1.50 0.0
A 2021-03-28 02:40:00 1.80 0.0
A 2021-03-28 02:50:00 0.50 0.0
A 2021-03-28 03:00:00 10.00 0.0
A 2021-03-28 03:10:00 0.00 0.0
A 2021-03-28 03:20:00 0.00 0.0
A 2021-03-28 03:30:00 0.14 0.0
All col3 values that are greater than 5 should be replaced with the median of the past 30 minutes, which corresponds to the previous 3 rows.
Expected result:
col1 col2 col3 col4
A 2021-03-28 02:40:00 1.50 0.0
A 2021-03-28 02:40:00 1.80 0.0
A 2021-03-28 02:50:00 0.50 0.0
A 2021-03-28 03:00:00 1.50 0.0
A 2021-03-28 03:10:00 0.00 0.0
A 2021-03-28 03:20:00 0.00 0.0
A 2021-03-28 03:30:00 0.14 0.0
Thus, the value 10 in col3 was replaced by 1.5, which is the median of the previous 3 rows: np.median([1.5, 1.8, 0.5]).
How can I automate this for the whole DataFrame?
We can break it into 2 parts:
1st Part: You can use .rolling() to get a rolling window of the past 30 minutes and then use .apply() to apply the np.median function on this rolling window. Then use .shift() so that each row gets the entry of the previous row.
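As a minimal sketch of this first part, the snippet below applies a 30-minute rolling median and then shifts it by one row. It uses a shortened copy of the sample data as a Series indexed by time; the names s, rolling_median and previous_median are only illustrative, and '30min' is used as the alias for 30 minutes.

```python
import numpy as np
import pandas as pd

# First five rows of the sample col3, indexed by their col2 timestamps
s = pd.Series(
    [1.50, 1.80, 0.50, 10.00, 0.00],
    index=pd.to_datetime([
        "2021-03-28 02:40", "2021-03-28 02:40", "2021-03-28 02:50",
        "2021-03-28 03:00", "2021-03-28 03:10",
    ]),
)

# Median over the 30-minute window that ends at each row (current row included)
rolling_median = s.rolling("30min").apply(np.median, raw=True)

# Shift by one row so each row sees the median computed at the previous row
previous_median = rolling_median.shift()

# Candidate replacement for the outlier at 03:00
print(previous_median.iloc[3])
```

Note that after .shift() each row holds the median of the window ending at the previous row, so the outlier itself is not part of the median that replaces it.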
Here we use a rolling window of 30T (30 minutes) instead of a fixed number of rows as the window size. This has the advantage that your data is not constrained to consistent, fixed intervals of 5, 10 or 15 minutes. As long as you want to calculate over 30 minutes, pandas will pick the correct number of rows to work on.
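To illustrate the difference, here is a small sketch with irregularly spaced, made-up timestamps (not from the question). It counts how many rows fall into a time-based window versus a fixed 3-row window; by_time and by_rows are illustrative names.

```python
import pandas as pd

# Hypothetical, irregularly spaced observations
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime([
        "2021-03-28 02:00", "2021-03-28 02:45",
        "2021-03-28 02:50", "2021-03-28 03:00",
    ]),
)

# Number of rows inside each 30-minute window: depends on the timestamps
by_time = s.rolling("30min").count()

# Number of rows inside each 3-row window: ignores the timestamps entirely
by_rows = s.rolling(3, min_periods=1).count()

print(by_time.tolist())
print(by_rows.tolist())
```

The 3-row window at 02:50 still includes the observation from 02:00, which is 50 minutes old, while the time-based window drops it.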
As the time sequence in col2 is scoped to a specific value of col1 (probably some kind of grouping), we further have to use .groupby() on col1 so that the time sequence is processed separately for each col1 group.
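The effect of the grouping can be sketched with a made-up two-group frame (the group B and its values are assumptions for illustration only). The rolling median is computed per group, and, as one possible variant, the shift is also done per group so that the first row of a group does not pick up a value from the previous group.

```python
import pandas as pd

# Hypothetical data with two groups that share the same timestamps
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B"],
    "col2": pd.to_datetime([
        "2021-03-28 02:40", "2021-03-28 02:50",
        "2021-03-28 02:40", "2021-03-28 02:50",
    ]),
    "col3": [1.0, 3.0, 100.0, 300.0],
})

# 30-minute rolling median, computed separately inside each col1 group
per_group = df.groupby("col1").rolling("30min", on="col2")["col3"].median()

# Shift within each group (level 0 of the result index is the group key)
shifted = per_group.groupby(level=0).shift()

print(per_group.tolist())
print(shifted.tolist())
```

A plain .shift() on the combined result moves values by position only, so with several groups the first row of B would receive the last median of A; shifting per group avoids that.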
2nd Part: We use .mask() with the condition df['col3'] > threshold; where the condition holds true, we replace the value with the one calculated in the 1st part.
Here's the code:
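In isolation, .mask() behaves as in the following sketch. The replacement Series here is a stand-in with made-up values; in the actual solution it would be the shifted rolling median from the 1st part.

```python
import pandas as pd

s = pd.Series([1.5, 10.0, 0.5])

# Stand-in replacement values, aligned with s by index
replacement = pd.Series([9.9, 1.5, 9.9])

threshold = 5

# Where the condition is True, take the value from replacement; otherwise keep s
result = s.mask(s > threshold, replacement)

print(result.tolist())
```

Only the second element exceeds the threshold, so only that element is taken from replacement; the other replacement values are ignored.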
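import numpy as np
import pandas as pd

df['col2'] = pd.to_datetime(df['col2'])
threshold = 5

# '30T' means 30 minutes; newer pandas versions prefer the alias '30min'
df['col3'] = df['col3'].mask(
    df['col3'] > threshold,              # replace only values above the threshold
    df.groupby('col1')
      .rolling('30T', on='col2')['col3']
      .apply(np.median, raw=True)        # median over each 30-minute window
      .shift()                           # take the median computed at the previous row
      .reset_index()['col3']             # back to a plain 0..n-1 index for alignment
)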
Result:
print(df)
col1 col2 col3 col4
0 A 2021-03-28 02:40:00 1.50 0.0
1 A 2021-03-28 02:40:00 1.80 0.0
2 A 2021-03-28 02:50:00 0.50 0.0
3 A 2021-03-28 03:00:00 1.50 0.0
4 A 2021-03-28 03:10:00 0.00 0.0
5 A 2021-03-28 03:20:00 0.00 0.0
6 A 2021-03-28 03:30:00 0.14 0.0
Alternatively, with a fixed window of 4 rows (the current row plus the previous 3):
df['col3'].rolling(4, min_periods=0).apply(lambda x: np.median(x[-4:-1]) if x[-1] > 5 else x[-1], raw=True)
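The row-based one-liner can be checked against the sample data with the sketch below. It assumes that 3 rows always correspond to 30 minutes and that there is a single col1 group, neither of which the one-liner verifies; replace_outlier and out are illustrative names.

```python
import numpy as np
import pandas as pd

col3 = pd.Series([1.50, 1.80, 0.50, 10.00, 0.00, 0.00, 0.14])

def replace_outlier(x):
    # x is a NumPy array holding up to 4 values; x[-1] is the current row
    if x[-1] > 5:
        # Median of the (up to) 3 rows before the current one
        return np.median(x[-4:-1])
    return x[-1]

out = col3.rolling(4, min_periods=0).apply(replace_outlier, raw=True)

print(out.tolist())
```

To keep the result, it would have to be assigned back, for example with df['col3'] = ... on the original frame.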