I read data from a remote device every 5 seconds.
The readings are saved as:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
2018-01-01 00:00:15 2
2018-01-01 00:00:20 3
2018-01-01 00:00:25 4
2018-01-01 00:00:30 3
2018-01-01 00:00:35 2
2018-01-01 00:00:40 4
2018-01-01 00:00:45 5
2018-01-01 00:00:50 3
2018-01-01 00:00:55 3
Alas, the communication link is unreliable and sometimes drops out.
In that case the remote device provides the cumulative value of the readings as soon as the link is back.
The previous data could then be saved as:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
.......... 00:00:15 missing...
.......... 00:00:20 missing...
.......... 00:00:25 missing...
2018-01-01 00:00:30 12 <--- sum of the last 4 readings
2018-01-01 00:00:35 2
.......... 00:00:40 missing...
.......... 00:00:45 missing...
2018-01-01 00:00:50 15 <--- sum of the last 3 readings
2018-01-01 00:00:55 3
I need to fill all missing rows and replace each peak in the original data with the mean value spread over the rows the peak covers.
Resampling is easy:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
2018-01-01 00:00:15 NaN
2018-01-01 00:00:20 NaN
2018-01-01 00:00:25 NaN
2018-01-01 00:00:30 12
2018-01-01 00:00:35 2
2018-01-01 00:00:40 NaN
2018-01-01 00:00:45 NaN
2018-01-01 00:00:50 15
2018-01-01 00:00:55 3
but how do I fill the NaN values and remove the peaks?
I checked the various methods of asfreq and resample, but none of them (bfill, ffill) is useful in this case.
The final result should be:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
2018-01-01 00:00:15 3 <--- NaN filled with mean = peak 12/4 rows
2018-01-01 00:00:20 3 <--- NaN filled with mean
2018-01-01 00:00:25 3 <--- NaN filled with mean
2018-01-01 00:00:30 3 <--- peak changed
2018-01-01 00:00:35 2
2018-01-01 00:00:40 5 <--- NaN filled with mean = peak 15/3 rows
2018-01-01 00:00:45 5 <--- NaN filled with mean
2018-01-01 00:00:50 5 <--- peak changed
2018-01-01 00:00:55 3
The Series I used for testing:
import numpy as np
import pandas as pd
time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
read_data = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2, np.nan, np.nan, 15, 3], index=time).dropna()
read_data.asfreq("5s")
One way:
read_data = read_data.asfreq("5s")  # reinsert the missing rows as NaN first
m = (read_data.isna() | read_data.shift(fill_value=0).isna()).astype(int)
read_data = read_data.bfill() / m.groupby(m.ne(m.shift()).cumsum()).transform('count').where(m.eq(1), 1)
OUTPUT:
2021-01-01 00:00:00 2.0
2021-01-01 00:00:05 3.0
2021-01-01 00:00:10 3.0
2021-01-01 00:00:15 3.0
2021-01-01 00:00:20 3.0
2021-01-01 00:00:25 3.0
2021-01-01 00:00:30 3.0
2021-01-01 00:00:35 2.0
2021-01-01 00:00:40 5.0
2021-01-01 00:00:45 5.0
2021-01-01 00:00:50 5.0
2021-01-01 00:00:55 3.0
Freq: 5S, dtype: float64
Complete Example:
import numpy as np
import pandas as pd
time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
read_data = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2, np.nan, np.nan, 15, 3], index=time).dropna()
read_data = read_data.asfreq("5s")
m = (read_data.isna() | read_data.shift(fill_value=0).isna()).astype(int)
read_data = read_data.bfill() / m.groupby(m.ne(m.shift()).cumsum()).transform('count').where(m.eq(1), 1)
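For clarity, here is the same one-liner unrolled into separate steps (the names run_id and divisor are mine, just for illustration; the series is built directly with NaN in place of the missing rows):

import numpy as np
import pandas as pd

time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
read_data = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2, np.nan, np.nan, 15, 3], index=time)

# m is 1 on the gap rows (NaN) and on the row right after a NaN, i.e.
# the peak that closes each gap; 0 everywhere else.
m = (read_data.isna() | read_data.shift(fill_value=0).isna()).astype(int)

# Label every consecutive run of equal m values, so each gap-plus-peak
# block becomes one group.
run_id = m.ne(m.shift()).cumsum()

# Length of the block each row belongs to, forced to 1 outside the
# blocks so normal readings are divided by 1 and stay unchanged.
divisor = m.groupby(run_id).transform('count').where(m.eq(1), 1)

# bfill copies each peak backwards over its gap; dividing by the block
# length turns both the peak and the gap rows into the mean.
result = read_data.bfill() / divisor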
Starting again from the original read_data Series in the question: this can be done by segmenting (grouping) the missing values together with their corresponding peak value (after resampling) into a single group, backfilling, and then calculating the mean of each group:
>>> read_data = read_data.to_frame(name='val').assign(idx=range(len(read_data)))
>>> read_data = read_data.asfreq('5s').bfill()
>>> read_data = read_data/read_data.groupby('idx').transform(len)
>>> read_data.drop('idx', axis=1, inplace=True)
>>> read_data.val
2021-01-01 00:00:00 2.0
2021-01-01 00:00:05 3.0
2021-01-01 00:00:10 3.0
2021-01-01 00:00:15 3.0
2021-01-01 00:00:20 3.0
2021-01-01 00:00:25 3.0
2021-01-01 00:00:30 3.0
2021-01-01 00:00:35 2.0
2021-01-01 00:00:40 5.0
2021-01-01 00:00:45 5.0
2021-01-01 00:00:50 5.0
2021-01-01 00:00:55 3.0
Freq: 5S, Name: val, dtype: float64
First, convert the original Series to a DataFrame and introduce another column, idx, which uniquely identifies each row as its own group:
>>> read_data = read_data.to_frame(name='val').assign(idx=range(len(read_data)))
>>> read_data
val idx
2021-01-01 00:00:00 2.0 0
2021-01-01 00:00:05 3.0 1
2021-01-01 00:00:10 3.0 2
2021-01-01 00:00:30 12.0 3
2021-01-01 00:00:35 2.0 4
2021-01-01 00:00:50 15.0 5
2021-01-01 00:00:55 3.0 6
Resample to insert missing values, and then back-fill the missing values with peak values:
>>> read_data = read_data.asfreq('5s').bfill()
>>> read_data
val idx
2021-01-01 00:00:00 2.0 0.0
2021-01-01 00:00:05 3.0 1.0
2021-01-01 00:00:10 3.0 2.0
2021-01-01 00:00:15 12.0 3.0
2021-01-01 00:00:20 12.0 3.0
2021-01-01 00:00:25 12.0 3.0
2021-01-01 00:00:30 12.0 3.0
2021-01-01 00:00:35 2.0 4.0
2021-01-01 00:00:40 15.0 5.0
2021-01-01 00:00:45 15.0 5.0
2021-01-01 00:00:50 15.0 5.0
2021-01-01 00:00:55 3.0 6.0
As you can see, the backfilled values are now in the same group as their peak (they share the same idx).
So group by idx and simply divide the values by the length of each group, then remove the idx column:
>>> read_data = read_data/read_data.groupby('idx').transform(len)
>>> read_data.drop('idx', axis=1, inplace=True)
>>> read_data
val
2021-01-01 00:00:00 2.0
2021-01-01 00:00:05 3.0
2021-01-01 00:00:10 3.0
2021-01-01 00:00:15 3.0
2021-01-01 00:00:20 3.0
2021-01-01 00:00:25 3.0
2021-01-01 00:00:30 3.0
2021-01-01 00:00:35 2.0
2021-01-01 00:00:40 5.0
2021-01-01 00:00:45 5.0
2021-01-01 00:00:50 5.0
2021-01-01 00:00:55 3.0
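As a quick sanity check (my addition, not part of the answer above), spreading each peak over its gap must preserve the total of the raw readings:

>>> read_data.val.sum()  # same total as the raw readings: 2+3+3+12+2+15+3
40.0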