
Pandas: how to fill missing data with a mean value?

I read from a remote device some data every 5 seconds.

They are saved as:

 2018-01-01 00:00:00    2
 2018-01-01 00:00:05    3
 2018-01-01 00:00:10    3
 2018-01-01 00:00:15    2
 2018-01-01 00:00:20    3
 2018-01-01 00:00:25    4
 2018-01-01 00:00:30    3
 2018-01-01 00:00:35    2
 2018-01-01 00:00:40    4
 2018-01-01 00:00:45    5
 2018-01-01 00:00:50    3
 2018-01-01 00:00:55    3

Alas, the communication link is not always reliable, and sometimes readings are lost.

In that case the remote device reports the cumulative value of the missed readings as soon as the link is restored.

The previous data could be saved as:

 2018-01-01 00:00:00     2
 2018-01-01 00:00:05     3
 2018-01-01 00:00:10     3
 .......... 00:00:15  missing...
 .......... 00:00:20  missing...
 .......... 00:00:25  missing...
 2018-01-01 00:00:30    12      <--- sum of the last 4 readings 
 2018-01-01 00:00:35     2
 .......... 00:00:40  missing...
 .......... 00:00:45  missing...
 2018-01-01 00:00:50    15      <--- sum of the last 3 readings
 2018-01-01 00:00:55     3

I need to fill all missing rows and replace each peak in the original data with the mean value calculated over the peak and the gap that precedes it.

Resampling is easy:

 2018-01-01 00:00:00      2
 2018-01-01 00:00:05      3
 2018-01-01 00:00:10      3
 2018-01-01 00:00:15    NaN
 2018-01-01 00:00:20    NaN
 2018-01-01 00:00:25    NaN
 2018-01-01 00:00:30     12
 2018-01-01 00:00:35      2
 2018-01-01 00:00:40    NaN
 2018-01-01 00:00:45    NaN
 2018-01-01 00:00:50     15
 2018-01-01 00:00:55      3

but how to fill NaN and remove the peaks?

I checked the various methods for asfreq and resample, but none of them (bfill, ffill) is useful in this case.
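To illustrate why a plain back-fill is not enough: it copies the whole cumulative peak into every empty slot instead of spreading it, so each gap sums to a multiple of the original value. A minimal sketch using the test series from further below:

```python
import numpy as np
import pandas as pd

time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
read_data = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2,
                       np.nan, np.nan, 15, 3], index=time).dropna()

# bfill alone copies the peak value 12 into each of the three NaN slots
filled = read_data.asfreq('5s').bfill()
print(filled.loc['2021-01-01 00:00:15':'2021-01-01 00:00:30'].tolist())
# [12.0, 12.0, 12.0, 12.0] -- four rows of 12 instead of four rows of 3
```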

The final result should be:

 2018-01-01 00:00:00      2
 2018-01-01 00:00:05      3
 2018-01-01 00:00:10      3
 2018-01-01 00:00:15      3 <--- NaN filled with mean = peak 12/4 rows
 2018-01-01 00:00:20      3 <--- NaN filled with mean
 2018-01-01 00:00:25      3 <--- NaN filled with mean
 2018-01-01 00:00:30      3 <--- peak changed
 2018-01-01 00:00:35      2
 2018-01-01 00:00:40      5  <--- NaN filled with mean = peak 15/3 rows
 2018-01-01 00:00:45      5  <--- NaN filled with mean
 2018-01-01 00:00:50      5  <--- peak changed
 2018-01-01 00:00:55      3

Dataframes I used for testing:

import numpy as np
import pandas as pd

time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
read_data = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2, np.nan, np.nan, 15, 3], index=time).dropna()

read_data.asfreq("5s")

One way:

 read_data = read_data.asfreq("5s")   # reinsert the missing 5-second slots as NaN

 m = (read_data.isna() | read_data.shift(fill_value=0).isna()).astype(int)
 read_data = read_data.bfill() / m.groupby(m.ne(m.shift()).cumsum()).transform('count').where(m.eq(1), 1)

OUTPUT:

2021-01-01 00:00:00    2.0
2021-01-01 00:00:05    3.0
2021-01-01 00:00:10    3.0
2021-01-01 00:00:15    3.0
2021-01-01 00:00:20    3.0
2021-01-01 00:00:25    3.0
2021-01-01 00:00:30    3.0
2021-01-01 00:00:35    2.0
2021-01-01 00:00:40    5.0
2021-01-01 00:00:45    5.0
2021-01-01 00:00:50    5.0
2021-01-01 00:00:55    3.0
Freq: 5S, dtype: float64

Complete Example:

import numpy as np
import pandas as pd

time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
read_data = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2, np.nan, np.nan, 15, 3], index=time).dropna()

read_data = read_data.asfreq("5s")

m = (read_data.isna() | read_data.shift(fill_value=0).isna()).astype(int)
read_data = read_data.bfill() / m.groupby(m.ne(m.shift()).cumsum()).transform('count').where(m.eq(1), 1)
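To see what the one-liner is doing, here it is unrolled step by step; the intermediate names `runs` and `divisor` are my own, the logic is unchanged:

```python
import numpy as np
import pandas as pd

time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
read_data = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2,
                       np.nan, np.nan, 15, 3], index=time).dropna().asfreq('5s')

# m is 1 on rows belonging to a gap-plus-peak run: a row is in a run if it
# is NaN itself, or if the row before it was NaN (that catches the peak row)
m = (read_data.isna() | read_data.shift(fill_value=0).isna()).astype(int)

# consecutive equal values of m form one group; the group size is the run
# length, which is the divisor for the back-filled peak (1 elsewhere)
runs = m.ne(m.shift()).cumsum()
divisor = m.groupby(runs).transform('count').where(m.eq(1), 1)

result = read_data.bfill() / divisor
print(result.tolist())
# [2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 5.0, 5.0, 5.0, 3.0]
```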

This can be done by grouping each run of missing values together with its corresponding peak value (after resampling) into a single group, back-filling, and then dividing by the size of each group:

>>> read_data = read_data.to_frame(name='val').assign(idx=range(len(read_data)))
>>> read_data = read_data.asfreq('5s').bfill()
>>> read_data = read_data/read_data.groupby('idx').transform('size')
>>> read_data.drop('idx', axis=1, inplace=True)
>>> read_data.val
2021-01-01 00:00:00    2.0
2021-01-01 00:00:05    3.0
2021-01-01 00:00:10    3.0
2021-01-01 00:00:15    3.0
2021-01-01 00:00:20    3.0
2021-01-01 00:00:25    3.0
2021-01-01 00:00:30    3.0
2021-01-01 00:00:35    2.0
2021-01-01 00:00:40    5.0
2021-01-01 00:00:45    5.0
2021-01-01 00:00:50    5.0
2021-01-01 00:00:55    3.0
Freq: 5S, Name: val, dtype: float64

Explanation:

First convert your original series to a dataframe and introduce another column idx, which uniquely identifies each row so that each row starts out as its own group:

>>> read_data = read_data.to_frame(name='val').assign(idx=range(len(read_data)))
>>> read_data
                      val  idx
2021-01-01 00:00:00   2.0    0
2021-01-01 00:00:05   3.0    1
2021-01-01 00:00:10   3.0    2
2021-01-01 00:00:30  12.0    3
2021-01-01 00:00:35   2.0    4
2021-01-01 00:00:50  15.0    5
2021-01-01 00:00:55   3.0    6

Resample to insert missing values, and then back-fill the missing values with peak values:

>>> read_data = read_data.asfreq('5s').bfill()
>>> read_data
                      val  idx
2021-01-01 00:00:00   2.0  0.0
2021-01-01 00:00:05   3.0  1.0
2021-01-01 00:00:10   3.0  2.0
2021-01-01 00:00:15  12.0  3.0
2021-01-01 00:00:20  12.0  3.0
2021-01-01 00:00:25  12.0  3.0
2021-01-01 00:00:30  12.0  3.0
2021-01-01 00:00:35   2.0  4.0
2021-01-01 00:00:40  15.0  5.0
2021-01-01 00:00:45  15.0  5.0
2021-01-01 00:00:50  15.0  5.0
2021-01-01 00:00:55   3.0  6.0

As you can see, the back-filled values are now in the same group as their peak (they share the same idx).
So group by idx, divide the values by the length of each group, and remove the idx column:

>>> read_data = read_data/read_data.groupby('idx').transform('size')
>>> read_data.drop('idx', axis=1, inplace=True)
>>> read_data
                     val
2021-01-01 00:00:00  2.0
2021-01-01 00:00:05  3.0
2021-01-01 00:00:10  3.0
2021-01-01 00:00:15  3.0
2021-01-01 00:00:20  3.0
2021-01-01 00:00:25  3.0
2021-01-01 00:00:30  3.0
2021-01-01 00:00:35  2.0
2021-01-01 00:00:40  5.0
2021-01-01 00:00:45  5.0
2021-01-01 00:00:50  5.0
2021-01-01 00:00:55  3.0
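For reuse, the steps above can be wrapped in a small helper; the function name `spread_peaks` and the `freq` parameter are my own additions, not part of the original answer:

```python
import numpy as np
import pandas as pd

def spread_peaks(s: pd.Series, freq: str = '5s') -> pd.Series:
    """Resample s to freq, then spread each cumulative peak evenly
    over the peak row and the gap rows that precede it."""
    df = s.to_frame(name='val').assign(idx=range(len(s)))
    df = df.asfreq(freq).bfill()   # gap rows inherit the idx of their peak
    df['val'] /= df.groupby('idx')['val'].transform('size')
    return df['val']

time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
raw = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2,
                 np.nan, np.nan, 15, 3], index=time).dropna()
print(spread_peaks(raw).tolist())
# [2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 5.0, 5.0, 5.0, 3.0]
```

The helper assumes the input has a DatetimeIndex with no rows for the missing timestamps, exactly like the test series in the question.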
