
Pandas take nearest value to the second and interpolate

I'm looking to convert a data frame of the following format as an example:

>>>df
                      vals
2019-08-10 12:03:05   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:07   NaN
2019-08-10 12:03:08   3.0
2019-08-10 12:03:09   4.0
2019-08-10 12:03:10   NaN
2019-08-10 12:03:11   NaN
2019-08-10 12:03:12   5.0
2019-08-10 12:03:13   NaN
2019-08-10 12:03:14   1.0
2019-08-10 12:03:15   NaN
2019-08-10 12:03:16   NaN
2019-08-10 12:03:17   6.0

into one such as:

>>>df
                      vals
2019-08-10 12:03:05   1.0
2019-08-10 12:03:06   1.667
2019-08-10 12:03:07   2.333
2019-08-10 12:03:08   3.0
2019-08-10 12:03:09   3.667
2019-08-10 12:03:10   4.333
2019-08-10 12:03:11   5.0
2019-08-10 12:03:12   3.667
2019-08-10 12:03:13   2.333
2019-08-10 12:03:14   1.0
2019-08-10 12:03:15   2.667
2019-08-10 12:03:16   4.333
2019-08-10 12:03:17   6.0

Where the dataframe was first aligned to look like the following (taking the closest value to every 3rd value):

>>>df
                      vals
2019-08-10 12:03:05   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:07   NaN
2019-08-10 12:03:08   3.0
2019-08-10 12:03:09   NaN
2019-08-10 12:03:10   NaN
2019-08-10 12:03:11   5.0
2019-08-10 12:03:12   NaN
2019-08-10 12:03:13   NaN
2019-08-10 12:03:14   1.0
2019-08-10 12:03:15   NaN
2019-08-10 12:03:16   NaN
2019-08-10 12:03:17   6.0

And then linearly interpolated between each value to produce the final dataframe. Should there be a gap of more than 2 seconds, I'd like to just not interpolate between those 2 values.

This is what I've tried so far:

df.resample('3s').nearest()

Which produces:

>>> df.resample('3s').nearest()
                     vals
2019-08-10 12:03:03   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:09   4.0
2019-08-10 12:03:12   5.0
2019-08-10 12:03:15   NaN

Also:

>>> df.resample('2s').nearest()
                     vals
2019-08-10 12:03:04   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:08   3.0
2019-08-10 12:03:10   NaN
2019-08-10 12:03:12   5.0
2019-08-10 12:03:14   1.0
2019-08-10 12:03:16   NaN

Which makes it very clear that nearest is a complete lie, or at least a misnomer, because the nearest value to 12:03:10 is quite obviously 4. Also, the final value at 2019-08-10 12:03:16 should definitely be 6.0.

This is just the step of aligning the values to the second; after that, a simple interpolate seems to work.

Any help is appreciated.

I think you need the base parameter to shift the start of the sampling period. Set it to the seconds of the first index value modulo 3 (because the period is 3 seconds), and take the first non-missing value of each bin with Resampler.first:

df['new'] = df.resample('3s', base=df.index[0].second % 3).first()
print (df)
                     vals  new
2019-08-10 12:03:05   1.0  1.0
2019-08-10 12:03:06   NaN  NaN
2019-08-10 12:03:07   NaN  NaN
2019-08-10 12:03:08   3.0  3.0
2019-08-10 12:03:09   4.0  NaN
2019-08-10 12:03:10   NaN  NaN
2019-08-10 12:03:11   NaN  5.0
2019-08-10 12:03:12   5.0  NaN
2019-08-10 12:03:13   NaN  NaN
2019-08-10 12:03:14   1.0  1.0
2019-08-10 12:03:15   NaN  NaN
2019-08-10 12:03:16   NaN  NaN
2019-08-10 12:03:17   6.0  6.0
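
The base argument was deprecated in pandas 1.1 in favor of offset and origin, and it is no longer accepted in pandas 2.0. A minimal sketch of the same alignment on a newer pandas, assuming origin='start' (which anchors the bins at the first timestamp of the index) is an acceptable replacement here; the sample frame is rebuilt from the question's data:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (one row per second).
idx = pd.date_range("2019-08-10 12:03:05", periods=13, freq="s")
vals = [1.0, np.nan, np.nan, 3.0, 4.0, np.nan, np.nan,
        5.0, np.nan, 1.0, np.nan, np.nan, 6.0]
df = pd.DataFrame({"vals": vals}, index=idx)

# origin="start" makes the 3-second bins begin at the first timestamp,
# so no modulo arithmetic on the seconds is needed.
# first() returns the first non-missing value in each bin, labelled
# with the left edge of the bin; assignment aligns on the index.
df["new"] = df["vals"].resample("3s", origin="start").first()

print(df)
```

Rows that are not bin edges get NaN in the new column, the same as in the output above.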

Then interpolate:

df['new'] = df['new'].interpolate()
print (df)
                     vals       new
2019-08-10 12:03:05   1.0  1.000000
2019-08-10 12:03:06   NaN  1.666667
2019-08-10 12:03:07   NaN  2.333333
2019-08-10 12:03:08   3.0  3.000000
2019-08-10 12:03:09   4.0  3.666667
2019-08-10 12:03:10   NaN  4.333333
2019-08-10 12:03:11   NaN  5.000000
2019-08-10 12:03:12   5.0  3.666667
2019-08-10 12:03:13   NaN  2.333333
2019-08-10 12:03:14   1.0  1.000000
2019-08-10 12:03:15   NaN  2.666667
2019-08-10 12:03:16   NaN  4.333333
2019-08-10 12:03:17   6.0  6.000000
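
The question also asks not to interpolate across larger gaps. A plain interpolate() fills every gap, and interpolate(limit=2) would still fill the first two rows of a longer one. One possible sketch, assuming a "gap" means a run of more than max_gap consecutive missing one-second rows (max_gap and the sample series are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical series: a 2-row gap followed by a 4-row gap.
idx = pd.date_range("2019-08-10 12:03:05", periods=9, freq="s")
s = pd.Series([1.0, np.nan, np.nan, 3.0,
               np.nan, np.nan, np.nan, np.nan, 7.0], index=idx)

max_gap = 2  # assumption: longest run of NaN rows that may be filled

is_nan = s.isna()
# Give every run of consecutive equal flags (NaN / not NaN) its own id.
run_id = is_nan.astype(int).diff().ne(0).cumsum()
# Number of NaN rows in each run (0 for runs of real values).
run_len = is_nan.groupby(run_id).transform("sum")

# Interpolate everything between known values, then blank out
# the runs that were longer than max_gap.
result = s.interpolate(limit_area="inside").where(run_len <= max_gap)

print(result)
```

Here the two rows between 1.0 and 3.0 are filled, while the four rows between 3.0 and 7.0 stay NaN.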

Testing by adding 2 seconds to the index:

df.index += pd.Timedelta(2, 's')
df['new'] = df.resample('3s', base=df.index[0].second % 3).first()
print (df)

                     vals  new
2019-08-10 12:03:07   1.0  1.0
2019-08-10 12:03:08   NaN  NaN
2019-08-10 12:03:09   NaN  NaN
2019-08-10 12:03:10   3.0  3.0
2019-08-10 12:03:11   4.0  NaN
2019-08-10 12:03:12   NaN  NaN
2019-08-10 12:03:13   NaN  5.0
2019-08-10 12:03:14   5.0  NaN
2019-08-10 12:03:15   NaN  NaN
2019-08-10 12:03:16   1.0  1.0
2019-08-10 12:03:17   NaN  NaN
2019-08-10 12:03:18   NaN  NaN
2019-08-10 12:03:19   6.0  6.0

df1 = df.set_index(['Time']).interpolate(method='linear').reset_index()
print(df1)

Output

                   Time     vals
0   2019-08-10 12:03:05     1.000000
1   2019-08-10 12:03:06     1.666667
2   2019-08-10 12:03:07     2.333333
3   2019-08-10 12:03:08     3.000000
4   2019-08-10 12:03:09     4.000000
5   2019-08-10 12:03:10     4.333333
6   2019-08-10 12:03:11     4.666667
7   2019-08-10 12:03:12     5.000000
8   2019-08-10 12:03:13     3.000000
9   2019-08-10 12:03:14     1.000000
10  2019-08-10 12:03:15     2.666667
11  2019-08-10 12:03:16     4.333333
12  2019-08-10 12:03:17     6.000000

If you want to replace NaN values with the nearest value, you can use interpolation:

data['value'] = data['value'].interpolate(method='nearest')
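
For the alignment step in the question, the NaN results from resample(...).nearest() come from it picking the nearest row, missing or not. A rough sketch of a different route, assuming it is acceptable to drop the missing rows first and then take the nearest remaining observation within 1 second of each 3-second grid point (the 1-second tolerance is my assumption, not something stated in the question):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (one row per second).
idx = pd.date_range("2019-08-10 12:03:05", periods=13, freq="s")
vals = [1.0, np.nan, np.nan, 3.0, 4.0, np.nan, np.nan,
        5.0, np.nan, 1.0, np.nan, np.nan, 6.0]
df = pd.DataFrame({"vals": vals}, index=idx)

# Keep only real observations so "nearest" cannot land on a NaN row.
obs = df["vals"].dropna()

# Target grid: every 3rd second, starting at the first timestamp.
grid = pd.date_range(df.index[0], df.index[-1], freq="3s")

# Nearest observation to each grid point, at most 1 second away.
aligned = obs.reindex(grid, method="nearest", tolerance=pd.Timedelta("1s"))

# Back onto the per-second index, then linear interpolation.
result = aligned.reindex(df.index).interpolate()

print(result)
```

On the sample data this picks 5.0 (observed at 12:03:12) for the 12:03:11 grid point and ignores the 4.0 at 12:03:09, which is 2 seconds away.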
