Pandas take nearest value to the second and interpolate

Question

I'm looking to convert a data frame of the following format as an example:

>>>df
                      vals
2019-08-10 12:03:05   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:07   NaN
2019-08-10 12:03:08   3.0
2019-08-10 12:03:09   4.0
2019-08-10 12:03:10   NaN
2019-08-10 12:03:11   NaN
2019-08-10 12:03:12   5.0
2019-08-10 12:03:13   NaN
2019-08-10 12:03:14   1.0
2019-08-10 12:03:15   NaN
2019-08-10 12:03:16   NaN
2019-08-10 12:03:17   6.0

into one such as:

>>>df
                      vals
2019-08-10 12:03:05   1.0
2019-08-10 12:03:06   1.667
2019-08-10 12:03:07   2.333
2019-08-10 12:03:08   3.0
2019-08-10 12:03:09   3.667
2019-08-10 12:03:10   4.333
2019-08-10 12:03:11   5.0
2019-08-10 12:03:12   3.667
2019-08-10 12:03:13   2.333
2019-08-10 12:03:14   1.0
2019-08-10 12:03:15   2.667
2019-08-10 12:03:16   4.333
2019-08-10 12:03:17   6.0

Where the dataframe was first aligned to look like the following (taking the closest value to every 3rd value):

>>>df
                      vals
2019-08-10 12:03:05   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:07   NaN
2019-08-10 12:03:08   3.0
2019-08-10 12:03:09   NaN
2019-08-10 12:03:10   NaN
2019-08-10 12:03:11   5.0
2019-08-10 12:03:12   NaN
2019-08-10 12:03:13   NaN
2019-08-10 12:03:14   1.0
2019-08-10 12:03:15   NaN
2019-08-10 12:03:16   NaN
2019-08-10 12:03:17   6.0

And then linearly interpolated between each value to produce the final dataframe. Should there be a gap of more than 2 seconds, I'd like to just not interpolate between those 2 values.

This is what I've tried so far:

df.resample('3s').nearest()

Which produces:

>>> df.resample('3s').nearest()
                     vals
2019-08-10 12:03:03   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:09   4.0
2019-08-10 12:03:12   5.0
2019-08-10 12:03:15   NaN

Also:

>>> df.resample('2s').nearest()
                     vals
2019-08-10 12:03:04   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:08   3.0
2019-08-10 12:03:10   NaN
2019-08-10 12:03:12   5.0
2019-08-10 12:03:14   1.0
2019-08-10 12:03:16   NaN

Which makes it very clear that nearest is a complete lie, or at least a misnomer, because the nearest value to 12:03:10 is quite obviously 4.0. Also, the final value at 2019-08-10 12:03:16 should definitely be 6.0.

This is just trying to align the values to the second; after this, simply interpolating seems to work.

Any help is appreciated.

Answer 1

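I think you need the base parameter to shift the offset of the sampling period. Set it to the seconds of the first index value modulo 3 (because the period is 3 seconds), and take the first non-missing value of each bin with Resampler.first. Note that base was deprecated in pandas 1.1 and removed in 2.0, where the offset or origin parameter replaces it: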

df['new'] = df.resample('3s', base=df.index[0].second % 3).first()
print (df)
                     vals  new
2019-08-10 12:03:05   1.0  1.0
2019-08-10 12:03:06   NaN  NaN
2019-08-10 12:03:07   NaN  NaN
2019-08-10 12:03:08   3.0  3.0
2019-08-10 12:03:09   4.0  NaN
2019-08-10 12:03:10   NaN  NaN
2019-08-10 12:03:11   NaN  5.0
2019-08-10 12:03:12   5.0  NaN
2019-08-10 12:03:13   NaN  NaN
2019-08-10 12:03:14   1.0  1.0
2019-08-10 12:03:15   NaN  NaN
2019-08-10 12:03:16   NaN  NaN
2019-08-10 12:03:17   6.0  6.0
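For newer pandas versions, where base is no longer available, the same alignment can probably be written with origin='start'. The following is only a sketch under that assumption: it rebuilds the example frame from the question, and the variable name aligned is mine, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (one row per second).
idx = pd.date_range("2019-08-10 12:03:05", periods=13, freq="s")
vals = [1.0, np.nan, np.nan, 3.0, 4.0, np.nan, np.nan, 5.0,
        np.nan, 1.0, np.nan, np.nan, 6.0]
df = pd.DataFrame({"vals": vals}, index=idx)

# origin="start" anchors the 3-second bins at the first timestamp,
# which is the effect base=df.index[0].second % 3 had in older pandas.
# first() returns the first non-missing value inside each bin.
aligned = df["vals"].resample("3s", origin="start").first()

# Assignment aligns on the index, so seconds that are not bin starts stay NaN.
df["new"] = aligned
print(df)
```

Note that this takes the first non-missing value in each 3-second bin, which matches the "closest value" in the example data but is not a true nearest-neighbour lookup.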

Then interpolate:

df['new'] = df['new'].interpolate()
print (df)
                     vals       new
2019-08-10 12:03:05   1.0  1.000000
2019-08-10 12:03:06   NaN  1.666667
2019-08-10 12:03:07   NaN  2.333333
2019-08-10 12:03:08   3.0  3.000000
2019-08-10 12:03:09   4.0  3.666667
2019-08-10 12:03:10   NaN  4.333333
2019-08-10 12:03:11   NaN  5.000000
2019-08-10 12:03:12   5.0  3.666667
2019-08-10 12:03:13   NaN  2.333333
2019-08-10 12:03:14   1.0  1.000000
2019-08-10 12:03:15   NaN  2.666667
2019-08-10 12:03:16   NaN  4.333333
2019-08-10 12:03:17   6.0  6.000000
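The question also asks not to interpolate across gaps longer than 2 seconds, which a plain interpolate() does not handle (and limit=2 would still fill the first two seconds of a longer gap). One possible sketch is to measure each run of consecutive NaNs and restore NaN where the run is too long. The helper name interpolate_short_gaps and its max_gap parameter are hypothetical, introduced here only for illustration.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(s: pd.Series, max_gap: int = 2) -> pd.Series:
    """Linearly interpolate, but leave NaN runs longer than max_gap untouched."""
    is_nan = s.isna()
    # A new run starts wherever the NaN flag changes from the previous row.
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Length of the run each row belongs to.
    run_len = is_nan.groupby(run_id).transform("size")
    # Fill only gaps that have valid values on both sides.
    filled = s.interpolate(limit_area="inside")
    # Put NaN back wherever the original gap was too long.
    return filled.mask(is_nan & (run_len > max_gap))

s = pd.Series([1.0, np.nan, np.nan, 3.0, np.nan, np.nan, np.nan, 6.0])
result = interpolate_short_gaps(s, max_gap=2)
print(result)
```

With one row per second, max_gap=2 fills the two missing seconds between aligned values but leaves any longer stretch of missing data as NaN.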

Testing after adding 2 seconds to the index:

df.index += pd.Timedelta(2, 's')
df['new'] = df.resample('3s', base=df.index[0].second % 3).first()
print (df)

                     vals  new
2019-08-10 12:03:07   1.0  1.0
2019-08-10 12:03:08   NaN  NaN
2019-08-10 12:03:09   NaN  NaN
2019-08-10 12:03:10   3.0  3.0
2019-08-10 12:03:11   4.0  NaN
2019-08-10 12:03:12   NaN  NaN
2019-08-10 12:03:13   NaN  5.0
2019-08-10 12:03:14   5.0  NaN
2019-08-10 12:03:15   NaN  NaN
2019-08-10 12:03:16   1.0  1.0
2019-08-10 12:03:17   NaN  NaN
2019-08-10 12:03:18   NaN  NaN
2019-08-10 12:03:19   6.0  6.0
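As for the NaN values that nearest() returned in the question: as far as I can tell, nearest() picks the nearest row of the index, whether or not that row holds a value. Because the frame has a row for every second, the row at 12:03:10 itself (a NaN) is the nearest one. A sketch that drops the empty rows first, so only real observations can be picked (the name nearest_2s is just for illustration):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (one row per second).
idx = pd.date_range("2019-08-10 12:03:05", periods=13, freq="s")
vals = [1.0, np.nan, np.nan, 3.0, 4.0, np.nan, np.nan, 5.0,
        np.nan, 1.0, np.nan, np.nan, 6.0]
df = pd.DataFrame({"vals": vals}, index=idx)

# Drop rows without a value, then take the nearest remaining observation
# for each 2-second bin label.
nearest_2s = df.dropna().resample("2s").nearest()
print(nearest_2s)
```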

Answer 2

df1=df.set_index(['Time']).interpolate(method='linear').reset_index()
print(df1)

Output

                   Time     vals
0   2019-08-10 12:03:05     1.000000
1   2019-08-10 12:03:06     1.666667
2   2019-08-10 12:03:07     2.333333
3   2019-08-10 12:03:08     3.000000
4   2019-08-10 12:03:09     4.000000
5   2019-08-10 12:03:10     4.333333
6   2019-08-10 12:03:11     4.666667
7   2019-08-10 12:03:12     5.000000
8   2019-08-10 12:03:13     3.000000
9   2019-08-10 12:03:14     1.000000
10  2019-08-10 12:03:15     2.666667
11  2019-08-10 12:03:16     4.333333
12  2019-08-10 12:03:17     6.000000

Answer 3

If you want to replace the NaN values with the nearest value, you can use interpolation:

data['value'] = data['value'].interpolate(method='nearest')

solution1
1 ACCEPTED 2019-09-09 08:14:35

solution2
1 2019-09-09 08:17:16

solution3
0 2019-09-09 08:04:07