
What is the best way to remove interpolation from time series data in Pandas?

Is there a better way to remove interpolated data from time series data in a pandas DataFrame?

I have time series data in which the missing values were filled by interpolation. I would like to remove the interpolated values and replace them with np.nan again.

Input Data:

Index                   Column_one     Column_two     
2017:10:03 03:44:00     13.61504936     14.65000057
2017:10:03 03:45:00     13.61504936     14.65000057
2017:10:03 03:46:00     13.61504936     14.65000057
2017:10:03 03:47:00     13.61504936     np.nan
2017:10:03 03:48:00     13.60000038     np.nan
2017:10:03 03:49:00     np.nan          np.nan
2017:10:03 03:50:00     np.nan          np.nan
2017:10:03 03:51:00     np.nan          np.nan
2017:10:03 03:52:00     np.nan          14.80000019
2017:10:03 03:53:00     np.nan          14.80000019
2017:10:03 03:54:00     14.21253681     14.80000019
2017:10:03 03:55:00     14.24253273     14.80000019

All the missing values are filled by interpolation:

data_interpolated = data.interpolate()

Interpolated Data:

Index                   Column_one     Column_two     
2017:10:03 03:44:00     13.61504936     14.65000057
2017:10:03 03:45:00     13.61504936     14.65000057
2017:10:03 03:46:00     13.61504936     14.65000057
2017:10:03 03:47:00     13.61504936     14.67500051
2017:10:03 03:48:00     13.60000038     14.70000044
2017:10:03 03:49:00     13.70208979     14.72500038
2017:10:03 03:50:00     13.80417919     14.75000032
2017:10:03 03:51:00     13.9062686      14.77500025
2017:10:03 03:52:00     14.008358       14.80000019
2017:10:03 03:53:00     14.11044741     14.80000019
2017:10:03 03:54:00     14.21253681     14.80000019
2017:10:03 03:55:00     14.24253273     14.80000019

Now I would like to remove the interpolated values and get the initial data set back.

Desired Output:

Index                   Column_one     Column_two     
2017:10:03 03:44:00     13.61504936     14.65000057
2017:10:03 03:45:00     13.61504936     14.65000057
2017:10:03 03:46:00     13.61504936     14.65000057
2017:10:03 03:47:00     13.61504936     np.nan
2017:10:03 03:48:00     13.60000038     np.nan
2017:10:03 03:49:00     np.nan          np.nan
2017:10:03 03:50:00     np.nan          np.nan
2017:10:03 03:51:00     np.nan          np.nan
2017:10:03 03:52:00     np.nan          14.80000019
2017:10:03 03:53:00     np.nan          14.80000019
2017:10:03 03:54:00     14.21253681     14.80000019
2017:10:03 03:55:00     14.24253273     14.80000019

Please let me know if there is a good way to implement this in Pandas or NumPy.
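If the original frame is still available at the point where interpolate() is called, one simple option is to record where the gaps were and put NaN back afterwards. The sketch below assumes exactly that. The names was_missing and restored are made up for illustration, and the small frame only reuses a few values from the input above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (a few values taken from the question's input).
data = pd.DataFrame(
    {
        "Column_one": [13.61504936, 13.60000038, np.nan, np.nan, 14.21253681],
        "Column_two": [14.65000057, np.nan, np.nan, 14.80000019, 14.80000019],
    }
)

# Remember where the gaps were before filling them.
was_missing = data.isna()

data_interpolated = data.interpolate()

# Put NaN back wherever the original frame had a gap.
restored = data_interpolated.mask(was_missing)
```

This only works when the mask is saved before interpolating. If only the interpolated frame exists, the gaps have to be guessed from the values themselves.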

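I can offer something like this:

import numpy as np

# Assumes df has a default integer index and value columns 'one' and 'two'.
# Step 1: store the first difference of each column (row 0 has no predecessor).
for i in range(1, len(df)):
    df.loc[i, 'lin_one'] = df.loc[i, 'one'] - df.loc[i - 1, 'one']
    df.loc[i, 'lin_two'] = df.loc[i, 'two'] - df.loc[i - 1, 'two']

# Step 2: set a value to NaN where consecutive differences are not equal
# and their difference is below 0.003.
for i in range(len(df) - 1):
    if df.lin_one[i] - df.lin_one[i+1] != 0 and df.lin_one[i] - df.lin_one[i+1] < 0.003:
        df.loc[i, 'one'] = np.nan
    if df.lin_two[i] - df.lin_two[i+1] != 0 and df.lin_two[i] - df.lin_two[i+1] < 0.003:
        df.loc[i, 'two'] = np.nan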

This will produce the following output:

                  index        one   lin_one        two  lin_two
0   2017:10:03 03:44:00  13.615049  0.000000  14.650001    0.000
1   2017:10:03 03:45:00  13.615049  0.000000  14.650001    0.000
2   2017:10:03 03:46:00  13.615049  0.000000        NaN    0.000
3   2017:10:03 03:47:00  13.615049  0.000000        NaN    0.025
4   2017:10:03 03:48:00        NaN -0.015049        NaN    0.025
5   2017:10:03 03:49:00        NaN  0.102089        NaN    0.025
6   2017:10:03 03:50:00        NaN  0.102089        NaN    0.025
7   2017:10:03 03:51:00        NaN  0.102089        NaN    0.025
8   2017:10:03 03:52:00        NaN  0.102089  14.800000    0.025
9   2017:10:03 03:53:00        NaN  0.102089  14.800000    0.000
10  2017:10:03 03:54:00  14.212537  0.102089  14.800000    0.000
11  2017:10:03 03:55:00  14.242533  0.029996  14.800000    0.000

Then you can delete the helper columns lin_one and lin_two:

del df['lin_one']
del df['lin_two']

Note that this method also removes one value that was not interpolated.
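A related heuristic, not the same rule as the loops above, can be written without explicit loops. It flags a point when the step into it and the step out of it are equal (within a tolerance) and nonzero, which is what linear interpolation produces inside a gap. This is only a sketch: the function name drop_linear_runs and the tolerance 1e-6 are assumptions, genuine measurements that happen to lie on a straight line would also be flagged, and gaps filled between two equal values would be missed.

```python
import pandas as pd

def drop_linear_runs(frame, tol=1e-6):
    """Set to NaN every point whose backward and forward steps match."""
    step_back = frame.diff()            # v[i] - v[i-1]
    step_fwd = step_back.shift(-1)      # v[i+1] - v[i]
    collinear = (step_fwd - step_back).abs() < tol
    moving = step_back.abs() > tol      # ignore flat stretches
    return frame.mask(collinear & moving)

# The interpolated data from the question.
interpolated = pd.DataFrame(
    {
        "Column_one": [13.61504936, 13.61504936, 13.61504936, 13.61504936,
                       13.60000038, 13.70208979, 13.80417919, 13.9062686,
                       14.008358, 14.11044741, 14.21253681, 14.24253273],
        "Column_two": [14.65000057, 14.65000057, 14.65000057, 14.67500051,
                       14.70000044, 14.72500038, 14.75000032, 14.77500025,
                       14.80000019, 14.80000019, 14.80000019, 14.80000019],
    },
    index=pd.date_range("2017-10-03 03:44", periods=12, freq="min"),
)

cleaned = drop_linear_runs(interpolated)
```

On this particular data the flagged rows are 03:49 to 03:53 in Column_one and 03:47 to 03:51 in Column_two, which matches the desired output. Other data sets may need a different tolerance.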
