
Interpolating a data set in pandas while ignoring missing data

I have a question of how to get interpolated data across several different "blocks" of time. In a nutshell, I have a dataset like this:

>>> import pandas as pd
>>> test_data = pd.read_csv("test_data.csv")
>>> test_data
     ID  Condition_num Condition_type  Rating  Timestamp_ms
0   101              1        Active     58.0            30
1   101              1        Active     59.0            60
2   101              1        Active     65.0            90
3   101              1        Active     70.0           120
4   101              1        Active     80.0           150
5   101              2          Break     NaN           180
6   101              3       Active_2    55.0           210
7   101              3       Active_2    60.0           240
8   101              3       Active_2    63.0           270
9   101              3       Active_2    70.0           300
10  101              4          Break     NaN           330
11  101              5       Active_3    69.0           360
12  101              5       Active_3    71.0           390
13  101              5       Active_3    50.0           420
14  101              5       Active_3    41.0           450
15  101              5       Active_3    43.0           480

I need to "resample" the final column to another time interval (eg 40 ms) to match it to an external data set. I have been using the following code:

#Setting the column with timestamps as a datetime with the correct units, then set index
test_data['Timestamp_ms'] = pd.to_datetime(test_data['Timestamp_ms'], unit='ms')
test_data = test_data.set_index('Timestamp_ms')

#Resample index to start at 0, upsample to the highest resolution (1ms), then resample to 40ms
test_data = test_data.reindex(
    pd.date_range(start=pd.to_datetime(0, unit='ms'), end=test_data.index.max(), freq='ms')
)
test_data = test_data.resample('1ms').interpolate().resample('40ms').interpolate()

#Round ratings to integers
test_data.Rating = test_data.Rating.round()

Which gives me this:

                            ID  Condition_num Condition_type     Rating
1970-01-01 00:00:00.000    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.040  101.0       1.000000            NaN  58.333333
1970-01-01 00:00:00.080  101.0       1.000000            NaN  63.000000
1970-01-01 00:00:00.120  101.0       1.000000        Active   70.000000
1970-01-01 00:00:00.160  101.0       1.333333            NaN  75.833333
1970-01-01 00:00:00.200  101.0       2.666667            NaN  59.166667
1970-01-01 00:00:00.240  101.0       3.000000       Active_2  60.000000
1970-01-01 00:00:00.280  101.0       3.000000            NaN  65.333333
1970-01-01 00:00:00.320  101.0       3.666667            NaN  69.666667
1970-01-01 00:00:00.360  101.0       5.000000       Active_3  69.000000
1970-01-01 00:00:00.400  101.0       5.000000            NaN  64.000000
1970-01-01 00:00:00.440  101.0       5.000000            NaN  44.000000
1970-01-01 00:00:00.480  101.0       5.000000       Active_3  43.000000

The only issue is I cannot figure out which ratings are happening during the "Active" conditions, and whether the ratings I am seeing are caused by extrapolation across the "breaks", where there are no ratings. In so many words, I want the interpolation within the "Active" blocks, but I also want everything aligned to the beginning of the whole data set.

I have tried entering zero ratings for the NaNs and interpolating from the top of each condition, but that seems only to make the problem worse by altering the ratings even more.
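A toy example (timestamps and values made up for illustration, not taken from the real data set) shows why the zero-filling attempt distorts things: plain interpolation bridges a break linearly between its neighbours, but zero-filling first drags the upsampled values toward 0 on both sides of the break.

```python
import pandas as pd

# Two ratings on either side of a "Break" row (illustrative values only).
s = pd.Series(
    [80.0, float("nan"), 55.0],
    index=pd.to_datetime([150, 180, 210], unit="ms"),
)

# Plain interpolation bridges the break linearly between 80 and 55:
bridged = s.resample("1ms").interpolate()
print(bridged.loc[pd.to_datetime(180, unit="ms")])  # 67.5

# Zero-filling first pulls the interpolated curve toward 0 instead:
zeroed = s.fillna(0.0).resample("1ms").interpolate()
print(zeroed.loc[pd.to_datetime(165, unit="ms")])  # 40.0
```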

Any advice would be greatly appreciated!

I think you need to do all of your logic inside of a groupby, IIUC:

mask = df.Condition_type.ne('Break')
df2 = (df[mask].groupby('Condition_type') # Groupby Condition_type, excluding "Break" rows.
                .apply(lambda x: x.resample('1ms') # To each group... resample it.
                                  .interpolate()   # Interpolate
                                  .ffill()         # Fill values, this just applies to the Condition_type.
                                  .resample('40ms')# Resample to 40ms
                                  .asfreq())       # No need to interpolate in this direction.
                .reset_index('Condition_type', drop=True)) # We no longer need this extra index~

# Force the index to our resample'd interval, this will reveal the breaks:
df2 = df2.asfreq('40ms')
print(df2)

Output:

                            ID  Condition_num Condition_type     Rating
Timestamp_ms
1970-01-01 00:00:00.000    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.040  101.0            1.0         Active  58.333333
1970-01-01 00:00:00.080  101.0            1.0         Active  63.000000
1970-01-01 00:00:00.120  101.0            1.0         Active  70.000000
1970-01-01 00:00:00.160    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.200    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.240  101.0            3.0       Active_2  60.000000
1970-01-01 00:00:00.280  101.0            3.0       Active_2  65.333333
1970-01-01 00:00:00.320    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.360  101.0            5.0       Active_3  69.000000
1970-01-01 00:00:00.400  101.0            5.0       Active_3  64.000000
1970-01-01 00:00:00.440  101.0            5.0       Active_3  44.000000
1970-01-01 00:00:00.480  101.0            5.0       Active_3  43.000000
