Interpolating a data set in pandas while ignoring missing data
I have a question about how to get interpolated data across several different "blocks" of time. In a nutshell, I have a dataset like this:
>>> import pandas as pd
>>> test_data = pd.read_csv("test_data.csv")
>>> test_data
     ID  Condition_num Condition_type  Rating  Timestamp_ms
0   101              1         Active    58.0            30
1   101              1         Active    59.0            60
2   101              1         Active    65.0            90
3   101              1         Active    70.0           120
4   101              1         Active    80.0           150
5   101              2          Break     NaN           180
6   101              3       Active_2    55.0           210
7   101              3       Active_2    60.0           240
8   101              3       Active_2    63.0           270
9   101              3       Active_2    70.0           300
10  101              4          Break     NaN           330
11  101              5       Active_3    69.0           360
12  101              5       Active_3    71.0           390
13  101              5       Active_3    50.0           420
14  101              5       Active_3    41.0           450
15  101              5       Active_3    43.0           480
I need to "resample" the final column to another time interval (e.g. 40 ms) to match it to an external data set. I have been using the following code:
# Convert the timestamp column to datetimes with the correct unit, then set it as the index
test_data['Timestamp_ms'] = pd.to_datetime(test_data['Timestamp_ms'], unit='ms')
test_data = test_data.set_index('Timestamp_ms')
# Reindex so the index starts at 0 with 1 ms resolution, interpolate, then resample to 40 ms
test_data = test_data.reindex(
    pd.date_range(start=pd.to_datetime(0, unit='ms'), end=test_data.index.max(), freq='ms')
)
test_data = test_data.resample('1ms').interpolate().resample('40ms').interpolate()
# Round the values to integers (column assumed to be 'Rating'; the output below is shown unrounded)
test_data['Rating'] = test_data['Rating'].round()
Which gives me this:
                            ID  Condition_num Condition_type     Rating
1970-01-01 00:00:00.000    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.040  101.0       1.000000            NaN  58.333333
1970-01-01 00:00:00.080  101.0       1.000000            NaN  63.000000
1970-01-01 00:00:00.120  101.0       1.000000         Active  70.000000
1970-01-01 00:00:00.160  101.0       1.333333            NaN  75.833333
1970-01-01 00:00:00.200  101.0       2.666667            NaN  59.166667
1970-01-01 00:00:00.240  101.0       3.000000       Active_2  60.000000
1970-01-01 00:00:00.280  101.0       3.000000            NaN  65.333333
1970-01-01 00:00:00.320  101.0       3.666667            NaN  69.666667
1970-01-01 00:00:00.360  101.0       5.000000       Active_3  69.000000
1970-01-01 00:00:00.400  101.0       5.000000            NaN  64.000000
1970-01-01 00:00:00.440  101.0       5.000000            NaN  44.000000
1970-01-01 00:00:00.480  101.0       5.000000       Active_3  43.000000
The only issue is that I cannot tell which ratings occur during the "Active" conditions, or whether the ratings I am seeing are artifacts of interpolating across the "breaks", where there are no ratings. In short, I want the interpolation to happen within the "Active" blocks, while still keeping everything aligned to the beginning of the whole data set.
I have tried entering zero ratings for the NaN values and interpolating from the start of each condition, but that only seems to make the problem worse by altering the ratings even more.
Any advice would be greatly appreciated!
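A small arithmetic sketch makes both effects concrete. It takes the last rating before the first break (80.0 at 150 ms) and the first rating after it (55.0 at 210 ms) from the sample data, and computes the value a linear interpolation would produce at 160 ms, a grid point that falls outside every "Active" block. The helper `lerp` is a hypothetical name introduced only for this illustration.

```python
def lerp(t, t0, v0, t1, v1):
    """Linearly interpolate the value at time t between (t0, v0) and (t1, v1)."""
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Last rating before the first break and first rating after it (from the sample data).
before = (150, 80.0)
after = (210, 55.0)

# Interpolating straight across the break, as the resample chain above does.
bridged = lerp(160, *before, *after)

# Placing a zero rating on the break row at 180 ms instead.
zero_filled = lerp(160, *before, 180, 0.0)

print(round(bridged, 6), round(zero_filled, 6))
```

The bridged value is the 75.833333 that appears at 160 ms in the output above, so that row is produced purely by the two blocks on either side of the break. The zero-filled variant pulls the same point down further still, which is consistent with the observation that zero ratings alter the results even more.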
I think you need to do all of your logic inside of a groupby, IIUC:
# df is assumed to be the question's test_data with Timestamp_ms set as a DatetimeIndex
mask = df.Condition_type.ne('Break')
df2 = (df[mask].groupby('Condition_type')            # Group by Condition_type, excluding "Break" rows.
         .apply(lambda x: x.resample('1ms')          # For each group: upsample to 1 ms,
                           .interpolate()            # interpolate,
                           .ffill()                  # forward-fill; this only affects Condition_type,
                           .resample('40ms')         # resample to 40 ms,
                           .asfreq())                # and take the values as-is; no need to interpolate again.
         .reset_index('Condition_type', drop=True))  # The extra group-key index level is no longer needed.
# Force the index onto the resampled 40 ms interval; this reveals the breaks as NaN rows:
df2 = df2.asfreq('40ms')
print(df2)
Output:
                            ID  Condition_num Condition_type     Rating
Timestamp_ms
1970-01-01 00:00:00.000    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.040  101.0            1.0         Active  58.333333
1970-01-01 00:00:00.080  101.0            1.0         Active  63.000000
1970-01-01 00:00:00.120  101.0            1.0         Active  70.000000
1970-01-01 00:00:00.160    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.200    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.240  101.0            3.0       Active_2  60.000000
1970-01-01 00:00:00.280  101.0            3.0       Active_2  65.333333
1970-01-01 00:00:00.320    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.360  101.0            5.0       Active_3  69.000000
1970-01-01 00:00:00.400  101.0            5.0       Active_3  64.000000
1970-01-01 00:00:00.440  101.0            5.0       Active_3  44.000000
1970-01-01 00:00:00.480  101.0            5.0       Active_3  43.000000
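For readers who want to check the idea without `test_data.csv`, here is a minimal, self-contained sketch of the same "interpolate only inside each block, then align to one global grid" logic. It is not the code from the answer above: it swaps the `resample`/`interpolate` chain for `numpy.interp` on the raw millisecond timestamps, and it rebuilds the sample data inline from the table in the question. The names `grid`, `pieces`, and `resampled` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt inline from the question (stands in for test_data.csv).
df = pd.DataFrame({
    "ID": 101,
    "Condition_num": [1] * 5 + [2] + [3] * 4 + [4] + [5] * 5,
    "Condition_type": (["Active"] * 5 + ["Break"] + ["Active_2"] * 4
                       + ["Break"] + ["Active_3"] * 5),
    "Rating": [58, 59, 65, 70, 80, np.nan, 55, 60, 63, 70, np.nan,
               69, 71, 50, 41, 43],
    "Timestamp_ms": list(range(30, 481, 30)),
})

# One global 40 ms grid that starts at 0, so every block shares the same alignment.
grid = np.arange(0, df["Timestamp_ms"].max() + 1, 40)

active = df[df["Condition_type"] != "Break"]
pieces = []
for (num, ctype), block in active.groupby(["Condition_num", "Condition_type"]):
    t = block["Timestamp_ms"].to_numpy()
    # Keep only the grid points inside this block, so nothing bridges a break.
    inside = grid[(grid >= t.min()) & (grid <= t.max())]
    pieces.append(pd.DataFrame({
        "Timestamp_ms": inside,
        "Condition_num": num,
        "Condition_type": ctype,
        "Rating": np.interp(inside, t, block["Rating"].to_numpy()),
    }))

# Reindexing onto the full grid leaves NaN rows wherever no block covers a grid point.
resampled = pd.concat(pieces).set_index("Timestamp_ms").reindex(grid)
print(resampled)
```

Under these assumptions the grid points at 0, 160, 200, and 320 ms stay NaN, and the ratings at the remaining grid points agree with the output above. The index stays in integer milliseconds here; it can be converted with `pd.to_datetime(..., unit='ms')` if a datetime index is needed to match the external data set.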