Resampling and filling missing data in pandas

I have a raw dataset that looks like this:

import pandas as pd

df = pd.DataFrame({'speed': [66.8, 67, 67.1, 70, 69],
                   'time': ['2017-08-09T05:41:30.168Z', '2017-08-09T05:41:31.136Z', '2017-08-09T05:41:31.386Z', '2017-08-09T05:41:31.103Z', '2017-08-09T05:41:35.563Z']})

I can preprocess it so that the timestamps are truncated to whole seconds (fractional seconds removed):

df['time'] = pd.to_datetime(df['time'])
# Truncate each timestamp to the whole second
df['time'] = df['time'].dt.floor('s')

>>> df
   speed                time
0   66.8 2017-08-09 05:41:30
1   67.0 2017-08-09 05:41:31
2   67.1 2017-08-09 05:41:31
3   70.0 2017-08-09 05:41:31
4   69.0 2017-08-09 05:41:35

I now need to resample the data so that entries that arrived at the same timestamp are averaged together, and timestamps that received no data take the last available value. For example:

   speed                time
0   66.80 2017-08-09 05:41:30
1   68.03 2017-08-09 05:41:31
2   70.00 2017-08-09 05:41:32
3   70.00 2017-08-09 05:41:33
4   70.00 2017-08-09 05:41:34
5   69.00 2017-08-09 05:41:35

I understand this might involve groupby and resample, but as a beginner I am struggling with both. Any ideas on how to proceed?

I have tried this, but it gives the wrong result (the seconds with no data are missing):

df.groupby(df['time'].dt.second).mean()
          speed
time           
30    66.800000
31    68.033333
35    69.000000
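
The groupby attempt above keys on the seconds component alone, so it produces no rows for seconds 32 to 34 (and would merge readings from different minutes). As a rough sketch of how groupby could still be used, assuming the timestamps have already been truncated to whole seconds, one option is to group on the full timestamp and then reindex onto a complete one-second range. The names `per_second`, `full_index` and `filled` are illustrative and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'speed': [66.8, 67, 67.1, 70, 69],
    'time': ['2017-08-09T05:41:30.168Z', '2017-08-09T05:41:31.136Z',
             '2017-08-09T05:41:31.386Z', '2017-08-09T05:41:31.103Z',
             '2017-08-09T05:41:35.563Z'],
})
df['time'] = pd.to_datetime(df['time']).dt.floor('s')

# Average readings that share the same full timestamp (not just the seconds part)
per_second = df.groupby('time')['speed'].mean()

# Build every second between the first and last reading
full_index = pd.date_range(per_second.index.min(), per_second.index.max(), freq='s')

# Seconds without data take the most recent earlier per-second mean
filled = per_second.reindex(full_index, method='ffill')
print(filled)
```

This carries the per-second mean forward into the empty seconds, which is the same behaviour as resampling followed by a forward fill.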
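
Answer: resample into one-second bins, average each bin, then forward-fill the bins that received no data:

In [279]: df.resample('1s', on='time').mean().ffill()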
Out[279]:
                         speed
time
2017-08-09 05:41:30  66.800000
2017-08-09 05:41:31  68.033333
2017-08-09 05:41:32  68.033333
2017-08-09 05:41:33  68.033333
2017-08-09 05:41:34  68.033333
2017-08-09 05:41:35  69.000000
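
Note that the output above carries the per-second mean (68.03) forward, whereas the expected output in the question shows 70.00 for seconds 32 to 34, which is the last raw reading listed for second 31. If that is the intended behaviour, a possible sketch is to fill the empty bins from the forward-filled last reading of each bin instead. This assumes "last available value" means the last row in the original row order; the names `bins`, `bin_mean`, `bin_last` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'speed': [66.8, 67, 67.1, 70, 69],
    'time': ['2017-08-09T05:41:30.168Z', '2017-08-09T05:41:31.136Z',
             '2017-08-09T05:41:31.386Z', '2017-08-09T05:41:31.103Z',
             '2017-08-09T05:41:35.563Z'],
})
df['time'] = pd.to_datetime(df['time']).dt.floor('s')

bins = df.resample('1s', on='time')['speed']

# Mean of each one-second bin; bins without data are NaN
bin_mean = bins.mean()

# Last reading in each bin (row order), carried forward into empty bins
bin_last = bins.last().ffill()

# Keep the mean where data exists, otherwise use the last earlier reading
result = bin_mean.fillna(bin_last)
print(result.round(2))
```

On the sample data this should give 66.80, 68.03, 70.00, 70.00, 70.00, 69.00, matching the table in the question.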
