
pandas DataFrame interpolating/resampling daily data on a per-group basis

I've got a dataframe that looks like this:

userid      date          count
a           2016-12-01    4
a           2016-12-03    5
a           2016-12-05    1
b           2016-11-17    14
b           2016-11-18    15
b           2016-11-23    4
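
For anyone who wants to reproduce this, here is a minimal sketch that builds the frame above. The variable name df is an assumption, and the dates are parsed with pd.to_datetime so that they can later serve as a DatetimeIndex.

```python
import pandas as pd

# Sample data copied from the table above; `df` is a hypothetical name.
df = pd.DataFrame({
    "userid": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime([
        "2016-12-01", "2016-12-03", "2016-12-05",
        "2016-11-17", "2016-11-18", "2016-11-23",
    ]),
    "count": [4, 5, 1, 14, 15, 4],
})

print(df)
```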

The first column is a user id, the second column is a date (resulting from a groupby(pd.TimeGrouper('d'))), and the third column is a daily count. For each user, I would like any days missing between that user's min and max date to be filled in with a count of 0. So if I start with a data frame like the one above, I want to end up with a data frame like this:

   userid      date          count
    a           2016-12-01    4
    a           2016-12-02    0
    a           2016-12-03    5
    a           2016-12-04    0
    a           2016-12-05    1
    b           2016-11-17    14
    b           2016-11-18    15
    b           2016-11-19    0
    b           2016-11-20    0
    b           2016-11-21    0
    b           2016-11-22    0
    b           2016-11-23    4

I know that pandas offers various ways to resample a data frame (with options to interpolate forwards, backwards, or by averaging), but how would I do it in the sense above, where I want a continuous time series for each userid and the date range of the series differs per user?

Here's what I tried that hasn't worked:

grouped_users = user_daily_counts.groupby('user').set_index('timestamp').resample('d', fill_method = None)

However, this throws an error: AttributeError: Cannot access callable attribute 'set_index' of 'DataFrameGroupBy' objects, try using the 'apply' method. I'm not sure how I'd be able to use the apply method while bringing forward all columns as I'd like to do.

Thanks for any suggestions!

You can use groupby with resample, but you first need a DatetimeIndex, created with set_index (this requires pandas 0.18.1 or higher).

Then convert each group to daily frequency with asfreq and replace the resulting NaN values with 0 using fillna.

Finally, drop the redundant userid column (the user id is already in the index) and call reset_index:

# The trailing backslashes continue the chained call across lines
df = df.set_index('date') \
       .groupby('userid') \
       .resample('D') \
       .asfreq() \
       .fillna(0) \
       .drop('userid', axis=1) \
       .reset_index()

print (df)
   userid       date  count
0       a 2016-12-01    4.0
1       a 2016-12-02    0.0
2       a 2016-12-03    5.0
3       a 2016-12-04    0.0
4       a 2016-12-05    1.0
5       b 2016-11-17   14.0
6       b 2016-11-18   15.0
7       b 2016-11-19    0.0
8       b 2016-11-20    0.0
9       b 2016-11-21    0.0
10      b 2016-11-22    0.0
11      b 2016-11-23    4.0
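
To see why the counts come back as floats, it may help to run the asfreq and fillna steps on a single user in isolation. This is only a sketch: the names one_user, with_gaps and filled are made up for illustration, and it uses Series.asfreq directly rather than the groupby chain above.

```python
import pandas as pd

# User "a" only, taken from the sample data in the question.
df_a = pd.DataFrame({
    "date": pd.to_datetime(["2016-12-01", "2016-12-03", "2016-12-05"]),
    "count": [4, 5, 1],
})

one_user = df_a.set_index("date")["count"]

# asfreq("D") inserts a row for every missing day between the first and
# last date; those rows hold NaN, which forces the dtype to float.
with_gaps = one_user.asfreq("D")

# fillna(0) then replaces the NaN placeholders with zero counts.
filled = with_gaps.fillna(0)

print(with_gaps)
print(filled)
```

The NaN placeholders are the reason the count column above shows 4.0, 0.0 and so on instead of integers.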

If you want the count column to have an integer dtype, add astype:

df = df.set_index('date') \
       .groupby('userid') \
       .resample('D') \
       .asfreq() \
       .fillna(0) \
       .drop('userid', axis=1) \
       .astype(int) \
       .reset_index()

print (df)
   userid       date  count
0       a 2016-12-01      4
1       a 2016-12-02      0
2       a 2016-12-03      5
3       a 2016-12-04      0
4       a 2016-12-05      1
5       b 2016-11-17     14
6       b 2016-11-18     15
7       b 2016-11-19      0
8       b 2016-11-20      0
9       b 2016-11-21      0
10      b 2016-11-22      0
11      b 2016-11-23      4
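
A variant that may be more robust across pandas versions is sketched below. As far as I know, newer pandas releases are moving towards leaving the grouping column out of the result of groupby(...).resample(...), in which case the .drop('userid', axis=1) step above could raise a KeyError. Selecting the count column before resampling avoids the drop entirely. This sketch also swaps asfreq plus fillna for sum, which returns 0 for days with no rows and keeps the integer dtype. Note the assumption that each user has at most one row per day; sum would add up duplicates. The name result is hypothetical.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "userid": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime([
        "2016-12-01", "2016-12-03", "2016-12-05",
        "2016-11-17", "2016-11-18", "2016-11-23",
    ]),
    "count": [4, 5, 1, 14, 15, 4],
})

result = (
    df.set_index("date")          # resample needs a DatetimeIndex
      .groupby("userid")["count"] # resample only the count column
      .resample("D")              # daily bins between each user's min and max date
      .sum()                      # empty days sum to 0, so no fillna is needed
      .reset_index()              # turn the (userid, date) index back into columns
)

print(result)
```

Because the resampling happens inside each group, every user gets their own date range, as in the answer above.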
