Fill in missing dates of groupby
Imagine I have a dataframe that looks like this:
ID DATE VALUE
1 31-01-2006 5
1 28-02-2006 5
1 31-05-2006 10
1 30-06-2006 11
2 31-01-2006 5
2 31-02-2006 5
2 31-03-2006 5
2 31-04-2006 5
As you can see, this is panel data with multiple entries on the same date for different IDs. What I want to do is fill in the missing dates for each ID. For ID "1", for example, there is a jump in months between the second and third entries.
I would like a dataframe that looks like this:
ID DATE VALUE
1 31-01-2006 5
1 28-02-2006 5
1 31-03-2006 NA
1 30-04-2006 NA
1 31-05-2006 10
1 30-06-2006 11
2 31-01-2006 5
2 31-02-2006 5
2 31-03-2006 5
2 31-04-2006 5
I have no idea how to do this, since I cannot index by date when there are duplicate dates.
One way is to use pivot_table and then unstack:
In [11]: df.pivot_table("VALUE", "DATE", "ID")
Out[11]:
ID 1 2
DATE
28-02-2006 5.0 NaN
30-06-2006 11.0 NaN
31-01-2006 5.0 5.0
31-02-2006 NaN 5.0
31-03-2006 NaN 5.0
31-04-2006 NaN 5.0
31-05-2006 10.0 NaN
In [12]: df.pivot_table("VALUE", "DATE", "ID").unstack().reset_index()
Out[12]:
ID DATE 0
0 1 28-02-2006 5.0
1 1 30-06-2006 11.0
2 1 31-01-2006 5.0
3 1 31-02-2006 NaN
4 1 31-03-2006 NaN
5 1 31-04-2006 NaN
6 1 31-05-2006 10.0
7 2 28-02-2006 NaN
8 2 30-06-2006 NaN
9 2 31-01-2006 5.0
10 2 31-02-2006 5.0
11 2 31-03-2006 5.0
12 2 31-04-2006 5.0
13 2 31-05-2006 NaN
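A self-contained sketch of the same idea follows. The sample frame is a hypothetical stand-in for the question's data (valid calendar dates only), and parsing DATE with pd.to_datetime is an added assumption so that rows sort chronologically rather than as strings. Note that pivot_table averages duplicate (ID, DATE) pairs by default.

```python
import pandas as pd

# Hypothetical sample data, modelled on the question but with valid dates only.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2],
    "DATE": ["31-01-2006", "28-02-2006", "31-05-2006", "30-06-2006",
             "31-01-2006", "28-02-2006"],
    "VALUE": [5, 5, 10, 11, 5, 5],
})

# Parse the day-month-year strings so dates sort in calendar order.
df["DATE"] = pd.to_datetime(df["DATE"], format="%d-%m-%Y")

# Wide form: one row per date, one column per ID; missing cells become NaN.
wide = df.pivot_table(values="VALUE", index="DATE", columns="ID")

# Back to long form, naming the value column instead of leaving it as 0.
filled = wide.unstack().reset_index(name="VALUE")
print(filled)
```

Passing name="VALUE" to reset_index avoids the anonymous "0" column seen in the output above.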
An alternative, perhaps slightly more efficient, way is to reindex against a MultiIndex built with from_product:
In [21]: df1 = df.set_index(['ID', 'DATE'])
In [22]: df1.reindex(pd.MultiIndex.from_product(df1.index.levels))
Out[22]:
VALUE
1 28-02-2006 5.0
30-06-2006 11.0
31-01-2006 5.0
31-02-2006 NaN
31-03-2006 NaN
31-04-2006 NaN
31-05-2006 10.0
2 28-02-2006 NaN
30-06-2006 NaN
31-01-2006 5.0
31-02-2006 5.0
31-03-2006 5.0
31-04-2006 5.0
31-05-2006 NaN
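The reindex approach can be sketched as a standalone script. The sample data and the names full_index and filled are hypothetical. Passing names= to from_product is an addition here, so that the index levels keep their ID and DATE labels after reset_index.

```python
import pandas as pd

# Hypothetical sample data, modelled on the question but with valid dates only.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2],
    "DATE": ["31-01-2006", "28-02-2006", "31-05-2006", "30-06-2006",
             "31-01-2006", "28-02-2006"],
    "VALUE": [5, 5, 10, 11, 5, 5],
})
df["DATE"] = pd.to_datetime(df["DATE"], format="%d-%m-%Y")

df1 = df.set_index(["ID", "DATE"])

# Cartesian product of every ID with every date that occurs in the data.
full_index = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)

# Rows absent from df1 are created with NaN in VALUE.
filled = df1.reindex(full_index).reset_index()
print(filled)
```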
Another solution is to convert the incomplete data to a "wide" form (a table; this will create cells for the missing values) and then back to a "tall" form.
df.set_index(['ID','DATE']).unstack().stack(dropna=False).reset_index()
# ID DATE VALUE
#0 1 28-02-2006 5.0
#1 1 30-06-2006 11.0
#2 1 31-01-2006 5.0
#3 1 31-02-2006 NaN
#4 1 31-03-2006 NaN
#5 1 31-04-2006 NaN
#6 1 31-05-2006 10.0
#7 2 28-02-2006 NaN
#....
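Note that all of the approaches above fill in the union of dates that appear anywhere in the data, so each ID also receives rows for dates that only other IDs have, and a month missing from every ID is never created. If the goal is a complete month-end sequence per ID, as in the desired output, one possible sketch is to reindex each group against its own date range. This assumes the dates are valid month-end dates parsed with pd.to_datetime; the sample data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, modelled on the question but with valid dates only.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "DATE": ["31-01-2006", "28-02-2006", "31-05-2006", "30-06-2006",
             "31-01-2006", "28-02-2006", "31-03-2006", "30-04-2006"],
    "VALUE": [5, 5, 10, 11, 5, 5, 5, 5],
})
df["DATE"] = pd.to_datetime(df["DATE"], format="%d-%m-%Y")

pieces = []
for id_value, group in df.groupby("ID"):
    # Every month end between this ID's first and last observation.
    dates = pd.date_range(group["DATE"].min(), group["DATE"].max(),
                          freq=pd.offsets.MonthEnd())
    # Reindex this ID's values against its own complete date range.
    piece = group.set_index("DATE")[["VALUE"]].reindex(dates)
    piece.insert(0, "ID", id_value)
    pieces.append(piece.rename_axis("DATE").reset_index())

filled = pd.concat(pieces, ignore_index=True)[["ID", "DATE", "VALUE"]]
print(filled)
```

With this sample, ID 1 gains NaN rows for March and April only, and ID 2 is left unchanged.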