Group by and fill missing months in a multi-column data frame in Python
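For a data frame like the one below, how can I group by id and fill in the missing months, keeping price as NaN for the months that were missing? The expected date range is 2015/1/1 to 2019/8/1.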
city district id price date
0 hz xs 20101 1.5 2019/8/1
1 hz xs 20101 50.0 2019/7/1
2 hz xs 20101 2.0 2019/6/1
3 hz xs 20101 2.2 2019/5/1
4 hz sn 20101 2.2 2019/4/1
5 hz sn 20102 2.1 2018/10/1
6 hz sn 20102 70.0 2019/3/1
7 hz sn 20102 2.2 2019/2/1
8 hz sn 20102 nan 2019/1/1
9 hz sn 20102 2.0 2018/12/1
10 hz sn 20102 2.2 2018/11/1
11 xz pd 20103 2.9 2015/7/1
12 xz pd 20103 2.0 2015/8/1
13 xz pd 20103 2.5 2015/9/1
14 xz pd 20103 3.0 2015/10/1
15 xz pd 20103 35.0 2015/11/1
16 xz pd 20103 3.2 2015/12/1
17 xz pd 20103 3.1 2016/1/1
18 xz pd 20103 nan 2016/2/1
19 xz pd 20103 nan 2016/3/1
20 xz pd 20103 nan 2016/4/1
Edit:
In the real data, each combination of city, district, id and date has to be unique, so aggregate any duplicates first:
df = df.groupby(['city','district','id', 'date'], as_index=False)['price'].sum()
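One thing to watch with the aggregation above: by default, sum() returns 0 for a group whose prices are all NaN, which would turn a missing price into 0. The sketch below uses a few made-up rows (not the question's data) and assumes you want such groups to stay NaN; min_count=1 is the standard GroupBy.sum parameter for that.

```python
import numpy as np
import pandas as pd

# Made-up rows: one duplicated (city, district, id, date) key and one NaN-only key
df = pd.DataFrame({
    "city": ["hz", "hz", "hz"],
    "district": ["sn", "sn", "sn"],
    "id": [20102, 20102, 20102],
    "price": [np.nan, 2.0, 3.0],
    "date": ["2019/1/1", "2018/12/1", "2018/12/1"],
})

keys = ["city", "district", "id", "date"]

# Default behaviour: a group containing only NaN sums to 0.0
default_sum = df.groupby(keys, as_index=False)["price"].sum()

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN
nan_preserving = df.groupby(keys, as_index=False)["price"].sum(min_count=1)

print(default_sum)
print(nan_preserving)
```

In both results the duplicated 2018/12/1 rows collapse to a single row with price 5.0; only the second keeps NaN for 2019/1/1.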
If you need to group by the id column:
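import pandas as pd

# Month-start ('MS') dates covering the expected range
rng = pd.date_range('2015-01-01', '2019-08-01', freq='MS')
df['date'] = pd.to_datetime(df['date'])
df1 = (df.set_index('date')
         .groupby('id')
         # one row per month for each id; months not in the data become NaN
         .apply(lambda x: x.reindex(rng))
         .rename_axis(('id', 'date'))
         # the id column duplicates the index level; errors='ignore' covers
         # pandas versions that already leave it out of the groups
         .drop('id', axis=1, errors='ignore')
         .reset_index()
      )
print(df1)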
id date city district price
0 20101 2015-01-01 NaN NaN NaN
1 20101 2015-02-01 NaN NaN NaN
2 20101 2015-03-01 NaN NaN NaN
3 20101 2015-04-01 NaN NaN NaN
4 20101 2015-05-01 NaN NaN NaN
.. ... ... ... ... ...
163 20103 2019-04-01 NaN NaN NaN
164 20103 2019-05-01 NaN NaN NaN
165 20103 2019-06-01 NaN NaN NaN
166 20103 2019-07-01 NaN NaN NaN
167 20103 2019-08-01 NaN NaN NaN
[168 rows x 5 columns]
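As an alternative to groupby plus apply, the same id-by-month grid can be built in one step by reindexing against a full MultiIndex. This is only a sketch on a few made-up rows; it assumes each (id, date) pair is already unique, and the names full_index and filled are illustrative.

```python
import pandas as pd

# A few made-up rows in the same shape as the question's data
df = pd.DataFrame({
    "id": [20101, 20101, 20102],
    "price": [1.5, 50.0, 2.1],
    "date": ["2019/8/1", "2019/7/1", "2018/10/1"],
})
df["date"] = pd.to_datetime(df["date"])

rng = pd.date_range("2015-01-01", "2019-08-01", freq="MS")

# Every (id, month) combination in the expected range
full_index = pd.MultiIndex.from_product(
    [df["id"].unique(), rng], names=["id", "date"]
)

# Reindexing adds the missing combinations with NaN in the value columns
filled = df.set_index(["id", "date"]).reindex(full_index).reset_index()

print(filled.shape)
```

The range holds 56 month starts, so two ids give 112 rows, and id keeps its integer dtype because it never passes through a NaN-filled column.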
Alternatively, if you need to group by more columns:
rng = pd.date_range('2015-01-01', '2019-08-01', freq='MS')
df['date'] = pd.to_datetime(df['date'])
df2 = (df.set_index('date')
         .groupby(['city', 'district', 'id'])['price']
         # fill_value=0 writes 0 for the added months; omit it to keep NaN
         .apply(lambda x: x.reindex(rng, fill_value=0))
         .rename_axis(('city', 'district', 'id', 'date'))
         .reset_index()
      )
print(df2)
city district id date price
0 hz sn 20101 2015-01-01 0.0
1 hz sn 20101 2015-02-01 0.0
2 hz sn 20101 2015-03-01 0.0
3 hz sn 20101 2015-04-01 0.0
4 hz sn 20101 2015-05-01 0.0
.. ... ... ... ... ...
219 xz pd 20103 2019-04-01 0.0
220 xz pd 20103 2019-05-01 0.0
221 xz pd 20103 2019-06-01 0.0
222 xz pd 20103 2019-07-01 0.0
223 xz pd 20103 2019-08-01 0.0
[224 rows x 5 columns]
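Since the question asks for the added months to stay NaN, it may help to compare the two reindex variants side by side. The sketch below uses made-up rows and a hypothetical helper, fill_months; note that fill_value only applies to newly added months, so a NaN that was already in the data is left alone.

```python
import numpy as np
import pandas as pd

# Made-up rows; the last one has a NaN price that is already in the data
df = pd.DataFrame({
    "city": ["hz", "hz", "xz"],
    "district": ["xs", "xs", "pd"],
    "id": [20101, 20101, 20103],
    "price": [1.5, 50.0, np.nan],
    "date": pd.to_datetime(["2019-08-01", "2019-07-01", "2016-02-01"]),
})
rng = pd.date_range("2015-01-01", "2019-08-01", freq="MS")
keys = ["city", "district", "id"]


def fill_months(frame, **reindex_kwargs):
    """Hypothetical helper: reindex price to every month in rng per group."""
    return (frame.set_index("date")
                 .groupby(keys)["price"]
                 .apply(lambda s: s.reindex(rng, **reindex_kwargs))
                 .rename_axis(keys + ["date"])
                 .reset_index())


with_zero = fill_months(df, fill_value=0)  # added months become 0
with_nan = fill_months(df)                 # added months stay NaN

print(with_zero["price"].isna().sum(), with_nan["price"].isna().sum())
```

With fill_value=0 only the original NaN survives; without it, every month that was not in the data is NaN, which matches what the question asked for.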
Use reindex with the MS (month start) frequency, and combine pd.concat with GroupBy:
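dates = pd.date_range('2015-01-01', '2019-08-01', freq='MS')
# the date column has to be datetime for reindex to match the generated dates
df['date'] = pd.to_datetime(df['date'])
new = pd.concat([
    d.set_index('date').reindex(dates).reset_index().rename(columns={'index': 'date'})
    for _, d in df.groupby('id')
], ignore_index=True)
# note: this fills every NaN, including price, and can carry values across id boundaries
new = new.ffill().bfill()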
Output:
date city district id price
0 2015-01-01 hz sn 20101.0 2.2
1 2015-02-01 hz sn 20101.0 2.2
2 2015-03-01 hz sn 20101.0 2.2
3 2015-04-01 hz sn 20101.0 2.2
4 2015-05-01 hz sn 20101.0 2.2
.. ... ... ... ... ...
163 2019-04-01 xz pd 20103.0 3.1
164 2019-05-01 xz pd 20103.0 3.1
165 2019-06-01 xz pd 20103.0 3.1
166 2019-07-01 xz pd 20103.0 3.1
167 2019-08-01 xz pd 20103.0 3.1
[168 rows x 5 columns]
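The output above also has price filled in and id turned into a float. If price should stay NaN for the added months, as the question asks, one option is to fill only the descriptive columns and to do it inside each group. This is a sketch on made-up rows, assuming city and district can be copied from the nearest existing row of the same id.

```python
import pandas as pd

# Made-up rows in the same shape as the question's data
df = pd.DataFrame({
    "city": ["hz", "hz", "xz"],
    "district": ["xs", "xs", "pd"],
    "id": [20101, 20101, 20103],
    "price": [1.5, 50.0, 2.9],
    "date": pd.to_datetime(["2019-08-01", "2019-07-01", "2015-07-01"]),
})
dates = pd.date_range("2015-01-01", "2019-08-01", freq="MS")

pieces = []
for key, group in df.groupby("id"):
    piece = (group.set_index("date")
                  .reindex(dates)
                  .rename_axis("date")
                  .reset_index())
    # Restore the group key, which keeps id as an integer column
    piece["id"] = key
    # Fill only the descriptive columns, and only within this id
    piece[["city", "district"]] = piece[["city", "district"]].ffill().bfill()
    pieces.append(piece)

new = pd.concat(pieces, ignore_index=True)
print(new.head())
```

Because the fill happens per group, values from one id cannot leak into the next, and price is NaN everywhere except the months that were in the data.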