Pandas filling missing dates and values within group
I have a data frame that looks like this:
x = pd.DataFrame({'user': ['a','a','b','b'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06'], 'val': [1,33,2,1]})
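What I want to do is find the minimum and maximum date in the dt column and expand that column so that every user has every date in that range, filling in 0 for the val column. So the desired output is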
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-03 a 0
3 2016-01-04 a 0
4 2016-01-05 a 0
5 2016-01-06 a 0
6 2016-01-01 b 0
7 2016-01-02 b 0
8 2016-01-03 b 0
9 2016-01-04 b 0
10 2016-01-05 b 2
11 2016-01-06 b 1
Initial DataFrame:
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-05 b 2
3 2016-01-06 b 1
First, convert the dates to datetime:
x['dt'] = pd.to_datetime(x['dt'])
Then, generate the full range of dates and the unique users:
dates = x.set_index('dt').resample('D').asfreq().index
>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06'],
dtype='datetime64[ns]', name='dt', freq='D')
users = x['user'].unique()
>> array(['a', 'b'], dtype=object)
This allows you to create a MultiIndex:
idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])
>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['dt', 'user'])
You can use it to reindex your DataFrame:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out:
dt user val
0 2016-01-01 a 1
1 2016-01-01 b 0
2 2016-01-02 a 33
3 2016-01-02 b 0
4 2016-01-03 a 0
5 2016-01-03 b 0
6 2016-01-04 a 0
7 2016-01-04 b 0
8 2016-01-05 a 0
9 2016-01-05 b 2
10 2016-01-06 a 0
11 2016-01-06 b 1
Which can then be sorted by user:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by=['user', 'dt'])
Out:
dt user val
0 2016-01-01 a 1
2 2016-01-02 a 33
4 2016-01-03 a 0
6 2016-01-04 a 0
8 2016-01-05 a 0
10 2016-01-06 a 0
1 2016-01-01 b 0
3 2016-01-02 b 0
5 2016-01-03 b 0
7 2016-01-04 b 0
9 2016-01-05 b 2
11 2016-01-06 b 1
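The separate sort can likely be avoided by putting the users first when building the product, since `MultiIndex.from_product` varies the last level fastest. The following is a minimal sketch under two assumptions that are not part of the answer above: it builds the dates with `pd.date_range` from the overall minimum and maximum instead of `resample`, and it assumes each (user, dt) pair occurs at most once, because `reindex` does not accept a duplicated index.

```python
import pandas as pd

x = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                  'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
                  'val': [1, 33, 2, 1]})
x['dt'] = pd.to_datetime(x['dt'])

# Every calendar day between the overall min and max date (assumption:
# date_range is used here in place of the resample call from the answer)
dates = pd.date_range(x['dt'].min(), x['dt'].max(), freq='D')
users = x['user'].unique()

# Users first, dates second: the product is already grouped by user
idx = pd.MultiIndex.from_product([users, dates], names=['user', 'dt'])

out = (
    x.set_index(['user', 'dt'])
     .reindex(idx, fill_value=0)   # missing (user, dt) pairs get val = 0
     .reset_index()
)

# Reorder the columns to match the layout in the question
out = out[['dt', 'user', 'val']]
print(out)
```

Because the rows come out in index order, the result also carries a clean 0..11 index without a further `reset_index(drop=True)`.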
As @ayhan suggested:
x.dt = pd.to_datetime(x.dt)
A one-liner that mostly uses @ayhan's ideas while incorporating stack / unstack and fill_value:
x.set_index(
['dt', 'user']
).unstack(
fill_value=0
).asfreq(
'D', fill_value=0
).stack().sort_index(level=1).reset_index()
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-03 a 0
3 2016-01-04 a 0
4 2016-01-05 a 0
5 2016-01-06 a 0
6 2016-01-01 b 0
7 2016-01-02 b 0
8 2016-01-03 b 0
9 2016-01-04 b 0
10 2016-01-05 b 2
11 2016-01-06 b 1
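To see what each link of that chain contributes, it may help to split it into named steps. This is only an unrolled sketch of the same calls; the intermediate names `wide` and `daily` are hypothetical and chosen for illustration.

```python
import pandas as pd

x = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                  'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
                  'val': [1, 33, 2, 1]})
x['dt'] = pd.to_datetime(x['dt'])

# 1. One row per existing date, one column per user;
#    (date, user) pairs that are absent become 0
wide = x.set_index(['dt', 'user']).unstack(fill_value=0)

# 2. Conform the date index to daily frequency;
#    the calendar days that were missing are added as all-zero rows
daily = wide.asfreq('D', fill_value=0)

# 3. Move the user level back into the rows, order by user and then
#    by date, and turn the index back into ordinary columns
out = daily.stack().sort_index(level=1).reset_index()

print(wide.shape, daily.shape)
print(out)
```

Here `wide` has four rows (the dates present in the data) and `daily` has six (every day from 2016-01-01 to 2016-01-06), which is where the extra rows in the final result come from.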
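An old question with excellent answers already; this is an alternative that uses the complete function from pyjanitor, which can help abstract the generation of explicitly missing rows: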
#pip install pyjanitor
import pandas as pd
import janitor as jn
x['dt'] = pd.to_datetime(x['dt'])
# generate complete list of dates
dates = dict(dt = pd.date_range(x.dt.min(), x.dt.max(), freq='1D'))
# build the new dataframe, and fill nulls with 0
x.complete('user', dates, fill_value = 0)
user dt val
0 a 2016-01-01 1
1 a 2016-01-02 33
2 a 2016-01-03 0
3 a 2016-01-04 0
4 a 2016-01-05 0
5 a 2016-01-06 0
6 b 2016-01-01 0
7 b 2016-01-02 0
8 b 2016-01-03 0
9 b 2016-01-04 0
10 b 2016-01-05 2
11 b 2016-01-06 1
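If adding a dependency is not an option, a similar result can probably be reached in plain pandas by building the full user/date grid with a cross join and left-merging the data onto it. This is a sketch of a different technique, not a description of how pyjanitor works internally; it assumes pandas 1.2 or later for `how='cross'`, and the names `grid` and `out` are illustrative.

```python
import pandas as pd

x = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                  'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
                  'val': [1, 33, 2, 1]})
x['dt'] = pd.to_datetime(x['dt'])

# One row per distinct user, and one row per calendar day in the overall range
users = x[['user']].drop_duplicates()
dates = pd.DataFrame({'dt': pd.date_range(x['dt'].min(), x['dt'].max(), freq='D')})

# Cross join: every user paired with every date
grid = users.merge(dates, how='cross')

# Attach the known values; pairs without a match get NaN, which becomes 0.
# The left merge turns val into float, so cast it back to integer.
out = (
    grid.merge(x, on=['user', 'dt'], how='left')
        .fillna({'val': 0})
        .astype({'val': 'int64'})
)
print(out)
```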