Python: fill missing dates for each group
I have a DataFrame which looks like this:
import pandas as pd

x = pd.DataFrame({
    'user': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'rd': ['2016-01-01', '2016-01-01', '2016-02-01', '2016-02-01', '2016-02-01',
           '2016-05-01', '2016-05-01', '2016-06-01', '2016-06-01', '2016-06-01'],
    'fd': ['2016-02-01', '2016-04-01', '2016-03-01', '2016-04-01', '2016-05-01',
           '2016-06-01', '2016-07-01', '2016-08-01', '2016-07-01', '2016-09-01'],
    'val': [3, 4, 16, 7, 9, 2, 5, 11, 20, 1]})
x.head(6)
fd rd user val
0 2016-02-01 2016-01-01 a 3
1 2016-04-01 2016-01-01 a 4
2 2016-03-01 2016-02-01 a 16
3 2016-04-01 2016-02-01 a 7
4 2016-05-01 2016-02-01 a 9
5 2016-06-01 2016-05-01 b 2
x['rd'] = pd.to_datetime(x['rd'])
x['fd'] = pd.to_datetime(x['fd'])
For each rd date I would like to have the dates of the next 3 months. For instance:
rd = 2016-01-01
I would like to have:
fd = [2016-02-01, 2016-03-01, 2016-04-01]
Basically: for each rd date I want the next 3 months as fd dates. In my dataset I have missing dates both in rd (2016-03-01, 2016-04-01) and in fd once I have the rd date (rd = 2016-01-01, fd missing = 2016-03-01).
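The missing rd months can be checked directly. This is only a small sketch: it rebuilds the rd column from the data above, and the names expected and missing_rd are made up for the example.

```python
import pandas as pd

# The rd column from the question, parsed as dates
rd = pd.to_datetime(['2016-01-01', '2016-01-01', '2016-02-01', '2016-02-01', '2016-02-01',
                     '2016-05-01', '2016-05-01', '2016-06-01', '2016-06-01', '2016-06-01'])

# Every month start between the first and last rd date
expected = pd.date_range(rd.min(), rd.max(), freq='MS')

# Month starts that never occur as an rd date
missing_rd = expected.difference(rd)
print(missing_rd)
```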
Furthermore I have 2 different users: x['user'].unique() = ['a', 'b']. So I may have missing dates (both 'rd' and 'fd') in one user, in the other or in both.
What I would like to achieve is an efficient way to get a DataFrame with all dates for all users.
This question builds on an already answered question, but the problem here is a little more complex, since I'm not able to fit a MultiIndex to the problem at hand.
What I have done so far is create the two columns of dates:
from dateutil.relativedelta import relativedelta

# All month starts between the first and last rd date
index = pd.date_range(x['rd'].min(), x['rd'].max(), freq='MS')

def add_months(date):
    # The three months following `date`
    return [date + relativedelta(months=k) for k in (1, 2, 3)]

# Flatten the per-date lists into a single list of fd dates
fcs_dates = [fd for rd in index.tolist() for fd in add_months(rd)]

# Repeat each rd date three times so it lines up with fcs_dates
index3 = index.tolist() * 3
index3.sort()
So the output is:
list(zip(index3, fcs_dates))[:5]
[(Timestamp('2016-01-01 00:00:00', freq='MS'),
Timestamp('2016-02-01 00:00:00', freq='MS')),
(Timestamp('2016-01-01 00:00:00', freq='MS'),
Timestamp('2016-03-01 00:00:00', freq='MS')),
(Timestamp('2016-01-01 00:00:00', freq='MS'),
Timestamp('2016-04-01 00:00:00', freq='MS')),
(Timestamp('2016-02-01 00:00:00', freq='MS'),
Timestamp('2016-03-01 00:00:00', freq='MS')),
(Timestamp('2016-02-01 00:00:00', freq='MS'),
Timestamp('2016-04-01 00:00:00', freq='MS'))]
Unfortunately I have no clue how to plug this into a MultiIndex.
Thank you for your help.
I'm having a lot of trouble understanding your question, and I can't get index3 to work in Python 3.
Are you looking for something along these lines?
indx = pd.MultiIndex.from_product([['a', 'b'], [index], [pd.DatetimeIndex(fcs_dates)]])
If you're able to construct the levels you want in your MultiIndex, from_product takes their Cartesian product to create the index.
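To make the Cartesian product concrete, here is a minimal sketch with two small levels. The frame sparse and the names users, rd, idx and full are hypothetical, chosen only for illustration; they are not the data from the question.

```python
import pandas as pd

# Hypothetical levels, kept small for illustration
users = ['a', 'b']
rd = pd.date_range('2016-01-01', '2016-02-01', freq='MS')

# from_product pairs every user with every rd date: 2 x 2 = 4 entries
idx = pd.MultiIndex.from_product([users, rd], names=['user', 'rd'])

# A sparser frame that only has two of the four combinations
sparse = pd.DataFrame({
    'user': ['a', 'b'],
    'rd': pd.to_datetime(['2016-01-01', '2016-02-01']),
    'val': [3, 2],
}).set_index(['user', 'rd'])

# Reindexing against idx exposes the missing combinations as NaN
full = sparse.reindex(idx)
print(full)
```

Note that a plain product of the rd and fd levels would pair every rd with every fd, not only with the three months that follow it, so the rd/fd pairs would still have to be built or filtered separately.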
So, I solved my own question by doing a left join for each group (user), where the left DataFrame is the one constructed with dates.
pd.DataFrame with dates:
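left_df = pd.DataFrame({'rd': index3, 'fd': fcs_dates})
# The merge keys must share a dtype with x['rd'] and x['fd']. These casts assume
# those columns hold strings; skip them if they were converted with pd.to_datetime.
left_df['rd'] = left_df['rd'].astype(str)
left_df['fd'] = left_df['fd'].astype(str)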
DataFrame grouped by user:
df_gr = x.groupby(['user'])
list_gr = []
for i, gr in df_gr:
    # The left join keeps every (rd, fd) pair, with NaN where this user has no row
    gr_new = pd.merge(left_df, gr, on=['rd', 'fd'], how='left')
    list_gr.append(gr_new)
df_final = pd.concat(list_gr)
Final DataFrame:
fd rd user val
0 2016-02-01 2016-01-01 a 3.0
1 2016-03-01 2016-01-01 NaN NaN
2 2016-04-01 2016-01-01 a 4.0
3 2016-03-01 2016-02-01 a 16.0
4 2016-04-01 2016-02-01 a 7.0
5 2016-05-01 2016-02-01 a 9.0
6 2016-04-01 2016-03-01 NaN NaN
7 2016-05-01 2016-03-01 NaN NaN
8 2016-06-01 2016-03-01 NaN NaN
9 2016-05-01 2016-04-01 NaN NaN
10 2016-06-01 2016-04-01 NaN NaN
11 2016-07-01 2016-04-01 NaN NaN
12 2016-06-01 2016-05-01 NaN NaN
13 2016-07-01 2016-05-01 NaN NaN
14 2016-08-01 2016-05-01 NaN NaN
15 2016-07-01 2016-06-01 NaN NaN
16 2016-08-01 2016-06-01 NaN NaN
17 2016-09-01 2016-06-01 NaN NaN
0 2016-02-01 2016-01-01 NaN NaN
1 2016-03-01 2016-01-01 NaN NaN
2 2016-04-01 2016-01-01 NaN NaN
3 2016-03-01 2016-02-01 NaN NaN
4 2016-04-01 2016-02-01 NaN NaN
5 2016-05-01 2016-02-01 NaN NaN
6 2016-04-01 2016-03-01 NaN NaN
7 2016-05-01 2016-03-01 NaN NaN
8 2016-06-01 2016-03-01 NaN NaN
9 2016-05-01 2016-04-01 NaN NaN
10 2016-06-01 2016-04-01 NaN NaN
11 2016-07-01 2016-04-01 NaN NaN
12 2016-06-01 2016-05-01 b 2.0
13 2016-07-01 2016-05-01 b 5.0
14 2016-08-01 2016-05-01 NaN NaN
15 2016-07-01 2016-06-01 b 20.0
16 2016-08-01 2016-06-01 b 11.0
17 2016-09-01 2016-06-01 b 1.0
Unfortunately I don't think this is the quickest method, but I got what I wanted.
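A possible alternative to the per-user loop is to build the whole user / rd / fd grid once and align the data with a single merge. The following is a minimal sketch, not benchmarked. It assumes 'rd' and 'fd' are datetime columns, and the names grid, users, full and the temporary key column are made up for this example.

```python
import pandas as pd

x = pd.DataFrame({
    'user': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'rd': ['2016-01-01', '2016-01-01', '2016-02-01', '2016-02-01', '2016-02-01',
           '2016-05-01', '2016-05-01', '2016-06-01', '2016-06-01', '2016-06-01'],
    'fd': ['2016-02-01', '2016-04-01', '2016-03-01', '2016-04-01', '2016-05-01',
           '2016-06-01', '2016-07-01', '2016-08-01', '2016-07-01', '2016-09-01'],
    'val': [3, 4, 16, 7, 9, 2, 5, 11, 20, 1]})
x['rd'] = pd.to_datetime(x['rd'])
x['fd'] = pd.to_datetime(x['fd'])

# Every month start between the first and last rd date
rd = pd.date_range(x['rd'].min(), x['rd'].max(), freq='MS')

# One row per (rd, fd) pair, with fd = rd + 1, 2 and 3 months
grid = pd.concat(
    [pd.DataFrame({'rd': rd, 'fd': rd + pd.DateOffset(months=k)}) for k in (1, 2, 3)],
    ignore_index=True,
)

# Repeat the grid for every user via a constant join key (a cross join)
users = pd.DataFrame({'user': x['user'].unique()})
full = (
    users.assign(key=1)
         .merge(grid.assign(key=1), on='key')
         .drop(columns='key')
         # A single left join attaches the observed values; gaps become NaN
         .merge(x, on=['user', 'rd', 'fd'], how='left')
         .sort_values(['user', 'rd', 'fd'])
         .reset_index(drop=True)
)
print(full)
```

Unlike the output above, the user column is filled on every row here, because the user is part of the grid rather than coming only from the right-hand side of the join.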