
Fill MISSING values only in a dataframe (pandas)

What I have in a dataframe:

email    user_name    sessions    ymo
a@a.com    JD    1    2015-03-01
a@a.com    JD    2    2015-05-01

What I need:

email    user_name    sessions    ymo
a@a.com    JD    0    2015-01-01
a@a.com    JD    0    2015-02-01
a@a.com    JD    1    2015-03-01
a@a.com    JD    0    2015-04-01
a@a.com    JD    2    2015-05-01
a@a.com    JD    0    2015-06-01
a@a.com    JD    0    2015-07-01
a@a.com    JD    0    2015-08-01
a@a.com    JD    0    2015-09-01
a@a.com    JD    0    2015-10-01
a@a.com    JD    0    2015-11-01
a@a.com    JD    0    2015-12-01

The values in the ymo column are pd.Timestamp objects:

all_ymo

[Timestamp('2015-01-01 00:00:00'),
 Timestamp('2015-02-01 00:00:00'),
 Timestamp('2015-03-01 00:00:00'),
 Timestamp('2015-04-01 00:00:00'),
 Timestamp('2015-05-01 00:00:00'),
 Timestamp('2015-06-01 00:00:00'),
 Timestamp('2015-07-01 00:00:00'),
 Timestamp('2015-08-01 00:00:00'),
 Timestamp('2015-09-01 00:00:00'),
 Timestamp('2015-10-01 00:00:00'),
 Timestamp('2015-11-01 00:00:00'),
 Timestamp('2015-12-01 00:00:00')]

Unfortunately, the answer to "Adding values for missing data combinations in Pandas" does not work here, because it creates duplicates for existing ymo values.

I tried something like this, but it is extremely slow:

for em in all_emails:
    existent_ymo = fill_ymo[fill_ymo['email'] == em]['ymo']
    existent_ymo = set([pd.Timestamp(datetime.date(t.year, t.month, t.day)) for t in existent_ymo])
    missing_ymo = list(set(all_ymo) - existent_ymo)  # months this email does not have yet
    multi_ind = pd.MultiIndex.from_product([[em], missing_ymo], names=col_names)
    fill_ymo = sessions.set_index(col_names).reindex(multi_ind, fill_value=0).reset_index()

  • generate month-beginning dates and reindex
  • ffill and bfill columns ['email', 'user_name']
  • fillna(0) for column 'sessions'

# twelve month-start dates for 2015 ('MS' = month start)
mbeg = pd.date_range('2015-01-01', periods=12, freq='MS')

df1 = df.set_index('ymo').reindex(mbeg)

df1[['email', 'user_name']] = df1[['email', 'user_name']].ffill().bfill()
df1['sessions'] = df1['sessions'].fillna(0).astype(int)

df1
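
Since the screenshot of the result is not available, here is a minimal self-contained sketch of the same steps. The sample frame is rebuilt from the question's data, and the final `rename_axis`/`reset_index` step is an addition (not part of the answer above) that turns the index back into a `ymo` column.

```python
import pandas as pd

# Sample data mirroring the question: one user, two months present.
df = pd.DataFrame({
    'email': ['a@a.com', 'a@a.com'],
    'user_name': ['JD', 'JD'],
    'sessions': [1, 2],
    'ymo': pd.to_datetime(['2015-03-01', '2015-05-01']),
})

# Twelve month-start timestamps for 2015.
mbeg = pd.date_range('2015-01-01', periods=12, freq='MS')

# Reindex on the month column; months that were absent become NaN rows.
df1 = df.set_index('ymo').reindex(mbeg)

# Copy the identifying columns into the new rows, then zero-fill the counts.
df1[['email', 'user_name']] = df1[['email', 'user_name']].ffill().bfill()
df1['sessions'] = df1['sessions'].fillna(0).astype(int)

# Added step (assumption): restore 'ymo' as a regular column.
df1 = df1.rename_axis('ymo').reset_index()
print(df1)
```

Note that the ffill/bfill step only makes sense for a single user; with several users it would copy one user's email into another user's rows, which is why a per-group approach is needed in general.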


I tried to create a more general solution with periods:

print (df)
     email user_name  sessions        ymo
0  a@a.com        JD         1 2015-03-01
1  a@a.com        JD         2 2015-05-01
2  b@b.com        AB         1 2015-03-01
3  b@b.com        AB         2 2015-05-01


mbeg = pd.period_range('2015-01', periods=12, freq='M')
print (mbeg)
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04', '2015-05', '2015-06',
             '2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12'],
            dtype='int64', freq='M')
#convert column ymo to monthly periods
df.ymo = df.ymo.dt.to_period('M')
#group by user, reindex each group to all 12 months, fill missing rows with 0
#(the chain is wrapped in parentheses so it can span several lines)
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          .drop(['email','user_name'], axis=1))
        .rename_axis(('email','user_name','ymo'))
        .reset_index())
print (df)

      email user_name     ymo  sessions
0   a@a.com        JD 2015-01         0
1   a@a.com        JD 2015-02         0
2   a@a.com        JD 2015-03         1
3   a@a.com        JD 2015-04         0
4   a@a.com        JD 2015-05         2
5   a@a.com        JD 2015-06         0
6   a@a.com        JD 2015-07         0
7   a@a.com        JD 2015-08         0
8   a@a.com        JD 2015-09         0
9   a@a.com        JD 2015-10         0
10  a@a.com        JD 2015-11         0
11  a@a.com        JD 2015-12         0
12  b@b.com        AB 2015-01         0
13  b@b.com        AB 2015-02         0
14  b@b.com        AB 2015-03         1
15  b@b.com        AB 2015-04         0
16  b@b.com        AB 2015-05         2
17  b@b.com        AB 2015-06         0
18  b@b.com        AB 2015-07         0
19  b@b.com        AB 2015-08         0
20  b@b.com        AB 2015-09         0
21  b@b.com        AB 2015-10         0
22  b@b.com        AB 2015-11         0
23  b@b.com        AB 2015-12         0

Then, if you need datetimes, use to_timestamp:

df.ymo = df.ymo.dt.to_timestamp()
print (df)
      email user_name        ymo  sessions
0   a@a.com        JD 2015-01-01         0
1   a@a.com        JD 2015-02-01         0
2   a@a.com        JD 2015-03-01         1
3   a@a.com        JD 2015-04-01         0
4   a@a.com        JD 2015-05-01         2
5   a@a.com        JD 2015-06-01         0
6   a@a.com        JD 2015-07-01         0
7   a@a.com        JD 2015-08-01         0
8   a@a.com        JD 2015-09-01         0
9   a@a.com        JD 2015-10-01         0
10  a@a.com        JD 2015-11-01         0
11  a@a.com        JD 2015-12-01         0
12  b@b.com        AB 2015-01-01         0
13  b@b.com        AB 2015-02-01         0
14  b@b.com        AB 2015-03-01         1
15  b@b.com        AB 2015-04-01         0
16  b@b.com        AB 2015-05-01         2
17  b@b.com        AB 2015-06-01         0
18  b@b.com        AB 2015-07-01         0
19  b@b.com        AB 2015-08-01         0
20  b@b.com        AB 2015-09-01         0
21  b@b.com        AB 2015-10-01         0
22  b@b.com        AB 2015-11-01         0
23  b@b.com        AB 2015-12-01         0
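
As a small illustration of the period round trip used above, the sketch below converts timestamps to monthly periods and back. The mid-month date is a made-up value, included only to show that `to_period('M')` drops the day and `to_timestamp()` returns the start of the month by default.

```python
import pandas as pd

# Hypothetical input: the second date is mid-month on purpose.
s = pd.Series(pd.to_datetime(['2015-03-01', '2015-05-17']))

# Timestamps -> monthly periods (the day of month is discarded).
p = s.dt.to_period('M')

# Monthly periods -> timestamps at the start of each month.
t = p.dt.to_timestamp()

print(p.astype(str).tolist())
print(t.tolist())
```

This is one reason the period-based version can be more forgiving: rows whose dates are not exactly on the first of the month still line up with the monthly index.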

Solution with datetimes:

print (df)
     email user_name  sessions        ymo
0  a@a.com        JD         1 2015-03-01
1  a@a.com        JD         2 2015-05-01
2  b@b.com        AB         1 2015-03-01
3  b@b.com        AB         2 2015-05-01

# twelve month-start dates for 2015 ('MS' = month start)
mbeg = pd.date_range('2015-01-01', periods=12, freq='MS')

#group by user, reindex each group to all 12 months, fill missing rows with 0
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          .drop(['email','user_name'], axis=1))
        .rename_axis(('email','user_name','ymo'))
        .reset_index())
print (df)
      email user_name        ymo  sessions
0   a@a.com        JD 2015-01-01         0
1   a@a.com        JD 2015-02-01         0
2   a@a.com        JD 2015-03-01         1
3   a@a.com        JD 2015-04-01         0
4   a@a.com        JD 2015-05-01         2
5   a@a.com        JD 2015-06-01         0
6   a@a.com        JD 2015-07-01         0
7   a@a.com        JD 2015-08-01         0
8   a@a.com        JD 2015-09-01         0
9   a@a.com        JD 2015-10-01         0
10  a@a.com        JD 2015-11-01         0
11  a@a.com        JD 2015-12-01         0
12  b@b.com        AB 2015-01-01         0
13  b@b.com        AB 2015-02-01         0
14  b@b.com        AB 2015-03-01         1
15  b@b.com        AB 2015-04-01         0
16  b@b.com        AB 2015-05-01         2
17  b@b.com        AB 2015-06-01         0
18  b@b.com        AB 2015-07-01         0
19  b@b.com        AB 2015-08-01         0
20  b@b.com        AB 2015-09-01         0
21  b@b.com        AB 2015-10-01         0
22  b@b.com        AB 2015-11-01         0
23  b@b.com        AB 2015-12-01         0
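
If groupby with apply is too slow on many users, a possible alternative (not from the answers above) is to build every user/month combination with a cross join and then left-join the observed counts. This is only a sketch: it assumes pandas 1.2 or later for `how='cross'`, that each email maps to exactly one user_name, and that the input has at most one row per user and month. The names `users`, `full` and `out` are illustrative.

```python
import pandas as pd

# Sample data mirroring the two-user example above.
df = pd.DataFrame({
    'email': ['a@a.com', 'a@a.com', 'b@b.com', 'b@b.com'],
    'user_name': ['JD', 'JD', 'AB', 'AB'],
    'sessions': [1, 2, 1, 2],
    'ymo': pd.to_datetime(['2015-03-01', '2015-05-01',
                           '2015-03-01', '2015-05-01']),
})

# Twelve month-start timestamps for 2015.
mbeg = pd.date_range('2015-01-01', periods=12, freq='MS')

# One row per user (assumes one user_name per email).
users = df[['email', 'user_name']].drop_duplicates()

# Every user combined with every month.
full = users.merge(pd.DataFrame({'ymo': mbeg}), how='cross')

# Attach the observed counts; months without data get NaN, then 0.
out = full.merge(df, on=['email', 'user_name', 'ymo'], how='left')
out['sessions'] = out['sessions'].fillna(0).astype(int)
print(out)
```

Because existing months are matched by the join rather than appended, this should not create duplicate rows for ymo values that are already present.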
