How to fill missing values in DataFrame using another DataFrame in Pandas

My df looks like this:

sprint   sprint_created
------   -----------
S100     2020-01-01    
S101     2020-01-10
NULL     2020-01-20
NULL     2020-01-31
S101     2020-01-10
...

In the df above, you can see that some of the sprint values are NULL.

I have another DataFrame, df2, that holds the date range of each sprint:

sprint   sprint_start   sprint_end
------   -----------    ----------
S100     2020-01-01     2020-01-09    
S101     2020-01-10     2020-01-19  
S102     2020-01-20     2020-01-29  
S103     2020-01-30     2020-02-09  
S104     2020-02-10     2020-02-19  
...

How can I map these data and fill in the NULL values in df by looking up each sprint_created date in the ranges in df2?

Please note that df and df2 have different shapes.

I assumed that the duplicated sprint rows in df (the first DataFrame) can be dropped; please let me know if that is not the case. Comparing the two DataFrames you provided, I used merge_asof with a one-day tolerance.

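# df2 is the sprint date-range frame from the question.
# merge_asof needs datetime keys, sorted in ascending order, in both frames.
df.assign(
    sprint=pd.merge_asof(
        df.drop_duplicates(keep='first'),
        df2,
        left_on="sprint_created",
        right_on="sprint_start",
        tolerance=pd.Timedelta("1D"),
    )['sprint_y']
).dropna()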

  sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31

If your frame legitimately contains repeated sprints, as discussed in the comments above, please try:

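# Same lookup as before, but keep the rows that drop_duplicates removed.
g = df.assign(
    sprint=pd.merge_asof(
        df.drop_duplicates(keep='first'),
        df2,
        left_on="sprint_created",
        right_on="sprint_start",
        tolerance=pd.Timedelta("1D"),
    )['sprint_y']
)
# Duplicated rows are still NaN here; copy the sprint from the earlier row with the same date.
g.loc[g.sprint.isna(), 'sprint'] = g.groupby('sprint_created').sprint.ffill()
print(g)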



sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31
4   S101     2020-01-10
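
The two snippets above assume that df and df2 already exist with datetime columns. Below is a minimal, self-contained sketch of the same merge_asof idea, under a few assumptions that are not in the original answer: the sample frames are rebuilt by hand from the question, NULL is represented as NaN, and instead of a one-day tolerance the match is checked against sprint_end. The names left, matched, in_range, lookup and filled are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumption: NULL means NaN).
df = pd.DataFrame({
    "sprint": ["S100", "S101", np.nan, np.nan, "S101"],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31", "2020-01-10"]),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]),
    "sprint_end": pd.to_datetime(
        ["2020-01-09", "2020-01-19", "2020-01-29", "2020-02-09", "2020-02-19"]),
})

# merge_asof needs both frames sorted by their keys. Keep the original row
# label in an "index" column so the result can be realigned afterwards.
left = df.reset_index().sort_values("sprint_created")
matched = pd.merge_asof(
    left,
    df2.sort_values("sprint_start"),
    left_on="sprint_created",
    right_on="sprint_start",
    suffixes=("", "_lookup"),  # df2's sprint column becomes "sprint_lookup"
)

# Discard matches whose date falls after the matched sprint's end date.
in_range = matched["sprint_created"] <= matched["sprint_end"]
lookup = matched["sprint_lookup"].where(in_range)
lookup.index = matched["index"]  # realign with the original rows of df

# Fill only the missing values; existing sprint labels are left untouched.
filled = df.assign(sprint=df["sprint"].fillna(lookup))
print(filled)
```

Because the lookup is only used inside fillna, duplicated rows do not need to be dropped first, and a date that falls in a gap between sprints stays NaN instead of being matched to the previous sprint.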

One way would be to melt and resample your df2 and create a dictionary to map back to df1:

#make sure columns are in datetime format
df1['sprint_created'] = pd.to_datetime(df1['sprint_created'])
df2['sprint_start'] = pd.to_datetime(df2['sprint_start'])
df2['sprint_end'] = pd.to_datetime(df2['sprint_end'])

#melt dataframe of the two date columns and resample by group
new = (df2.melt(id_vars='sprint').drop('variable', axis=1).set_index('value')
          .groupby('sprint', group_keys=False).resample('D').ffill().reset_index())

#create dictionary of date and the sprint and map back to df1
dct = dict(zip(new['value'], new['sprint']))
df1['sprint'] = df1['sprint_created'].map(dct)
#or df1['sprint'] = df1['sprint'].fillna(df1['sprint_created'].map(dct))
df1
Out[1]: 
  sprint sprint_created
0   S100     2020-01-01
1   S101     2020-01-10
2   S102     2020-01-20
3   S103     2020-01-31
4   S101     2020-01-10
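
The date-to-sprint dictionary can also be built without melt and resample. The sketch below is one possible variant rather than the answer's own code: it expands each sprint with pd.date_range, and it assumes that the sprints do not overlap and that the dates carry no time-of-day component. The sample frames are rebuilt from the question, and date_to_sprint is an illustrative name.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumption: NULL means NaN).
df1 = pd.DataFrame({
    "sprint": ["S100", "S101", np.nan, np.nan, "S101"],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31", "2020-01-10"]),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]),
    "sprint_end": pd.to_datetime(
        ["2020-01-09", "2020-01-19", "2020-01-29", "2020-02-09", "2020-02-19"]),
})

# Map every calendar day of every sprint to that sprint's label.
date_to_sprint = {
    day: row.sprint
    for row in df2.itertuples(index=False)
    for day in pd.date_range(row.sprint_start, row.sprint_end, freq="D")
}

# Fill only the missing sprint values from the dictionary.
df1["sprint"] = df1["sprint"].fillna(df1["sprint_created"].map(date_to_sprint))
print(df1)
```

The dictionary holds one entry per day, so this stays cheap for a handful of short sprints but grows with the total number of days covered.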
