How to fill missing values in a DataFrame using another DataFrame in Pandas
My df looks like this:
sprint sprint_created
------ -----------
S100 2020-01-01
S101 2020-01-10
NULL 2020-01-20
NULL 2020-01-31
S101 2020-01-10
...
In the above df, you can see that some of the sprint values are NULL.
I have another df2 that has sprint date ranges:
sprint sprint_start sprint_end
------ ----------- ----------
S100 2020-01-01 2020-01-09
S101 2020-01-10 2020-01-19
S102 2020-01-20 2020-01-29
S103 2020-01-30 2020-02-09
S104 2020-02-10 2020-02-19
...
How can I map these data and fill in the NULL values in df by comparing against the data in df2?
Please note that the shapes of df and df2 are different.
I assumed that the duplicated sprint rows in df (the first dataframe) can be dropped. Please advise if that is not the case. I use merge_asof with a one-day tolerance, based on my comparison of the two dfs you provided.
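import pandas as pd

# Assumes sprint_created and sprint_start are datetime64 and sorted ascending,
# which merge_asof requires. Both frames have a "sprint" column, so the merged
# result carries sprint_x (from df) and sprint_y (from df2).
merged = pd.merge_asof(
    df.drop_duplicates(keep='first'), df2,
    left_on="sprint_created", right_on="sprint_start",
    tolerance=pd.Timedelta("1D"))
df.assign(sprint=merged['sprint_y']).dropna()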
sprint sprint_created
0 S100 2020-01-01
1 S101 2020-01-10
2 S102 2020-01-20
3 S103 2020-01-31
If your frame legitimately contains repeated sprints, as explained in the comments above, please try:
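merged = pd.merge_asof(
    df.drop_duplicates(keep='first'), df2,
    left_on="sprint_created", right_on="sprint_start",
    tolerance=pd.Timedelta("1D"))
g = df.assign(sprint=merged['sprint_y'])
# Rows dropped as duplicates before the merge are NaN here; copy the sprint
# from the earlier row that has the same sprint_created.
g.loc[g.sprint.isna(), 'sprint'] = g.groupby('sprint_created').sprint.ffill()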
print(g)
sprint sprint_created
0 S100 2020-01-01
1 S101 2020-01-10
2 S102 2020-01-20
3 S103 2020-01-31
4 S101 2020-01-10
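The one-day tolerance above only matches a sprint_created that falls within a day of a sprint_start, so a date in the middle of a sprint would stay NULL. A minimal sketch of one way around that, assuming the ranges in df2 do not overlap: drop the tolerance, take the most recent sprint_start, and then check the date against sprint_end. The sample frames are rebuilt from the question, and the names left, right, matched, lookup and filled are only illustrative.

```python
import pandas as pd

# Sample data mirroring the question (NULL represented as None).
df = pd.DataFrame({
    "sprint": ["S100", "S101", None, None, "S101"],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31", "2020-01-10"]),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]),
    "sprint_end": pd.to_datetime(
        ["2020-01-09", "2020-01-19", "2020-01-29", "2020-02-09", "2020-02-19"]),
})

# merge_asof needs both frames sorted on their keys; reset_index keeps the
# original row labels in an "index" column so the order can be restored.
left = df.reset_index().sort_values("sprint_created")
right = df2.sort_values("sprint_start")

# For each row, pick the latest sprint whose start is on or before the date.
matched = pd.merge_asof(
    left, right,
    left_on="sprint_created", right_on="sprint_start",
    direction="backward", suffixes=("", "_range"),
)

# Discard matches where the date falls after that sprint's end.
in_range = matched["sprint_created"] <= matched["sprint_end"]
lookup = matched["sprint_range"].where(in_range)

# Fill only the missing values, then restore the original row order.
matched["sprint"] = matched["sprint"].fillna(lookup)
filled = matched.set_index("index").sort_index()[["sprint", "sprint_created"]]
print(filled)
```

Existing sprint values are left untouched because fillna only writes where sprint is missing, and no rows need to be dropped beforehand.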
One way would be to melt and resample your df2 and create a dictionary to map back to df1 (df in the question):
import pandas as pd

# Make sure the date columns are in datetime format
df1['sprint_created'] = pd.to_datetime(df1['sprint_created'])
df2['sprint_start'] = pd.to_datetime(df2['sprint_start'])
df2['sprint_end'] = pd.to_datetime(df2['sprint_end'])
# Melt the two date columns into one, then resample each sprint to daily
# frequency so every date between start and end gets a row
new = (df2.melt(id_vars='sprint').drop('variable', axis=1).set_index('value')
.groupby('sprint', group_keys=False).resample('D').ffill().reset_index())
# Create a dictionary of date -> sprint and map it back to df1
dct = dict(zip(new['value'], new['sprint']))
df1['sprint'] = df1['sprint_created'].map(dct)
# Or, to fill only the missing values:
# df1['sprint'] = df1['sprint'].fillna(df1['sprint_created'].map(dct))
df1
Out[1]:
sprint sprint_created
0 S100 2020-01-01
1 S101 2020-01-10
2 S102 2020-01-20
3 S103 2020-01-31
4 S101 2020-01-10
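As an alternative to expanding every range into daily rows, the same lookup can be sketched with a pandas IntervalIndex. This is only an illustration, not part of the answer above. It assumes the ranges in df2 do not overlap, rebuilds the sample frames from the question, and uses the illustrative names ranges, pos, labels and lookup.

```python
import pandas as pd

# Sample data mirroring the question (NULL represented as None).
df1 = pd.DataFrame({
    "sprint": ["S100", "S101", None, None, "S101"],
    "sprint_created": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-31", "2020-01-10"]),
})
df2 = pd.DataFrame({
    "sprint": ["S100", "S101", "S102", "S103", "S104"],
    "sprint_start": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-01-20", "2020-01-30", "2020-02-10"]),
    "sprint_end": pd.to_datetime(
        ["2020-01-09", "2020-01-19", "2020-01-29", "2020-02-09", "2020-02-19"]),
})

# Closed on both sides so a date equal to sprint_start or sprint_end matches.
ranges = pd.IntervalIndex.from_arrays(
    df2["sprint_start"], df2["sprint_end"], closed="both")

# get_indexer returns the position of the containing interval, or -1 if the
# date falls outside every range.
pos = ranges.get_indexer(df1["sprint_created"])
labels = df2["sprint"].to_numpy()

# labels[-1] would silently pick the last sprint, so mask unmatched rows.
lookup = pd.Series(labels[pos], index=df1.index).where(pos >= 0)

# Fill only the rows where sprint is missing.
df1["sprint"] = df1["sprint"].fillna(lookup)
print(df1)
```

This avoids building one row per day, which may matter when the ranges are long, and it leaves a value as NULL when the date is not covered by any range.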