Merge two dataframes when the start and end dates of df2 fall within the start and end dates of df1 in Python (pandas)
I have two dataframes, df1 and df2.
df1=
id start end
a 1/12/2022 18/12/2022
a 19/12/2022 25/12/2022
a 26/12/2022 31/12/2022
b 01/12/2022 20/12/2022
b 21/12/2022 31/12/2022
c 01/12/2022 31/12/2022
d 01/12/2022 15/12/2022
d 16/12/2022 31/12/2022
and the second dataframe is
df2
id start_2 end_2 number
a 15/12/2022 15/12/2022 1
b 17/12/2022 19/12/2022 3
b 25/12/2022 27/12/2022 2
c 12/12/2022 12/12/2022 1
d 03/12/2022 04/12/2022 2
d 25/12/2022 25/12/2022 1
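I want to left-join the dataframes (df1 and df2) on id and place the "number" column into the df1 row whose date range (start to end) contains the df2 dates. For example, id 'a' has number 1 in df2, so it should appear in the first row for 'a' (1/12/2022 to 18/12/2022) and not in the other slots. The other slots should be zero, as shown below.
Result df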
id start end number
a 1/12/2022 18/12/2022 1
a 19/12/2022 25/12/2022 0
a 26/12/2022 31/12/2022 0
b 01/12/2022 20/12/2022 3
b 21/12/2022 31/12/2022 2
c 01/12/2022 31/12/2022 1
d 01/12/2022 15/12/2022 2
d 16/12/2022 31/12/2022 1
Note: if two numbers fall into the same slot of df1, they should be summed (a groupby sum).
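To try the answers below, the two frames first have to exist as pandas objects. A minimal sketch that rebuilds them from the tables above, assuming the dates are kept as day-first strings until an answer converts them:

```python
import pandas as pd

# Sample data copied from the question; dates are day-first strings.
df1 = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "c", "d", "d"],
    "start": ["1/12/2022", "19/12/2022", "26/12/2022", "01/12/2022",
              "21/12/2022", "01/12/2022", "01/12/2022", "16/12/2022"],
    "end": ["18/12/2022", "25/12/2022", "31/12/2022", "20/12/2022",
            "31/12/2022", "31/12/2022", "15/12/2022", "31/12/2022"],
})
df2 = pd.DataFrame({
    "id": ["a", "b", "b", "c", "d", "d"],
    "start_2": ["15/12/2022", "17/12/2022", "25/12/2022", "12/12/2022",
                "03/12/2022", "25/12/2022"],
    "end_2": ["15/12/2022", "19/12/2022", "27/12/2022", "12/12/2022",
              "04/12/2022", "25/12/2022"],
    "number": [1, 3, 2, 1, 2, 1],
})

print(df1)
print(df2)
```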
Here is one approach: after merging, build the start and end conditions, then use .loc and groupby.
import pandas as pd

# Parse the day-first date strings into datetimes
df1["start"] = pd.to_datetime(df1["start"], dayfirst=True)
df1["end"] = pd.to_datetime(df1["end"], dayfirst=True)
df2["start_2"] = pd.to_datetime(df2["start_2"], dayfirst=True)
df2["end_2"] = pd.to_datetime(df2["end_2"], dayfirst=True)

# Pair every df1 slot with every df2 row that has the same id
merged_df = pd.merge(df1, df2, on="id", how="left")
merged_df["number_adj"] = 0

# Keep "number" only where start_2 or end_2 falls inside the slot (bounds inclusive)
start_condition = (merged_df["start_2"] >= merged_df["start"]) & (merged_df["start_2"] <= merged_df["end"])
end_condition = (merged_df["end_2"] >= merged_df["start"]) & (merged_df["end_2"] <= merged_df["end"])
merged_df.loc[start_condition | end_condition, "number_adj"] = merged_df["number"]

# Sum only the adjusted column; recent pandas versions raise a TypeError
# when asked to sum the datetime columns start_2 and end_2
merged_df = merged_df.groupby(["id", "start", "end"], as_index=False)["number_adj"].sum()
merged_df = merged_df.rename(columns={"number_adj": "number"})
print(merged_df)
Output:
id start end number
0 a 2022-12-01 2022-12-18 1
1 a 2022-12-19 2022-12-25 0
2 a 2022-12-26 2022-12-31 0
3 b 2022-12-01 2022-12-20 3
4 b 2022-12-21 2022-12-31 2
5 c 2022-12-01 2022-12-31 1
6 d 2022-12-01 2022-12-15 2
7 d 2022-12-16 2022-12-31 1
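The same idea can be wrapped in a function with a single overlap test: two closed date ranges overlap when each one starts no later than the other ends. This is only a sketch under assumptions. The names `allocate_numbers`, `slots` and `events` are hypothetical, and an event that spans two slots would be counted in both, which the question does not specify.

```python
import pandas as pd

def allocate_numbers(df1, df2):
    """Sum df2['number'] into every df1 slot that the df2 date range overlaps."""
    left = df1.copy()
    right = df2.copy()
    for col in ("start", "end"):
        left[col] = pd.to_datetime(left[col], dayfirst=True)
    for col in ("start_2", "end_2"):
        right[col] = pd.to_datetime(right[col], dayfirst=True)

    merged = left.merge(right, on="id", how="left")
    # Closed intervals overlap when each starts no later than the other ends.
    overlaps = (merged["start_2"] <= merged["end"]) & (merged["end_2"] >= merged["start"])
    # Rows without an overlap (or without any df2 match) contribute 0.
    merged["number"] = merged["number"].where(overlaps, 0)
    return merged.groupby(["id", "start", "end"], as_index=False)["number"].sum()

# Hypothetical subset of the question's data (id 'b' only).
slots = pd.DataFrame({
    "id": ["b", "b"],
    "start": ["01/12/2022", "21/12/2022"],
    "end": ["20/12/2022", "31/12/2022"],
})
events = pd.DataFrame({
    "id": ["b", "b"],
    "start_2": ["17/12/2022", "25/12/2022"],
    "end_2": ["19/12/2022", "27/12/2022"],
    "number": [3, 2],
})

result = allocate_numbers(slots, events)
print(result)
```

The function works on copies, so the caller's frames keep their original string dates.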
You can use concat and groupby together with the size() method.
df = pd.concat([df1, df2])
df.groupby(["start", "end"]).size()
You can merge on id and then filter the rows:
import pandas as pd

# Convert the date columns to datetime if necessary
df1['start'] = pd.to_datetime(df1['start'], dayfirst=True)
df1['end'] = pd.to_datetime(df1['end'], dayfirst=True)
df2['start_2'] = pd.to_datetime(df2['start_2'], dayfirst=True)
df2['end_2'] = pd.to_datetime(df2['end_2'], dayfirst=True)
# Merge on id; reset_index keeps the original df1 index as an 'index' column
out = df1.reset_index().merge(df2, on='id', how='left')
# Check that the df2 interval lies inside the df1 slot
# (inclusive bounds, so dates on a slot boundary still count)
out['indicator'] = (out['start'] <= out['start_2']) & (out['end_2'] <= out['end'])
# Keep one row per df1 slot (the first match, if any) and set the other slots to 0
out = out.loc[out.groupby('index')['indicator'].idxmax()]
out.loc[~out['indicator'], 'number'] = 0
# Get the final dataframe
out = out[df1.columns.tolist() + ['number']].set_index(df1.index)
Output:
>>> out
id start end number
0 a 2022-12-01 2022-12-18 1
1 a 2022-12-19 2022-12-25 0
2 a 2022-12-26 2022-12-31 0
3 b 2022-12-01 2022-12-20 3
4 b 2022-12-21 2022-12-31 2
5 c 2022-12-01 2022-12-31 1
6 d 2022-12-01 2022-12-15 2
7 d 2022-12-16 2022-12-31 1
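Whether a containment check uses strict or inclusive comparisons decides how events that fall exactly on a slot boundary are treated. A small sketch with a hypothetical one-row frame (the dates are made up for illustration and are not taken from the question):

```python
import pandas as pd

# Hypothetical example: the event starts on the first day of the slot.
row = pd.DataFrame({
    "start": pd.to_datetime(["01/12/2022"], dayfirst=True),
    "end": pd.to_datetime(["15/12/2022"], dayfirst=True),
    "start_2": pd.to_datetime(["01/12/2022"], dayfirst=True),
    "end_2": pd.to_datetime(["04/12/2022"], dayfirst=True),
})

# Strict bounds reject an event that touches the slot boundary.
strict = (row["start"] < row["start_2"]) & (row["end_2"] < row["end"])
# Inclusive bounds accept it.
inclusive = (row["start"] <= row["start_2"]) & (row["end_2"] <= row["end"])

print(strict.iloc[0])
print(inclusive.iloc[0])
```

With inclusive bounds the event is assigned to the slot; with strict bounds the slot would end up with 0.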