Pandas dataframe how to merge 2 dfs based on timedelta?
I have two dataframes:
df1 =
a1  a2  recorded_at
1.  2.  2020-03-18 00:00:01
8.  1.  2021-04-15 04:00:10
9.  0.  2021-03-18 12:40:30
df2 =
b1  b2  DateTime
7.  8.  2020-03-18 00:00:01
2.  4.  2020-03-18 00:00:04
2.  6.  2021-04-15 04:00:12
4.  2.  2021-03-18 12:40:40
I want to merge them by comparing recorded_at to DateTime, taking all rows of df2 whose DateTime falls within 4 seconds after recorded_at. So I will get:
df_new =
a1  a2  recorded_at          DateTime             b1  b2
1.  2.  2020-03-18 00:00:01  2020-03-18 00:00:01  7   8
1.  2.  2020-03-18 00:00:01  2020-03-18 00:00:04  2   4
8.  1.  2021-04-15 04:00:10  2021-04-15 04:00:12  2   6
How can I do it? Thanks!
Initialize the dataframes
import pandas as pd

df1 = pd.DataFrame([
    [1.0, 2.0, "2020-03-18 00:00:01"],
    [8.0, 1.0, "2021-04-15 04:00:10"],
    [9.0, 0.0, "2021-03-18 12:40:30"],
], columns=["a1", "a2", "recorded_at"])
# Note: df2 reuses the column names a1, a2, recorded_at (instead of the
# question's b1, b2, DateTime), so the merge below adds _x / _y suffixes.
df2 = pd.DataFrame([
    [7.0, 8.0, "2020-03-18 00:00:01"],
    [2.0, 4.0, "2020-03-18 00:00:04"],
    [2.0, 6.0, "2021-04-15 04:00:12"],
    [4.0, 2.0, "2021-03-18 12:40:40"],
], columns=["a1", "a2", "recorded_at"])
Convert to pandas datetime
df1["recorded_at"] = pd.to_datetime(df1["recorded_at"])
df2["recorded_at"] = pd.to_datetime(df2["recorded_at"])
Merge the dataframes to create all row combinations (a cross join)
result = df1.merge(df2, how="cross")
Find the time delta
result["diff"] = abs(result["recorded_at_x"] - result["recorded_at_y"])
Extract the result
from datetime import timedelta
result[result["diff"] < timedelta(seconds=4)]
Result:
a1_x a2_x recorded_at_x a1_y a2_y recorded_at_y diff
0 1.0 2.0 2020-03-18 00:00:01 7.0 8.0 2020-03-18 00:00:01 0 days 00:00:00
1 1.0 2.0 2020-03-18 00:00:01 2.0 4.0 2020-03-18 00:00:04 0 days 00:00:03
6 8.0 1.0 2021-04-15 04:00:10 2.0 6.0 2021-04-15 04:00:12 0 days 00:00:02
It works for the sample input, but you may need a better strategy if your data is huge.
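One caveat about the filter above: it uses abs(), so it would also keep df2 rows that fall up to 4 seconds before recorded_at, while the question asks only for rows after. A minimal sketch of a one-sided version follows, using the question's column names (b1, b2, DateTime) rather than this answer's. It assumes pandas 1.2 or later for how="cross", and it keeps the strict `<` bound used above; switch to `<=` if a gap of exactly 4 seconds should count.

```python
import pandas as pd

df1 = pd.DataFrame({
    "a1": [1.0, 8.0, 9.0],
    "a2": [2.0, 1.0, 0.0],
    "recorded_at": pd.to_datetime([
        "2020-03-18 00:00:01", "2021-04-15 04:00:10", "2021-03-18 12:40:30",
    ]),
})
df2 = pd.DataFrame({
    "b1": [7.0, 2.0, 2.0, 4.0],
    "b2": [8.0, 4.0, 6.0, 2.0],
    "DateTime": pd.to_datetime([
        "2020-03-18 00:00:01", "2020-03-18 00:00:04",
        "2021-04-15 04:00:12", "2021-03-18 12:40:40",
    ]),
})

# Every df1 row paired with every df2 row (len(df1) * len(df2) rows).
pairs = df1.merge(df2, how="cross")

# Signed difference: positive when the df2 timestamp is later.
delta = pairs["DateTime"] - pairs["recorded_at"]

# Keep only df2 rows at or after recorded_at and less than 4 seconds later.
window = pd.Timedelta(seconds=4)
mask = (delta >= pd.Timedelta(0)) & (delta < window)
df_new = pairs.loc[mask].reset_index(drop=True)

print(df_new)
```

On the sample data this keeps the same three pairs as the abs() version, because no df2 row happens to precede its df1 row by under 4 seconds.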
If you don't expect more than one row of df1 to match a single row of df2, then an efficient solution would be merge_asof. Otherwise, the merge computation will be quadratic, and so greatly dependent on the size of each input:
df1['recorded_at'] = pd.to_datetime(df1['recorded_at'])
df2['DateTime'] = pd.to_datetime(df2['DateTime'])
out = (
    pd.merge_asof(
        df2.sort_values(by='DateTime'),
        df1.sort_values(by='recorded_at'),
        left_on='DateTime', right_on='recorded_at',
        direction='backward', tolerance=pd.Timedelta('4s'),
    )
    .dropna(subset=['recorded_at'])
)
Output:
b1 b2 DateTime a1 a2 recorded_at
0 7.0 8.0 2020-03-18 00:00:01 1.0 2.0 2020-03-18 00:00:01
1 2.0 4.0 2020-03-18 00:00:04 1.0 2.0 2020-03-18 00:00:01
3 2.0 6.0 2021-04-15 04:00:12 8.0 1.0 2021-04-15 04:00:10
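If several df1 rows can match the same df2 row, merge_asof keeps only the nearest one, and a full cross merge is quadratic. One possible middle ground, sketched below, is to sort df2 once and use NumPy's searchsorted to find, for each df1 row, the contiguous slice of df2 rows inside the window. The helper name merge_within is hypothetical (it is not a pandas function), the window is treated as inclusive on both ends, and the sketch assumes no missing timestamps and no overlapping column names.

```python
import numpy as np
import pandas as pd

def merge_within(left, right, left_on, right_on, window):
    """Pair each `left` row with every `right` row whose timestamp lies in
    [left time, left time + window]. Hypothetical helper, not a pandas API."""
    right_sorted = right.sort_values(right_on).reset_index(drop=True)
    right_times = right_sorted[right_on].to_numpy()
    left_times = left[left_on].to_numpy()

    # In sorted order, the matches for one left row form a contiguous slice.
    start = np.searchsorted(right_times, left_times, side="left")
    stop = np.searchsorted(
        right_times, left_times + window.to_timedelta64(), side="right"
    )
    counts = stop - start

    # Expand the slices into explicit (left position, right position) pairs.
    left_pos = np.repeat(np.arange(len(left)), counts)
    first_of_group = np.repeat(np.cumsum(counts) - counts, counts)
    right_pos = np.repeat(start, counts) + (np.arange(counts.sum()) - first_of_group)

    return pd.concat(
        [left.iloc[left_pos].reset_index(drop=True),
         right_sorted.iloc[right_pos].reset_index(drop=True)],
        axis=1,
    )

df1 = pd.DataFrame({
    "a1": [1.0, 8.0, 9.0],
    "a2": [2.0, 1.0, 0.0],
    "recorded_at": pd.to_datetime([
        "2020-03-18 00:00:01", "2021-04-15 04:00:10", "2021-03-18 12:40:30",
    ]),
})
df2 = pd.DataFrame({
    "b1": [7.0, 2.0, 2.0, 4.0],
    "b2": [8.0, 4.0, 6.0, 2.0],
    "DateTime": pd.to_datetime([
        "2020-03-18 00:00:01", "2020-03-18 00:00:04",
        "2021-04-15 04:00:12", "2021-03-18 12:40:40",
    ]),
})

df_new = merge_within(df1, df2, "recorded_at", "DateTime", pd.Timedelta(seconds=4))
print(df_new)
```

The work here is one sort plus two binary searches per df1 row, and the output size is proportional to the number of actual matches rather than to len(df1) * len(df2).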