How to efficiently search between two dataframes in python pandas?
I have two dataframes (in pandas):
df1:
logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10
df2:
id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B
I want a new dataframe like new_df:
logged_at, item, value, id
2021-01-03 20:01:23, A, 4, 2
2021-01-03 20:01:24, A, 5, 2
2021-01-03 20:01:25, B, 4, 3
2021-01-03 20:01:26, B, 7, 3
2021-01-03 20:01:27, A, 10, 2
What I want is to attach the id from df2 as a column of df1.
The condition is that the logged_at time of df1 falls between the start_time and the end_time of df2.
df1 has more than 900,000 rows and df2 has more than 100,000 rows.
Matching each row of df1 individually takes too long.
Is there an efficient way?
A simple merge does what you want with your sample data.
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
new_df = df1.merge(df2.loc[:, ["id", "item"]], on="item")
logged_at item value id
2021-01-03 20:01:23 A 4 2
2021-01-03 20:01:24 A 5 2
2021-01-03 20:01:27 A 10 2
2021-01-03 20:01:25 B 4 3
2021-01-03 20:01:26 B 7 3
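If you also want the time condition enforced (not just matching on item), one pandas-only option is to merge on item and then filter with Series.between. This is a minimal sketch using the question's sample data; note that the intermediate merge materializes every (row, interval) pair per item before filtering, so at 900,000 × 100,000 rows memory may still be a concern if many intervals share an item.

```python
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True, parse_dates=["logged_at"])
df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True,
                  parse_dates=["start_time", "end_time"])

# Merge on item, then keep only rows whose timestamp falls inside the interval.
merged = df1.merge(df2, on="item")
in_range = merged["logged_at"].between(merged["start_time"], merged["end_time"])
new_df = merged.loc[in_range, ["logged_at", "item", "value", "id"]]
print(new_df)
```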
This does what you specified; however, your sample data in df2 looks wrong, as it gives two rows for each row in df1.
from pandasql import sqldf
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df1["logged_at"] = pd.to_datetime(df1["logged_at"])
df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
df2["start_time"] = pd.to_datetime(df2["start_time"])
df2["end_time"] = pd.to_datetime(df2["end_time"])
pysqldf = lambda q: sqldf(q, globals())
pysqldf("""
select df1.*, df2.*
from df1
left join df2 on df1.logged_at >= df2.start_time and df1.logged_at <= df2.end_time""")
logged_at item value id start_time end_time item
2021-01-03 20:01:23.000000 A 4 2 2021-01-03 20:01:00.000000 2021-01-03 20:05:33.000000 A
2021-01-03 20:01:23.000000 A 4 3 2021-01-03 20:01:11.000000 2021-01-03 21:44:12.000000 B
2021-01-03 20:01:24.000000 A 5 2 2021-01-03 20:01:00.000000 2021-01-03 20:05:33.000000 A
2021-01-03 20:01:24.000000 A 5 3 2021-01-03 20:01:11.000000 2021-01-03 21:44:12.000000 B
2021-01-03 20:01:25.000000 B 4 2 2021-01-03 20:01:00.000000 2021-01-03 20:05:33.000000 A
2021-01-03 20:01:25.000000 B 4 3 2021-01-03 20:01:11.000000 2021-01-03 21:44:12.000000 B
2021-01-03 20:01:26.000000 B 7 2 2021-01-03 20:01:00.000000 2021-01-03 20:05:33.000000 A
2021-01-03 20:01:26.000000 B 7 3 2021-01-03 20:01:11.000000 2021-01-03 21:44:12.000000 B
2021-01-03 20:01:27.000000 A 10 2 2021-01-03 20:01:00.000000 2021-01-03 20:05:33.000000 A
2021-01-03 20:01:27.000000 A 10 3 2021-01-03 20:01:11.000000 2021-01-03 21:44:12.000000 B
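If the intervals in df2 never overlap within a single item, a pd.IntervalIndex lookup avoids building any join product at all, which should scale better than a row-by-row loop. This is a sketch under that non-overlap assumption, not from either answer above, and attach_ids is a hypothetical helper name:

```python
import io
import pandas as pd

def attach_ids(df1, df2):
    """Attach df2's id to df1 where logged_at falls in [start_time, end_time].
    Assumes intervals do not overlap within any one item."""
    parts = []
    for item, g2 in df2.groupby("item"):
        g2 = g2.sort_values("start_time")
        intervals = pd.IntervalIndex.from_arrays(
            g2["start_time"], g2["end_time"], closed="both")
        g1 = df1[df1["item"] == item].copy()
        pos = intervals.get_indexer(g1["logged_at"])  # -1 where no interval matches
        keep = pos >= 0
        g1 = g1[keep]
        g1["id"] = g2["id"].to_numpy()[pos[keep]]
        parts.append(g1)
    return pd.concat(parts, ignore_index=True)

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True, parse_dates=["logged_at"])
df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True,
                  parse_dates=["start_time", "end_time"])

new_df = attach_ids(df1, df2)
print(new_df)
```

Grouping by item first keeps each IntervalIndex small, and get_indexer is a vectorized containment lookup, so no per-row Python loop over df1 is needed.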