
How to efficiently search between two dataframes in python pandas?

I have two dataframes (in pandas).

df1:

logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10

df2:

id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B

I want a new dataframe like new_df:

logged_at, item, value, id
2021-01-03 20:01:23, A, 4, 2
2021-01-03 20:01:24, A, 5, 2
2021-01-03 20:01:25, B, 4, 3
2021-01-03 20:01:26, B, 7, 3
2021-01-03 20:01:27, A, 10, 2

What I want is to attach the id from df2 to df1 as a new column.

The condition is that the logged_at time of a df1 row falls between the start_time and the end_time of a df2 row.

df1 has more than 900,000 rows and df2 has more than 100,000 rows.

Attaching the id to df1 row by row takes too long.

Is there an efficient way?

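A simple merge on item does what you want with your sample data. Note that it matches on item only and does not compare logged_at with start_time and end_time.

import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""logged_at, item, value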
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)

# attach the id from df2 to every df1 row that has the same item
new_df = df1.merge(df2.loc[:,["id","item"]], on="item")

output

           logged_at item  value  id
 2021-01-03 20:01:23    A      4   2
 2021-01-03 20:01:24    A      5   2
 2021-01-03 20:01:27    A     10   2
 2021-01-03 20:01:25    B      4   3
 2021-01-03 20:01:26    B      7   3
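
If the time window also has to be enforced, one possible extension of the merge above is to keep start_time and end_time in the join and filter afterwards. This is a minimal sketch rather than the answer author's code: the names merged and in_window are illustrative, and it assumes each item has few enough intervals in df2 that the intermediate merge fits in memory.

```python
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df1["logged_at"] = pd.to_datetime(df1["logged_at"])

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
df2["start_time"] = pd.to_datetime(df2["start_time"])
df2["end_time"] = pd.to_datetime(df2["end_time"])

# pair every df1 row with every df2 interval of the same item
merged = df1.merge(df2, on="item")

# keep only the pairs whose logged_at lies inside the interval (inclusive)
in_window = (merged["logged_at"] >= merged["start_time"]) & (
    merged["logged_at"] <= merged["end_time"]
)
new_df = (
    merged.loc[in_window, ["logged_at", "item", "value", "id"]]
    .sort_values("logged_at")
    .reset_index(drop=True)
)
print(new_df)
```

Because this is an inner merge followed by a filter, df1 rows that fall outside every interval are dropped from new_df.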

pandasql

This does what you specified. However, your sample data in df2 looks wrong, as it gives two rows for each row in df1: both time ranges in df2 cover every logged_at value, and the join condition compares times only, not item.

from pandasql import sqldf
import pandas as pd
import io

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df1["logged_at"] = pd.to_datetime(df1["logged_at"])

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
df2["start_time"] = pd.to_datetime(df2["start_time"])
df2["end_time"] = pd.to_datetime(df2["end_time"])

pysqldf = lambda q: sqldf(q, globals())
pysqldf("""
select df1.*, df2.*
from df1 
left join df2 on df1.logged_at >= df2.start_time and df1.logged_at <= df2.end_time""")

pandasql output

                 logged_at item  value  id                  start_time                    end_time item
 2021-01-03 20:01:23.000000    A      4   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:23.000000    A      4   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:24.000000    A      5   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:24.000000    A      5   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:25.000000    B      4   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:25.000000    B      4   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:26.000000    B      7   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:26.000000    B      7   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
 2021-01-03 20:01:27.000000    A     10   2  2021-01-03 20:01:00.000000  2021-01-03 20:05:33.000000    A
 2021-01-03 20:01:27.000000    A     10   3  2021-01-03 20:01:11.000000  2021-01-03 21:44:12.000000    B
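
For the sizes mentioned in the question (more than 900,000 and 100,000 rows), an alternative worth trying is pd.merge_asof, which avoids building every row pair. The sketch below is not from the answers above. It assumes that the intervals of a given item in df2 do not overlap, so the most recent start_time at or before logged_at identifies the only candidate interval. The names left, right and matched are illustrative.

```python
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""logged_at, item, value
2021-01-03 20:01:23, A, 4
2021-01-03 20:01:24, A, 5
2021-01-03 20:01:25, B, 4
2021-01-03 20:01:26, B, 7
2021-01-03 20:01:27, A, 10"""), skipinitialspace=True)
df1["logged_at"] = pd.to_datetime(df1["logged_at"])

df2 = pd.read_csv(io.StringIO("""id, start_time, end_time, item
2, 2021-01-03 20:01:00, 2021-01-03 20:05:33, A
3, 2021-01-03 20:01:11, 2021-01-03 21:44:12, B"""), skipinitialspace=True)
df2["start_time"] = pd.to_datetime(df2["start_time"])
df2["end_time"] = pd.to_datetime(df2["end_time"])

# merge_asof requires both frames to be sorted by their time key
left = df1.sort_values("logged_at")
right = df2.sort_values("start_time")

# for each df1 row, take the latest interval of the same item
# whose start_time is at or before logged_at
matched = pd.merge_asof(
    left,
    right,
    left_on="logged_at",
    right_on="start_time",
    by="item",
    direction="backward",
)

# blank out the id when that interval had already ended
matched["id"] = matched["id"].where(matched["logged_at"] <= matched["end_time"])

new_df = matched[["logged_at", "item", "value", "id"]]
print(new_df)
```

Rows of df1 with no enclosing interval keep a missing id instead of being dropped. If intervals of the same item can overlap, this approach returns only one of the matching ids, so the assumption should be checked against the real data first.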
