
How to search a substring from one df in another df?

I have read this post and would like to do something similar.

I have 2 dfs:

df1:

file_num  city      address_line
1         Toronto   123 Fake St
2         Montreal  456 Sample Ave

df2:

DB_Num  Address
AB1     Toronto 123 Fake St
AB3     789 Random Drive, Toronto

I want to know which DB_Num in df2 matches both the address_line and the city in df1, and to include the file_num the match came from.

My ideal output is:

file_num  city     address_line  DB_Num  Address
1         Toronto  123 Fake St   AB1     Toronto 123 Fake St

Based on the post linked above, I have built a lookahead regex and am searching with the insert and str.extract methods.

df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
df = df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ')', expand=False))

Since the addresses in df2 are entered manually, the city and street are sometimes out of order.

Because of that, I am using regex lookaheads.

The lookaheads cause str.extract to not output any value, although I can still filter out the nulls and that keeps only the correct matches.
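The empty output can be reproduced in isolation. The snippet below is a minimal sketch, not the asker's actual data: the two sample addresses are copied from df2 above, and the names `addresses`, `pat`, `extracted` and `matched` are illustrative. It shows that a capture group wrapping only lookaheads captures an empty string, because lookaheads are zero-width assertions that consume no characters, while `str.contains` can still use the same pattern as a boolean filter.

```python
import re
import pandas as pd

addresses = pd.Series(["Toronto 123 Fake St", "789 Random Drive, Toronto"])

# Lookahead-only pattern: asserts that both substrings occur, in any order.
pat = "(?=.*" + re.escape("Toronto") + ")(?=.*" + re.escape("123 Fake St") + ")"

# The capture group wraps only zero-width assertions, so a successful
# match captures the empty string; a failed match gives NaN.
extracted = addresses.str.extract("(" + pat + ")", expand=False)

# str.contains only asks whether the pattern matches at all, so the
# lookaheads still work as a row filter.
matched = addresses.str.contains(pat, regex=True)

print(extracted.tolist())
print(matched.tolist())
```

This is why the pattern can tell matching rows from non-matching ones but cannot report which df1 row produced the match: nothing is captured that could serve as a join key.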

My main problem is that I have no way to join back to df1 to get the file_num.

I can solve this with a for loop that iterates over each record to search, but it takes too long. df1 actually has around 5000 records and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?

Thanks!

Start by creating a new series that holds, for each "Address" in df2, the "address_line" from df1 that it contains, if there is one:

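import re
# Escape each address line so it is matched literally, then OR them together
r = '({})'.format('|'.join(map(re.escape, df1.address_line)))
# For each df2 Address, the df1 address_line it contains (NaN if none)
merge_df = df2.Address.str.extract(r, expand=False)
merge_df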

#output:

0    123 Fake St
1            NaN
Name: Address, dtype: object

Now we merge our df1 on the "address_line" column, and our df2 on our "merge_df" series:

df1.merge(df2, left_on='address_line', right_on=merge_df)
index  file_num  City     address_line  DB_num  Address
0      1.0       Toronto  123 Fake St   AB1     Toronto 123 Fake St
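The merge above matches on address_line only, while the question also asks for the city to match. One possible extension is sketched below; it is an illustration under stated assumptions, not part of the original answer. The frames are rebuilt from the sample tables in the question, and the names `pattern`, `candidates`, `merged`, `city_found` and `result` are made up for the example. It assumes each df2 address contains at most one df1 address line (`str.extract` returns only the first match) and that a plain, case-sensitive substring test is good enough for the city.

```python
import re
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "file_num": [1, 2],
    "city": ["Toronto", "Montreal"],
    "address_line": ["123 Fake St", "456 Sample Ave"],
})
df2 = pd.DataFrame({
    "DB_Num": ["AB1", "AB3"],
    "Address": ["Toronto 123 Fake St", "789 Random Drive, Toronto"],
})

# One alternation of all address lines, escaped so they match literally.
pattern = "(" + "|".join(map(re.escape, df1["address_line"])) + ")"

# For each df2 row, the df1 address_line found in Address (NaN if none).
candidates = df2.assign(
    address_line=df2["Address"].str.extract(pattern, expand=False)
).dropna(subset=["address_line"])

# Join on the extracted key to bring in file_num and city.
merged = df1.merge(candidates, on="address_line", how="inner")

# Keep only rows whose Address also mentions the city, wherever it appears.
# This loop runs over the already-reduced merge result, not over all of df2.
city_found = pd.Series(
    [city in addr for city, addr in zip(merged["city"], merged["Address"])],
    index=merged.index,
    dtype=bool,
)
result = merged[city_found].reset_index(drop=True)
print(result)
```

Because the street and the city are checked independently, the order in which they were typed into df2 does not matter, and file_num comes along through the merge. With thousands of address lines the alternation becomes a large regex, so it may be worth timing this against the real data before relying on it.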

