How do I search for substrings from one df in another df?

Question

I have read this post and would like to do something similar.

I have 2 dfs:

df1:

file_num	city	address_line
1	Toronto	123 Fake St
2	Montreal	456 Sample Ave

df2:

DB_Num	Address
AB1	Toronto 123 Fake St
AB3	789 Random Drive, Toronto

I want to know which DB_Num values in df2 match both the address_line and the city in df1, and which file_num each match came from.

My ideal output is:

file_num	city	address_line	DB_Num	Address
1	Toronto	123 Fake St	AB1	Toronto 123 Fake St

Based on the linked post above, I built a lookahead regex and am searching with the insert and str.extract methods.

df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
df = df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ')', expand=False))

Since the addresses in df2 are entered manually, their parts are sometimes out of order.

Because of that, I am using regex lookaheads.

The lookaheads cause str.extract to return no value, although I can still filter out the nulls and keep only the correct matches.
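A likely explanation for the empty result: lookaheads are zero-width, so a capture group that contains only lookaheads matches without consuming any text. The minimal sketch below illustrates this on a hypothetical two-row Series (the `addresses` name and its values are assumptions taken from the sample data above): the matching row yields an empty string and the non-matching row yields NaN, which is why filtering nulls still works while the captured text tells you nothing about which df1 row matched.

```python
import pandas as pd

# Hypothetical sample mirroring df2['Address'] from the question.
addresses = pd.Series(["Toronto 123 Fake St", "789 Random Drive, Toronto"])

# A capture group made only of lookaheads: it asserts that both pieces
# appear somewhere in the string, but it consumes no characters itself.
pat = "((?=.*Toronto)(?=.*123 Fake St))"

extracted = addresses.str.extract(pat, expand=False)

# Row 0 matches, but the captured text is the empty string;
# row 1 does not match at all, so it is NaN.
print(extracted.tolist())
```

If only a yes/no answer is needed, `str.contains` with the same pattern gives the same filter without the misleading empty capture.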

My main problem is that I have no way to join back to df1 to get the file_num.

I can solve this with a for loop that iterates over each record, but it takes too long. df1 actually has around 5,000 records and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?

Thanks!

Answer 1

Start by creating a new Series that holds, for each "Address" in df2, the matching "address_line" from df1, or NaN if no such row exists:

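import re

# One alternation of all df1 address lines, escaped so they match literally
r = '({})'.format('|'.join(map(re.escape, df1.address_line)))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df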

#output:

0    123 Fake St
1            NaN
Name: Address, dtype: object

Now merge df1 on its "address_line" column with df2 on the "merge_df" Series:

df1.merge(df2, left_on='address_line', right_on=merge_df)

index	file_num	City	address_line	DB_num	Address
0	1.0	Toronto	123 Fake St	AB1	Toronto 123 Fake St
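For reference, here is a self-contained sketch of the two steps above on the sample data from the question. It makes a few assumptions that are not part of the original answer: the extracted text is stored as a regular column via `assign` rather than passed as a Series to `right_on`, the address lines are escaped with `re.escape`, and a final city check is added because the extract step only looks at address_line. The names `pattern`, `matched`, `city_ok`, and `result` are illustrative.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "file_num": [1, 2],
    "city": ["Toronto", "Montreal"],
    "address_line": ["123 Fake St", "456 Sample Ave"],
})
df2 = pd.DataFrame({
    "DB_Num": ["AB1", "AB3"],
    "Address": ["Toronto 123 Fake St", "789 Random Drive, Toronto"],
})

# Step 1: one vectorized pass over df2 that records which df1
# address_line (if any) appears inside each Address.
pattern = "(" + "|".join(map(re.escape, df1["address_line"])) + ")"
matched = df2.assign(
    address_line=df2["Address"].str.extract(pattern, expand=False)
).dropna(subset=["address_line"])

# Step 2: join back to df1 on the extracted text to recover file_num.
result = df1.merge(matched, on="address_line", how="inner")

# Assumption: the city must also appear somewhere in Address, in any order.
city_ok = pd.Series(
    [city in addr for city, addr in zip(result["city"], result["Address"])],
    index=result.index,
    dtype=bool,
)
result = result[city_ok].reset_index(drop=True)
print(result)
```

Because the city test runs only on rows that already matched an address line, the Python-level loop touches the candidate matches rather than all of df2. Note that `str.extract` keeps only the first match per Address, so a row containing several df1 address lines would be attributed to one of them.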

Answer 1 status: 2 votes, accepted, 2022-05-27 19:50:04
