How to search a substring from one df in another df?
I have read this post and would like to do something similar.
I have two DataFrames:
df1:

| file_num | city | address_line |
|---|---|---|
| 1 | Toronto | 123 Fake St |
| 2 | Montreal | 456 Sample Ave |
df2:

| DB_Num | Address |
|---|---|
| AB1 | Toronto 123 Fake St |
| AB3 | 789 Random Drive, Toronto |
I want to know which DB_Num in df2 matches both address_line and city in df1, and which file_num the match came from.
My ideal output is:

| file_num | city | address_line | DB_Num | Address |
|---|---|---|---|---|
| 1 | Toronto | 123 Fake St | AB1 | Toronto 123 Fake St |
Based on the linked post above, I built a lookahead regex and am searching with the `insert` and `str.extract` methods.
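# Build one lookahead pattern per df1 row so city and address_line can appear in any order.
df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
# DataFrame.insert modifies df2 in place and returns None, so its result is not assigned.
df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ")", expand=False))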
The addresses in df2 are entered manually, so the city and street parts are sometimes out of order.
That is why I am using regex lookaheads.
The lookaheads cause `str.extract` to output no value for the matching rows. I can still filter out the nulls, which keeps only the correct matches.
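A likely explanation for the empty output: a lookahead is a zero-width assertion, so a capture group that contains only lookaheads captures an empty string when it succeeds. The following is a minimal sketch using the standard `re` module, with sample strings taken from the tables above. It illustrates the behaviour and is not the asker's actual data pipeline.

```python
import re

# Same shape as one entry of `search_field`: two lookaheads, nothing consumed.
pattern = re.compile("((?=.*Toronto)(?=.*123 Fake St))")

hit = pattern.search("Toronto 123 Fake St")
miss = pattern.search("789 Random Drive, Toronto")

print(repr(hit.group(1)))  # the group matched, but it captured an empty string
print(miss)                # no match object, because "123 Fake St" is absent
```

This is consistent with what the question describes: matching rows come back as empty strings rather than nulls, so filtering nulls still separates matches from non-matches, but the extracted value cannot serve as a join key.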
My main problem is that I have no way to join back to df1 to get the file_num.
I can solve this with a for loop that searches record by record, but it takes too long. df1 actually has around 5000 records and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
Thanks!
Start by creating a new Series that holds, for each "Address" in df2, the matching "address_line" from df1, or NaN if no such row exists:
r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df
# output:
0    123 Fake St
1            NaN
Name: Address, dtype: object
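One caveat with joining raw address lines into a pattern: any regex metacharacters in them (such as `.` or parentheses) would be interpreted as regex syntax. A small sketch of escaping each literal with `re.escape` before joining is shown below. The second address line is a hypothetical example, not from the original data.

```python
import re

# Hypothetical address lines; the second contains regex metacharacters.
address_lines = ["123 Fake St", "456 Sample Ave. (Unit 2)"]

# Escape each literal before joining, so "." and "(" are matched literally.
pattern = "(" + "|".join(re.escape(a) for a in address_lines) + ")"

match = re.search(pattern, "Montreal 456 Sample Ave. (Unit 2)")
print(match.group(1))
```

The same escaped pattern string can be passed to `str.extract`, since it still contains exactly one capture group.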
Now we merge df1 on its "address_line" column with df2 on the "merge_df" Series:
df1.merge(df2, left_on='address_line', right_on=merge_df)
| index | file_num | City | address_line | DB_num | Address |
|---|---|---|---|---|---|
| 0 | 1.0 | Toronto | 123 Fake St | AB1 | Toronto 123 Fake St |
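The merge above matches on address_line only. Because the question also requires the city to match, here is a hedged end-to-end sketch that rebuilds the sample frames, extracts the address key, merges to recover file_num, and then keeps only rows whose city appears in the free-text address. The variable names `extracted`, `candidates`, `merged`, and `result` are illustrative. The city check is a plain Python comparison over the already-merged rows, which is an assumption about what counts as a city match.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "file_num": [1, 2],
    "city": ["Toronto", "Montreal"],
    "address_line": ["123 Fake St", "456 Sample Ave"],
})
df2 = pd.DataFrame({
    "DB_Num": ["AB1", "AB3"],
    "Address": ["Toronto 123 Fake St", "789 Random Drive, Toronto"],
})

# One alternation of escaped address lines, applied once per df2 row.
pat = "(" + "|".join(map(re.escape, df1["address_line"])) + ")"
extracted = df2["Address"].str.extract(pat, expand=False)

# Attach the extracted key to df2, drop non-matches, then join to recover file_num.
candidates = df2.assign(address_line=extracted).dropna(subset=["address_line"])
merged = df1.merge(candidates, on="address_line", how="inner")

# Keep only rows whose city also appears somewhere in the manually entered address.
city_ok = pd.Series(
    [city in addr for city, addr in zip(merged["city"], merged["Address"])],
    index=merged.index,
    dtype=bool,
)
result = merged[city_ok].reset_index(drop=True)
print(result)
```

Note that `str.extract` returns only the first alternative that matches in each address, so an address containing two different address lines from df1 would be linked to just one of them. Whether a 5000-way alternation is fast enough across millions of rows would need to be measured on the real data.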