
Mapping/Zipping between two Pandas data frames with a partial string match

I have two dataframes of roughly 1,000,000 rows each. Both share a common 'Address' column, which I am using to join the dataframes. Using this join, I wish to move information, which I shall call 'details', from dataframe1 to dataframe2.

df2.details = df2.Address.map(dict(zip(df1.Address,df1.details)))

However, the addresses in the two columns do not all match exactly. I cleaned them as best I could, but I can still only move roughly 40% of the data across. Is there a way to modify my code above to allow for a partial match? I'm totally stumped on this one.

The data is simply as described. Two small dataframes of fabricated sample data are below:

df1 
Address                                    Details
Apt 15 A, Long Street, Fake town, US       A   


df2
Address                                    Details
15A, Long Street, Fake town, U.S.              

First, I would recommend performing the join operation and identifying the rows in each data frame that do not have a perfect match. Once you have identified these rows, set aside the ones that matched and proceed with the following suggestions for the rest:

  • One approach is to parse the addresses and attempt to standardize them. You might try using the usaddress module to standardize your addresses.

  • You could also try the approaches recommended in answer to this question, although they may take some tweaking for your case. It's hard to say without multiple examples of the partial string matches.

  • Another approach would be to use the Google Maps API (or Bing or MapQuest) for address standardization, though with over a million rows per data frame you will far outstrip the free API calls per day and would need to pay for the service.

  • A final suggestion is to use the fuzzywuzzy module for fuzzy (approximate) string matching.
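
As a rough illustration of the standardize-then-fuzzy-match idea, here is a minimal sketch using only the Python standard library. It swaps in difflib.SequenceMatcher as a stand-in for fuzzywuzzy, and the normalize and best_match helpers, the cleaning rules, and the 0.85 threshold are all assumptions made up for the sample data in the question, not a tested recipe for real addresses.

```python
import re
from difflib import SequenceMatcher


def normalize(address):
    """Hypothetical cleaning rules tuned to the sample rows only."""
    text = address.lower()
    text = re.sub(r"[^\w\s]", "", text)                  # drop punctuation: "U.S." -> "us"
    text = re.sub(r"\bapt\b", "", text)                  # drop the "apt" unit marker
    text = re.sub(r"(\d+)\s+([a-z])\b", r"\1\2", text)   # join "15 a" into "15a"
    return " ".join(text.split())                        # collapse repeated whitespace


def best_match(query, candidates, threshold=0.85):
    """Return (candidate, score) for the closest candidate,
    or (None, score) if nothing reaches the threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, query, cand).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Stand-ins for df1 (address -> details) and the df2 Address column.
df1_rows = {"Apt 15 A, Long Street, Fake town, US": "A"}
df2_addresses = [
    "15A, Long Street, Fake town, U.S.",
    "1 Other Road, Elsewhere, US",
]

# Exact lookup on the normalized key first, fuzzy fallback second.
lookup = {normalize(addr): details for addr, details in df1_rows.items()}
matched = {}
for addr in df2_addresses:
    key = normalize(addr)
    if key in lookup:
        matched[addr] = lookup[key]
    else:
        cand, _ = best_match(key, list(lookup))
        matched[addr] = lookup[cand] if cand is not None else None
```

With pandas, the same idea would amount to building the dictionary from the normalized df1 addresses and mapping the normalized df2 addresses through it, as in the question's original one-liner. Note that comparing every unmatched address against every candidate scales poorly at a million rows, so the fuzzy fallback is probably only practical on the rows left over after the exact join, ideally restricted to candidates that share something cheap to compare, such as the town.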
