Fuzzy Matching between 2 columns in pandas dataframe

I have an Excel file with two columns of names. I need to compare the two columns side by side (row by row) and put a fuzzy score in another column.

Any idea how to do it?

You can use the fuzzywuzzy module to calculate the fuzzy score between the two items on each row, iterating over the rows. If your dataset is very long, this could probably be vectorized. This link got me going with fuzzywuzzy last week: https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python/
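For illustration, the row-by-row approach could look roughly like the sketch below. The column names (name_a, name_b, score), the sample names, and the helper add_fuzzy_score are made up for the example, and the difflib fallback is only an assumption so the snippet still runs when fuzzywuzzy is not installed; its scores can differ slightly from fuzzywuzzy's.

```python
import pandas as pd

try:
    # pip install fuzzywuzzy
    from fuzzywuzzy import fuzz

    def ratio(a, b):
        """Similarity score from 0 to 100 using fuzzywuzzy."""
        return fuzz.ratio(a, b)
except ImportError:
    # Assumption: approximate the same 0-100 score with the standard library.
    from difflib import SequenceMatcher

    def ratio(a, b):
        """Similarity score from 0 to 100 using difflib."""
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def add_fuzzy_score(df, left, right, out="score"):
    """Return a copy of df with a row-wise fuzzy score of two columns."""
    result = df.copy()
    # Missing cells are treated as empty strings before scoring.
    pairs = zip(result[left].fillna(""), result[right].fillna(""))
    result[out] = [ratio(str(a), str(b)) for a, b in pairs]
    return result


# Hypothetical data; with a real file you would start from something like
# df = pd.read_excel("names.xlsx")  # file name is an assumption
df = pd.DataFrame(
    {
        "name_a": ["John Smith", "Maria Garcia", "Acme Corp"],
        "name_b": ["Jon Smith", "Maria Garcia", "Acme Corporation"],
    }
)
scored = add_fuzzy_score(df, "name_a", "name_b")
print(scored)
```

fuzz.ratio is only one of the scorers; fuzzywuzzy also provides fuzz.partial_ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio, which can be swapped in depending on how the names differ.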

Python solution: I've implemented the code in Python with parallel processing, which should be much faster than serial computation on large datasets. Furthermore, only those computations where a fuzzy metric score exceeds a threshold are performed in parallel. Please see the link below for the code:

https://github.com/ankitcoder123/Important-Python-Codes/blob/main/Faster%20Fuzzy%20Match%20between%20two%20columns/Fuzzy_match.py

Version compatibility:

pandas version :: 1.1.5 ,
fuzzywuzzy version :: 1.1.0 ,
joblib version :: 0.18.0

Fuzzywuzzy metric explanation: link text

Output from code: (image not included)
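For readers who just want the general shape of a parallel version, here is a minimal sketch using joblib. It is not the code from the linked repository and does not reproduce its threshold logic; the function names, the chunk size and the sample names are assumptions for illustration, and the difflib fallback is only there so the snippet runs without fuzzywuzzy.

```python
from joblib import Parallel, delayed

try:
    # pip install fuzzywuzzy
    from fuzzywuzzy import fuzz

    def ratio(a, b):
        """Similarity score from 0 to 100 using fuzzywuzzy."""
        return fuzz.ratio(a, b)
except ImportError:
    # Assumption: approximate the same 0-100 score with the standard library.
    from difflib import SequenceMatcher

    def ratio(a, b):
        """Similarity score from 0 to 100 using difflib."""
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def score_chunk(pairs):
    """Score a list of (left, right) string pairs in one worker."""
    return [ratio(a, b) for a, b in pairs]


def parallel_scores(left, right, n_jobs=2, chunk_size=1000):
    """Row-wise fuzzy scores for two equal-length sequences, computed in chunks."""
    pairs = list(zip(left, right))
    # Send whole chunks to the workers so the per-task overhead stays small.
    chunks = [pairs[i:i + chunk_size] for i in range(0, len(pairs), chunk_size)]
    results = Parallel(n_jobs=n_jobs)(delayed(score_chunk)(c) for c in chunks)
    # Parallel returns results in submission order, so flattening keeps row order.
    return [s for chunk in results for s in chunk]


# Hypothetical sample data; with a DataFrame you would pass two columns instead.
names_a = ["John Smith", "Maria Garcia", "Acme Corp"]
names_b = ["Jon Smith", "Maria Garcia", "Acme Corporation"]
print(parallel_scores(names_a, names_b, n_jobs=2, chunk_size=2))
```

Whether this actually beats a plain loop depends on the data size: for a few thousand rows the cost of starting worker processes can outweigh the gain, so it is worth timing both.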
