Fuzzy Matching between 2 columns in pandas dataframe

I have an Excel file with two columns of names. I need to compare the two columns side by side (row by row) and put a fuzzy match score in another column.

Any idea how to do this?

You can use the fuzzywuzzy module to calculate the fuzzy score between the two items on the same row, iterating over the rows. If your dataset is very long, this could probably be vectorized. This link got me going with fuzzywuzzy last week: https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python/
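
A minimal sketch of the row-by-row approach, assuming the two columns have already been loaded into a DataFrame. The column names `name_a` and `name_b`, the sample names, and the helper `similarity` are made up for illustration; with a real file you would load the data with something like `pd.read_excel`. The score uses `fuzz.ratio` from fuzzywuzzy, and the `difflib` fallback is only there so the example still runs where fuzzywuzzy is not installed.

```python
import pandas as pd

try:
    # fuzzywuzzy's simple ratio: an integer similarity score from 0 to 100
    from fuzzywuzzy import fuzz

    def similarity(a, b):
        return fuzz.ratio(a, b)
except ImportError:
    # Assumption: fall back to the standard library if fuzzywuzzy is missing.
    # The scores are comparable but not guaranteed to be identical.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# Hypothetical sample data standing in for the Excel file
df = pd.DataFrame({
    "name_a": ["John Smith", "Maria Garcia", "Wei Chen"],
    "name_b": ["Jon Smith", "Maria Garcia", "Chen Wei"],
})

# Score each pair of names that sit on the same row
df["fuzzy_score"] = [
    similarity(str(a), str(b)) for a, b in zip(df["name_a"], df["name_b"])
]

print(df)
```

Identical names score 100 and the score drops as the strings diverge. Other fuzzywuzzy scorers such as `fuzz.token_sort_ratio` may suit names whose word order varies.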

Python solution: I've implemented this in Python with parallel processing, which should be much faster than serial computation on large inputs. In addition, only the computations where the fuzzy metric score exceeds a threshold are performed in parallel. See the link below for the code:

https://github.com/ankitcoder123/Important-Python-Codes/blob/main/Faster%20Fuzzy%20Match%20between%20two%20columns/Fuzzy_match.py

Version compatibility:

pandas version: 1.1.5
fuzzywuzzy version: 1.1.0
joblib version: 0.18.0

Fuzzywuzzy metric explanation: link text

Output from code: [image]
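
For readers who cannot open the link, here is a rough sketch of how row-wise scoring could be spread across worker processes with joblib. It is not the linked script. The column names `left` and `right`, the helpers `similarity`, `score_pair` and `score_columns`, the sample data, and the threshold of 80 are all assumptions for illustration. The threshold is applied here as a simple filter after scoring, which may differ from how the linked code uses it.

```python
import pandas as pd
from joblib import Parallel, delayed

try:
    from fuzzywuzzy import fuzz

    def similarity(a, b):
        return fuzz.ratio(a, b)
except ImportError:
    # Assumption: standard-library fallback so the sketch runs without fuzzywuzzy
    from difflib import SequenceMatcher

    def similarity(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def score_pair(a, b):
    """Score one row; values are cast to str so NaN or numbers do not raise."""
    return similarity(str(a), str(b))


def score_columns(frame, left, right, n_jobs=2):
    """Return one fuzzy score per row, computed by n_jobs workers."""
    pairs = zip(frame[left], frame[right])
    return Parallel(n_jobs=n_jobs)(delayed(score_pair)(a, b) for a, b in pairs)


# Hypothetical sample data
df = pd.DataFrame({
    "left": ["Acme Corp", "Globex Inc", "Initech"],
    "right": ["Acme Corp", "Globex Incorporated", "Umbrella"],
})

df["fuzzy_score"] = score_columns(df, "left", "right")

# Hypothetical cut-off: keep only the rows treated as likely matches
THRESHOLD = 80
matches = df[df["fuzzy_score"] >= THRESHOLD]

print(df)
print(matches)
```

For a few thousand rows the cost of starting workers can outweigh the gain, so it is worth timing this against a plain loop before adopting it.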

