String matching - best distance algorithm to use
I have two dataframes, df1 and df2, that have information about polling stations. The dataframes are of different lengths. Both dataframes have a column called ps_name, which is the name of the polling station, and a column called district that indicates which district the polling station is located in.
I am trying to match strings on the ps_name column while blocking on the district column, so that I can copy a geolocations (latitude and longitude) column from df1 to df2 wherever the names match.
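For illustration, the blocking step can be sketched in plain Python. This is only a sketch under some assumptions: the rows are lists of dicts (for example from `df.to_dict("records")`), the standard-library `difflib` ratio stands in for jaro-winkler, and the function names, district labels, and coordinates are hypothetical placeholders.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (difflib ratio, not jaro-winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_within_district(rows1, rows2, threshold=0.88):
    """For each row of rows2, copy geolocations from the best rows1 match in the same district."""
    # Blocking: index rows1 by district so names are only compared within a district.
    by_district = defaultdict(list)
    for row in rows1:
        by_district[row["district"]].append(row)

    result = []
    for row in rows2:
        best_score, best = 0.0, None
        for cand in by_district.get(row["district"], []):
            score = similarity(cand["ps_name"], row["ps_name"])
            if score > best_score:
                best_score, best = score, cand
        # Keep the match only if it clears the threshold; otherwise leave it empty.
        geo = best["geolocations"] if best is not None and best_score >= threshold else None
        result.append({**row, "geolocations": geo})
    return result


# Placeholder data: district labels and coordinates are made up for the example.
rows1 = [
    {"district": "district_a",
     "ps_name": "AGRICULTURAL OFFICE ATTOCK (MALE) I (P)",
     "geolocations": (1.0, 2.0)},
]
rows2 = [
    {"district": "district_a", "ps_name": "AGRICULTURAL OFFICE ATTOCK (MALE) (P)"},
    {"district": "district_b", "ps_name": "AGRICULTURAL OFFICE ATTOCK (MALE) (P)"},
]
result = match_within_district(rows1, rows2)
```

Here the first row of `rows2` picks up the placeholder coordinates, while the second stays unmatched because no `rows1` candidate shares its district.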
So far I've tried using jaro-winkler at a threshold of 0.88 to compare strings.
# Matched:
**df1:** AGRICULTURAL OFFICE ATTOCK (MALE) I (P)
**df2:** AGRICULTURAL OFFICE ATTOCK (MALE) (P)
# Did not match:
**df1:** govt girls high school peoples colony attock ii
**df2:** high school peoples colony attock ii
What string distance algorithm should I be using? I've tried jaro-winkler and was also considering smith-waterman.
One option is to use Levenshtein distance, which is implemented in the package fuzzywuzzy (or here). The algorithm runs in O(n + d^2), where n is the length of the longer string and d is the edit distance.
Example:
from fuzzywuzzy import fuzz
fuzz.ratio('govt girls high school peoples colony attock ii','high school peoples colony attock ii')
#87
fuzz.ratio('AGRICULTURAL OFFICE ATTOCK (MALE) I (P)', 'AGRICULTURAL OFFICE ATTOCK (MALE) (P)')
#97
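Note that the first pair scores 87, which would still fall just under a cutoff like 0.88, because the extra "govt girls" prefix counts against a character-level ratio. A token-based score may handle that case better; fuzzywuzzy ships scorers such as `fuzz.token_set_ratio` and `fuzz.partial_ratio` for this. The sketch below illustrates the idea with the standard library only; `char_ratio` and `token_overlap` are hypothetical helper names, and `token_overlap` is a simple overlap coefficient, not fuzzywuzzy's exact formula.

```python
from difflib import SequenceMatcher


def char_ratio(a, b):
    """Character-level similarity on a 0-100 scale (difflib ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_overlap(a, b):
    """Percentage of the shorter name's tokens that also occur in the longer name."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0
    shorter, longer = sorted((ta, tb), key=len)
    return round(100 * len(shorter & longer) / len(shorter))


long_name = 'govt girls high school peoples colony attock ii'
short_name = 'high school peoples colony attock ii'

char_score = char_ratio(long_name, short_name)      # 87
token_score = token_overlap(long_name, short_name)  # 100
```

One caveat: an overlap score ignores extra tokens entirely, so a name ending in "I" can score 100 against the same name with no number at all. It may be worth comparing station numbers (i, ii, ...) separately before accepting a match.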