String matching - best distance algorithm to use

Question

I have two dataframes, df1 and df2, that hold information about polling stations. The dataframes are of different lengths. Both have a column called ps_name, which is the name of the polling station, and a column called district, which indicates the district the polling station is located in.

I am trying to match strings on the ps_name column while blocking on the district column, so that on a match I can copy the geolocations (latitude and longitude) column from df1 to df2.

So far I've tried comparing strings with jaro-winkler at a threshold of 0.88.

# Matched:
**df1:** AGRICULTURAL OFFICE ATTOCK (MALE) I (P)
**df2:** AGRICULTURAL OFFICE ATTOCK (MALE) (P)

# Did not match:
**df1:** govt girls high school peoples colony attock ii
**df2:** high school peoples colony attock ii

What string distance algorithm should I be using? I've tried jaro-winkler and was also considering smith-waterman.

Answer 1

One option is Levenshtein distance, which is implemented in the fuzzywuzzy package (or here). The algorithm runs in O(n + d^2), where n is the length of the longer string and d is the edit distance.
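To make the metric concrete, here is a minimal sketch of Levenshtein distance in plain Python. It is an assumption-laden illustration, not the code fuzzywuzzy runs: it uses the textbook dynamic-programming table, which takes O(len(a) * len(b)) time rather than the faster bound quoted above, and the function name levenshtein is chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (illustrative sketch)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


# The pair from the question that jaro-winkler missed differs only by a prefix
print(levenshtein('govt girls high school peoples colony attock ii',
                  'high school peoples colony attock ii'))
```

The raw distance grows with the length of the missing prefix, which is why a normalized score such as a 0-100 ratio is usually easier to threshold than the distance itself.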

Example:

from fuzzywuzzy import fuzz
fuzz.ratio('govt girls high school peoples colony attock ii', 'high school peoples colony attock ii')
# 87
fuzz.ratio('AGRICULTURAL OFFICE ATTOCK (MALE) I (P)', 'AGRICULTURAL OFFICE ATTOCK (MALE) (P)')
# 97
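The question also asks about blocking on district, so here is a rough sketch of how a ratio-style score could be combined with blocking. Several things are assumptions made for illustration: the standard library's difflib.SequenceMatcher stands in for fuzz.ratio so the snippet runs without extra packages, rows are plain dicts rather than dataframe rows, the helper names similarity and match_within_district are hypothetical, the cutoff of 85 is an arbitrary starting point, and the district value and coordinates in the sample rows are placeholders rather than real data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Case-insensitive 0-100 similarity score (difflib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_within_district(rows1, rows2, threshold=85):
    """For each row in rows2, copy geolocations from the best-scoring
    rows1 entry in the same district, if its score reaches the threshold."""
    # Blocking: group the reference rows by district so that names are
    # only ever compared inside the same district.
    by_district = {}
    for row in rows1:
        by_district.setdefault(row["district"], []).append(row)

    matched = []
    for row in rows2:
        candidates = by_district.get(row["district"], [])
        best = max(
            candidates,
            key=lambda c: similarity(c["ps_name"], row["ps_name"]),
            default=None,
        )
        score = similarity(best["ps_name"], row["ps_name"]) if best else 0
        geo = best["geolocations"] if best and score >= threshold else None
        matched.append({**row, "geolocations": geo, "score": score})
    return matched


# Hypothetical sample rows; the district label and coordinates are placeholders.
rows1 = [
    {"district": "ATTOCK",
     "ps_name": "govt girls high school peoples colony attock ii",
     "geolocations": (0.0, 0.0)},
]
rows2 = [
    {"district": "ATTOCK", "ps_name": "high school peoples colony attock ii"},
]
for result in match_within_district(rows1, rows2):
    print(result["ps_name"], result["score"], result["geolocations"])
```

Taking only the best candidate per district keeps one station from receiving several sets of coordinates. The threshold would need tuning against a hand-checked sample before trusting the copied locations.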

Answer 1 timestamp: 2020-07-22 03:40:21
