String matching - best distance algorithm to use

I have two dataframes, df1 and df2, that contain information about polling stations. The dataframes have different lengths. Both have a column called ps_name, which holds the name of the polling station, and a column called district, which indicates the district in which the polling station is located.

I am trying to match strings on the ps_name column while blocking on the district column, so that I can copy the geolocations column (latitude and longitude) from df1 to df2 for matching rows.
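To make the setup concrete, here is a minimal sketch of blocking on district before comparing names. It uses plain lists of dicts as stand-ins for the dataframe rows (the shape that `df.to_dict("records")` produces). The scorer is `difflib.SequenceMatcher` from the standard library, used only as a placeholder for whichever similarity is eventually chosen. The function names, the district value and the coordinates are hypothetical and only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Placeholder scorer in [0, 1]; swap in any other string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def copy_geolocations(rows1, rows2, threshold=0.88):
    """For each row in rows2, copy `geolocations` from the best-scoring
    rows1 entry in the same district, if that score reaches the threshold."""
    # Blocking: group rows1 by district so names are only compared
    # against candidates from the same district.
    by_district = {}
    for row in rows1:
        by_district.setdefault(row["district"], []).append(row)

    result = []
    for row in rows2:
        candidates = by_district.get(row["district"], [])
        best = max(
            candidates,
            key=lambda c: similarity(c["ps_name"], row["ps_name"]),
            default=None,
        )
        matched = (
            best is not None
            and similarity(best["ps_name"], row["ps_name"]) >= threshold
        )
        # Unmatched rows get None instead of a guessed location.
        result.append({**row, "geolocations": best["geolocations"] if matched else None})
    return result


# Illustrative data: the district and the coordinates are made up.
rows1 = [{"ps_name": "AGRICULTURAL OFFICE ATTOCK (MALE) I (P)",
          "district": "ATTOCK", "geolocations": (33.77, 72.36)}]
rows2 = [{"ps_name": "AGRICULTURAL OFFICE ATTOCK (MALE) (P)", "district": "ATTOCK"}]
matched_rows = copy_geolocations(rows1, rows2)
```

Grouping by district first keeps the number of pairwise comparisons small and prevents a station from picking up the coordinates of a similarly named station in another district.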

So far I've tried comparing strings with Jaro-Winkler at a threshold of 0.88.

# Matched:
**df1:** AGRICULTURAL OFFICE ATTOCK (MALE) I (P)
**df2:** AGRICULTURAL OFFICE ATTOCK (MALE) (P)

# Did not match:
**df1:** govt girls high school peoples colony attock ii
**df2:** high school peoples colony attock ii

What string distance algorithm should I be using? I've tried Jaro-Winkler and was also considering Smith-Waterman.
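For reference, the Jaro-Winkler similarity mentioned above can be sketched in pure Python as follows. This is a textbook-style version with the common defaults (prefix scale 0.1, prefix capped at 4 characters). It is not the implementation of any particular library, so scores may differ slightly from the ones used in the question. The function names are illustrative.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)
```

The prefix bonus is worth noting for the two pairs above: the matched pair shares a long common prefix, while "govt girls high school ..." and "high school ..." differ in their first character and so receive no bonus at all. That may be part of why the second pair falls below the threshold.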

One option is to use Levenshtein distance, which is implemented in the package fuzzywuzzy. The algorithm runs in O(n + d^2), where n is the length of the longer string and d is the edit distance.

Example:

from fuzzywuzzy import fuzz

# Pair from the question that did not match with Jaro-Winkler at 0.88
fuzz.ratio('govt girls high school peoples colony attock ii', 'high school peoples colony attock ii')
# 87

# Pair from the question that did match
fuzz.ratio('AGRICULTURAL OFFICE ATTOCK (MALE) I (P)', 'AGRICULTURAL OFFICE ATTOCK (MALE) (P)')
# 97
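To see where scores like these can come from, here is a small standard-library sketch of Levenshtein distance using the textbook dynamic-programming table. It runs in O(len(a) * len(b)) time and is not the faster O(n + d^2) variant mentioned above. The `similarity_percent` normalisation is an assumption chosen for illustration and is not claimed to be fuzzywuzzy's internal formula. It happens to give 87 and 97 for the two pairs from the question because each pair differs only by deleted characters.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """Illustrative 0-100 score: share of the combined length left unedited."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * (total - levenshtein(a, b)) / total)


pairs = [
    ("govt girls high school peoples colony attock ii",
     "high school peoples colony attock ii"),
    ("AGRICULTURAL OFFICE ATTOCK (MALE) I (P)",
     "AGRICULTURAL OFFICE ATTOCK (MALE) (P)"),
]
for left, right in pairs:
    print(levenshtein(left, right), similarity_percent(left, right))
# prints "11 87" and then "2 97"
```

The first pair is 11 edits apart (the deleted "govt girls " prefix) and the second is 2 edits apart (the deleted "I "), so an edit-distance score penalises the missing prefix in proportion to its length rather than in proportion to its position.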
