
Algorithms for computing the distance between two strings

Is there a string distance algorithm that does not take the order of the words into account?

The following algorithms do not give the desired result (in this example the desired result is 1):

import jaro
jaro.jaro_winkler_metric(u'Michael Jordan',u'Jordan Michael')
>>>0.47

import Levenshtein
Levenshtein.ratio('Michael Jordan', 'Jordan Michael')
>>>0.5

from difflib import SequenceMatcher
SequenceMatcher(None, 'Michael Jordan', 'Jordan Michael').ratio()
>>>0.5

One workaround is to sort the characters of each string alphabetically and then apply one of the algorithms above:

''.join(sorted('Michael Jordan'))
>>>' JMaacdehilnor'

''.join(sorted('Jordan Michael'))
>>>' JMaacdehilnor'

But here the distinction between first name and surname is lost, and the results will not be 'stable'.

I have written a function, using permutations from itertools, that generates every possible ordering of the words, compares the resulting strings, and outputs the maximum value. The results are satisfactory, but the whole procedure is really slow when I have to compare millions of names.
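
The function itself is not shown in the question, so the following is only a minimal sketch of what such a permutation-based comparison might look like. The name best_permutation_ratio is hypothetical, and SequenceMatcher stands in for whichever metric was actually used.

```python
from difflib import SequenceMatcher
from itertools import permutations


def best_permutation_ratio(a: str, b: str) -> float:
    """Return the highest similarity between `b` and any word ordering of `a`."""
    words = a.split()
    # Try every ordering of the words in `a` and keep the best score.
    return max(
        SequenceMatcher(None, " ".join(ordering), b).ratio()
        for ordering in permutations(words)
    )


print(best_permutation_ratio("Michael Jordan", "Jordan Michael"))  # 1.0
```

A name with n words produces n! orderings, and each one triggers a full string comparison, which is consistent with the slowdown described above when millions of names are involved.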

Another option is to sort the words, for example:

' '.join(sorted('Michael Jordan'.split()))
>>>'Jordan Michael'
' '.join(sorted('Jordan Michael'.split()))
>>>'Jordan Michael'

This seems a nice and easy way to reduce the computation, but we lose some sensitive cases. Example:

name1 = ' '.join(sorted('Bizen Dim'.split()))
>>>'Bizen Dim'
name2 = ' '.join(sorted('Dim Mpizen'.split()))
>>>'Dim Mpizen'

SequenceMatcher(None, name1, name2).ratio()
>>>  0.55

These two names are the same, because some people 'translate' the 'b' in their name to 'mp' (I am one of them). With this approach we lose the match.

Is there a string distance algorithm that compares the words without taking their order into consideration? Or is there a recommendation for implementing the desired function efficiently?

Try fuzzywuzzy.

Install:

pip install fuzzywuzzy
pip install python-Levenshtein

Usage, when word order does not matter:

from fuzzywuzzy import fuzz
fuzz.token_sort_ratio(u'Michael Jordan',u'Jordan Michael')
>>>100
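
For readers who cannot install extra packages, the general idea of a token-sort comparison can be approximated with the standard library alone. This is a rough sketch and not fuzzywuzzy's implementation: the helper token_sort_similarity is hypothetical, and it assumes that lowercasing and splitting on whitespace is enough normalisation.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> int:
    """Score two strings from 0 to 100 after sorting their lowercase words."""

    def normalise(s: str) -> str:
        # Lowercase, split on whitespace, sort the words, and rejoin them.
        return " ".join(sorted(s.lower().split()))

    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return round(100 * ratio)


print(token_sort_similarity("Michael Jordan", "Jordan Michael"))  # 100
```

Like the word-sorting approach in the question, this still compares the sorted strings as a whole, so spelling variants that change the sort order of the words can lower the score.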

Try casting to lowercase, then sorting. The problem with sorting the original string is that Python sorts capital letters before lowercase ones. (If you're going for Levenshtein distance, the spaces shouldn't be an issue.)

>>> ''.join(sorted('Michael Jordan'.lower()))
' aacdehijlmnor'

Then use the .index() method to get substring positions. (You can also use the re module, as another answer does, which makes this much more flexible.)

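You can tokenize the two strings (for example, with the NLTK tokenizer), compute the distance between each pair of words, and return the sum of all the distances.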
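
One possible reading of this suggestion is sketched below, under several assumptions: str.split() stands in for the NLTK tokenizer, SequenceMatcher stands in for the word-level distance, and the scores are averaged as similarities rather than summed as distances. The function names are hypothetical, and the greedy pairing is a simplification that does not guarantee an optimal assignment of words.

```python
from difflib import SequenceMatcher


def word_similarity(w1: str, w2: str) -> float:
    """Case-insensitive similarity between two single words (0.0 to 1.0)."""
    return SequenceMatcher(None, w1.lower(), w2.lower()).ratio()


def token_pair_similarity(a: str, b: str) -> float:
    """Greedily pair each word with its closest unused counterpart."""
    words_a, words_b = a.split(), b.split()
    if not words_a or not words_b:
        return 0.0
    # Iterate over the shorter name so every one of its words gets a partner.
    if len(words_a) > len(words_b):
        words_a, words_b = words_b, words_a
    remaining = list(words_b)
    total = 0.0
    for word in words_a:
        best = max(remaining, key=lambda cand: word_similarity(word, cand))
        total += word_similarity(word, best)
        remaining.remove(best)
    # Dividing by the longer length penalises words left without a partner.
    return total / len(words_b)


print(token_pair_similarity("Michael Jordan", "Jordan Michael"))  # 1.0
print(token_pair_similarity("Bizen Dim", "Dim Mpizen"))
```

Because words are matched individually, 'Bizen' is paired with 'Mpizen' regardless of position, and the second call scores well above the 0.55 obtained from comparing the sorted strings. The cost grows with the product of the word counts rather than factorially.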
