
Levenshtein distance with scrambling of characters?

I am looking for a string comparison metric à la Levenshtein that also works when the characters in the string have been scrambled. Does anyone know of such a metric? It would also be great if there were a Python module that could calculate it. Thanks!

You could try the difflib library, or there is also an external library called pylevenshtein.
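A minimal sketch of the difflib route, assuming the standard-library `difflib.SequenceMatcher`; the helper name `similarity` is made up for illustration. Note that this ratio depends on character order, so a scrambled copy of a string will score below 1.0.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio in [0.0, 1.0] (hypothetical helper name)."""
    # The first argument is the optional "junk" predicate; None disables it.
    return difflib.SequenceMatcher(None, a, b).ratio()

print(similarity("banana", "banana"))  # identical strings give 1.0
print(similarity("banana", "nabana"))  # same letters, different order: below 1.0
```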

Count the occurrences of each character in both strings (using a HashMap or equivalent), then subtract the two counts for each character and take the absolute value of each difference. Add all of those together, then divide by 2 (because each mismatch has been counted twice, once as a surplus in one string and once as a deficit in the other).

Example:

banana
batman

char  banana  batman  |difference|
a     3       2       1
b     1       1       0
m     0       1       1
n     2       1       1
t     0       1       1

Therefore you have 1+1+1+1 = 4 -> 4/2 = 2

Check: In banana, change one n to a t and one a to an m (2 changes) and you have the letters in batman.
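The counting steps above can be sketched in Python roughly as follows. This is only an illustration for equal-length strings; the function name is hypothetical, and `collections.Counter` stands in for the HashMap.

```python
from collections import Counter

def scrambled_distance_same_length(a: str, b: str) -> int:
    """Substitutions needed to turn the letters of `a` into the letters of `b`,
    ignoring order. Hypothetical helper; equal-length inputs only."""
    if len(a) != len(b):
        raise ValueError("this sketch only covers equal-length strings")
    counts_a, counts_b = Counter(a), Counter(b)
    # Sum |count in a - count in b| over every character seen in either string.
    # A Counter returns 0 for characters it has not seen.
    total = sum(abs(counts_a[ch] - counts_b[ch])
                for ch in counts_a.keys() | counts_b.keys())
    # One substitution fixes one surplus and one deficit, so halve the total.
    return total // 2

print(scrambled_distance_same_length("banana", "batman"))  # 2
```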

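If the strings have different lengths, compute the difference in length and subtract it from the difference count (above). Divide the result by 2, then add the length difference back. The length difference has to be made up by insertions or deletions, each of which costs one edit on its own, while the remaining mismatches pair up into substitutions.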

Example:

nab
banana

total difference count: 3 (a: |1 - 3| = 2, n: |1 - 2| = 1, b: |1 - 1| = 0)
length difference: 6 - 3 = 3
3 - 3 = 0 -> 0 / 2 = 0 -> 0 + 3 = 3
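A sketch of the general case, covering strings of different lengths, might look like this. The function name `scrambled_distance` is an assumption for illustration, not something from an existing library.

```python
from collections import Counter

def scrambled_distance(a: str, b: str) -> int:
    """Order-insensitive edit count between `a` and `b` (hypothetical helper)."""
    counts_a, counts_b = Counter(a), Counter(b)
    # Total per-character count mismatch across both strings.
    diff = sum(abs(counts_a[ch] - counts_b[ch])
               for ch in counts_a.keys() | counts_b.keys())
    length_gap = abs(len(a) - len(b))
    # The length gap is covered by insertions or deletions (one edit each);
    # the remaining mismatches pair up into substitutions (one edit per pair).
    return (diff - length_gap) // 2 + length_gap

print(scrambled_distance("nab", "banana"))     # 3
print(scrambled_distance("banana", "batman"))  # 2
```

For equal-length strings the length gap is 0, so this reduces to the plain "sum of differences divided by 2" rule.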

Also, I wouldn't use Levenshtein at all here, because much of the difficulty of that problem lies in positioning, which you don't care about.

The dynamic programming solution for Levenshtein distance can easily be modified to catch pairwise scrambling (e.g. delhi vs. dehli) and give it less weight than the corresponding substitutions, additions, or deletions.

Edit: This algorithm already exists and is called the Damerau–Levenshtein distance. Searching for it will turn up a Python package that you can use directly.
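To show what the modification looks like, here is a sketch of the restricted variant of Damerau–Levenshtein, usually called optimal string alignment, which adds one extra case to the usual Levenshtein table for swapping two adjacent characters. It is written from scratch rather than taken from any package, the function name is made up, and every operation is assumed to cost 1.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).
    Hypothetical helper; all operations cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Extra case: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("delhi", "dehli"))  # 1
```

With plain Levenshtein the same pair costs 2 (two substitutions), so the adjacent swap is indeed weighted less here. Note that this only handles swaps of neighbouring characters; it does not make the metric insensitive to arbitrary scrambling.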


 