Levenshtein distance in Python - wrong result with national characters

I found a similar topic, "Levenshtein distance on diacritic characters", but it is about PHP and I write in Python. The problem is the same, though. For instance, levenshtein(kot, kod) = 1 as expected, but levenshtein(się, sie) = 2, which is wrong (it should be 1). Any ideas on how to solve this?

First of all, make sure that both strings are unicode strings. Otherwise the distance is computed over bytes, and a character such as ę takes two bytes in UTF-8, which is where the extra edit comes from. In Python 3 strings are unicode automatically, but in Python 2 you have to decode byte strings to the unicode type first, for example with sys.argv[1].decode('utf-8') if you know that the console encoding is UTF-8. You can try to detect that encoding with sys.stdin.encoding.
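To see why the byte representation inflates the distance, here is a minimal Python 3 sketch (the variable names are only illustrative) comparing the length of the word as text with its length as UTF-8 bytes:

```python
# "się" as text versus "się" as UTF-8 bytes (Python 3).
word = "się"
encoded = word.encode("utf-8")

print(len(word))     # 3 code points: s, i, ę
print(len(encoded))  # 4 bytes: ę (U+0119) is encoded as two bytes
print(encoded)       # b'si\xc4\x99'
```

A Levenshtein function that is fed the 4-byte form needs one substitution and one deletion to reach "sie", hence the result of 2.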

After that you may have to normalize the unicode strings. For example, u'\u00c7' and u'\u0043\u0327' both render as Ç, but they compare as non-equal and have a non-zero Levenshtein distance. To normalize strings, use the unicodedata.normalize function.
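The effect of normalization can be checked directly. This is a small Python 3 sketch using only the standard unicodedata module; the variable names are illustrative:

```python
import unicodedata

composed = "\u00c7"          # Ç as a single precomposed code point
decomposed = "\u0043\u0327"  # C followed by COMBINING CEDILLA

print(composed == decomposed)          # False: different code point sequences
print(len(composed), len(decomposed))  # 1 2

# After normalizing both to the same form, they compare equal.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                  # True
```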

The script in Python 2 might look something like this:

import unicodedata
import sys
# import or define your levenshtein function here

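def decode_and_normalize(s):
    # Decode the raw command-line bytes to unicode, then normalize so that
    # equivalent character sequences become identical code points.
    return unicodedata.normalize('NFKC', s.decode('utf-8'))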

s1 = decode_and_normalize(sys.argv[1])
s2 = decode_and_normalize(sys.argv[2])
print levenshtein(s1, s2)
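For Python 3, where strings are already unicode, a self-contained version might look like the sketch below. The levenshtein function here is a plain dynamic-programming implementation written for illustration, not the function from the question, and normalized_distance is a hypothetical helper name:

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(s1, s2, form="NFKC"):
    """Normalize both strings to the same form before comparing them."""
    return levenshtein(unicodedata.normalize(form, s1),
                       unicodedata.normalize(form, s2))


print(normalized_distance("kot", "kod"))          # 1
print(normalized_distance("się", "sie"))          # 1
print(normalized_distance("\u00c7", "C\u0327"))   # 0
print(levenshtein("się".encode("utf-8"), b"sie")) # 2, the byte-level result
```

The last line reproduces the wrong answer from the question by comparing UTF-8 bytes instead of characters.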

Even after all that, you may still run into problems if the characters are outside the Basic Multilingual Plane. On this issue, see this Stack Overflow question.
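As a rough illustration of the Basic Multilingual Plane issue: a character above U+FFFF needs two UTF-16 code units (a surrogate pair), and "narrow" Python 2 builds store unicode strings that way, so such a character counts as two positions there. Python 3.3 and later index strings by code point, which the sketch below shows (the example character is an arbitrary choice):

```python
# A character outside the Basic Multilingual Plane (Python 3.3+).
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF

code_points = len(clef)
utf16_units = len(clef.encode("utf-16-le")) // 2  # two bytes per code unit

print(code_points)  # 1
print(utf16_units)  # 2: a surrogate pair in UTF-16
```

A Levenshtein function that iterates over UTF-16 code units would count this single character as two, much like the byte-level case above.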
