I found a similar topic, "Levenshtein distance on diacritic characters", but it's in PHP and I write in Python. Still, the problem is the same. For instance, levenshtein(kot, kod) = 1, but levenshtein(się, sie) = 2, which is wrong. Any ideas on how to solve this?
First of all, you have to make sure that both strings are unicode. In Python 3 you get that automatically, but in Python 2 you have to decode the strings to the unicode type first, for example sys.argv[1].decode('utf-8') if you know that the console encoding is UTF-8. You can try to determine that encoding from sys.stdin.encoding.
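To illustrate why decoding matters, here is a small Python 3 sketch. It assumes the input arrives as UTF-8 bytes; the variable names are made up for the example. The letter ę takes two bytes in UTF-8, which is a likely reason for a distance of 2 between się and sie when the comparison runs on bytes instead of characters.

```python
# Assumption: the raw input is UTF-8 encoded bytes, as it would be in Python 2.
raw = "się".encode("utf-8")   # bytes: b'si\xc4\x99'
text = raw.decode("utf-8")    # text: 'się'

# The byte string has one more element than the decoded text,
# because ę (U+0119) is encoded as the two bytes 0xC4 0x99.
byte_count = len(raw)
char_count = len(text)
print(byte_count, char_count)
```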
After that you may have to normalize the unicode. For example, the unicode strings u'\u00c7' (a single precomposed character) and u'\u0043\u0327' (C followed by a combining cedilla) both render as Ç, but they compare as non-equal and have a non-zero Levenshtein distance. To normalize strings you can use the unicodedata.normalize function.
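A minimal Python 3 sketch of that normalization step follows. The helper name normalized_equal is hypothetical, and NFC is used here only as one possible choice of form.

```python
import unicodedata

precomposed = "\u00c7"       # Ç as a single code point
decomposed = "\u0043\u0327"  # C followed by COMBINING CEDILLA

def normalized_equal(a, b, form="NFC"):
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# The raw strings differ in both content and length...
raw_equal = precomposed == decomposed
# ...but they are equal once both are normalized.
same_after_normalizing = normalized_equal(precomposed, decomposed)
print(raw_equal, same_after_normalizing)
```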
The script in Python 2 might look something like this:
import unicodedata
import sys

# import or define your levenshtein function here

def decode_and_normalize(s):
    return unicodedata.normalize('NFKC', s.decode('utf-8'))

s1 = decode_and_normalize(sys.argv[1])
s2 = decode_and_normalize(sys.argv[2])
print levenshtein(s1, s2)
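For Python 3, where strings are already unicode, a rough equivalent could look like the sketch below. The levenshtein function here is a plain dynamic-programming implementation written for this example, not any particular library's, and normalized_distance is a hypothetical helper name.

```python
import unicodedata

def levenshtein(a, b):
    """Edit distance between two sequences (illustrative implementation)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalized_distance(s1, s2, form="NFKC"):
    """Normalize both strings to the same form, then measure the distance."""
    return levenshtein(unicodedata.normalize(form, s1),
                       unicodedata.normalize(form, s2))

print(normalized_distance("kot", "kod"))
print(normalized_distance("się", "sie"))
# Comparing the UTF-8 bytes instead of the text counts ę as two units.
print(levenshtein("się".encode("utf-8"), "sie".encode("utf-8")))
```

With decoded and normalized text, się and sie differ by a single substitution, while the byte-level comparison counts two edits.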
And after all that, you may still run into problems if the characters are outside the Basic Multilingual Plane, because narrow Python 2 builds store each such character as two code units (a surrogate pair). On this issue, see this Stack Overflow question.
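As a small illustration of the code-unit issue, the Python 3 sketch below takes one character outside the Basic Multilingual Plane and counts its UTF-16 code units. The choice of character is arbitrary. Python 3.3 and later treat it as a single character, so the problem mainly concerns narrow Python 2 builds.

```python
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

# Python 3.3+ counts this as one character.
char_count = len(ch)

# In UTF-16 it needs a surrogate pair, i.e. two 16-bit code units.
# A distance computed over code units would see two elements here.
utf16_units = len(ch.encode("utf-16-le")) // 2
print(char_count, utf16_units)
```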