How is Levenshtein Distance calculated on Simplified Chinese characters?

Question

I have 2 queries:

    query1:你好世界
    query2:你好

When I run this code using the Python library Levenshtein:

from Levenshtein import distance, hamming, median
lev_edit_dist = distance(query1,query2)
print lev_edit_dist

I get an output of 12. How is the value 12 derived?

In terms of stroke differences, there are definitely more than 12.

Answer 1

According to its documentation, it supports Unicode:

It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).

You do need to make sure the Chinese characters are passed as Unicode strings, though:

In [1]: from Levenshtein import distance, hamming, median

In [2]: query1 = '你好世界'

In [3]: query2 = '你好'

In [4]: print distance(query1,query2)
6

In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
2
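The gap between 6 and 2 comes from what is being compared. A Python 2 byte string holds the UTF-8 encoding, in which each of these characters takes 3 bytes, so the distance counts byte edits; after decoding, it counts character edits. Strokes are never considered. Here is a minimal sketch of that in Python 3, assuming a hand-written `edit_distance` helper (hypothetical, not the library's implementation) so it runs without installing anything:

```python
def edit_distance(a, b):
    """Textbook dynamic-programming Levenshtein distance over two sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x -> y
        prev = curr
    return prev[-1]

query1 = "你好世界"
query2 = "你好"

# Compare code points: two characters have to be deleted.
chars = edit_distance(query1, query2)

# Compare UTF-8 bytes: those two characters are 3 bytes each.
raw = edit_distance(query1.encode("utf-8"), query2.encode("utf-8"))

print(chars)  # 2
print(raw)    # 6
```

For what it's worth, 你好世界 is 12 bytes long in UTF-8, so a result of 12 also points at a byte-level comparison rather than a character-level one.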

solution1
4 ACCEPTED 2015-06-19 00:36:40
