
How is Levenshtein Distance calculated on Simplified Chinese characters?

I have 2 queries:

    query1:你好世界
    query2:你好

When I run this code using the Python library Levenshtein:

from Levenshtein import distance, hamming, median
lev_edit_dist = distance(query1,query2)
print lev_edit_dist

I get an output of 12. How is the value 12 derived?

In terms of stroke differences, there are definitely more than 12.

According to its documentation, it supports Unicode:

It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).

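You need to make sure the Chinese strings are Unicode objects, though. On a plain (byte) string in Python 2 the distance is counted over the UTF-8 bytes, and each of these characters takes three bytes, so the result is not a count of characters: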

In [1]: from Levenshtein import distance, hamming, median

In [2]: query1 = '你好世界'

In [3]: query2 = '你好'

In [4]: print distance(query1,query2)
6

In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
2
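
To make the byte-versus-character difference concrete, here is a minimal sketch in Python 3, where str is already Unicode. The levenshtein helper below is a hypothetical pure-Python stand-in written for illustration, not the library's own implementation; it just applies the textbook dynamic-programming recurrence to whatever sequence it is given.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (str or bytes), textbook DP.

    Hypothetical helper for illustration; not the Levenshtein library itself.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


query1 = "你好世界"
query2 = "你好"

# Compared as Unicode text: two characters (世 and 界) must be removed.
char_dist = levenshtein(query1, query2)

# Compared as UTF-8 bytes: each of those characters is three bytes.
byte_dist = levenshtein(query1.encode("utf-8"), query2.encode("utf-8"))

print(char_dist)  # 2
print(byte_dist)  # 6
```

Either way the unit being counted is a code point or a byte, never a stroke, so the result says nothing about how different the characters look.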
