How is Levenshtein Distance calculated on Simplified Chinese characters?

I have two queries:

    query1:你好世界
    query2:你好

When I run this code using the Python library Levenshtein:

from Levenshtein import distance, hamming, median
lev_edit_dist = distance(query1,query2)
print lev_edit_dist

I get an output of 12. How is the value 12 derived?

In terms of stroke differences between the two strings, the number is definitely more than 12.

According to its documentation, it supports Unicode:

It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).

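You need to make sure the Chinese characters are Unicode strings, though. On a plain (byte) string the distance is counted in UTF-8 bytes, and each of these characters takes 3 bytes, so the two missing characters count as 6 edits instead of 2: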

In [1]: from Levenshtein import distance, hamming, median

In [2]: query1 = '你好世界'

In [3]: query2 = '你好'

In [4]: print distance(query1,query2)
6

In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
2
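
To see where the 6 and the 2 come from without relying on the library, here is a minimal pure-Python sketch of the standard dynamic-programming edit distance. The function name `levenshtein` is my own and is not the library's API; it simply works on any sequence, so the same code can be run on characters and on UTF-8 bytes. I am assuming the library compares byte strings byte by byte, which is what the outputs above suggest.

```python
# -*- coding: utf-8 -*-

def levenshtein(a, b):
    """Edit distance between two sequences (text, bytes, lists, ...).

    Hypothetical helper for illustration; not part of the Levenshtein library.
    """
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # delete x from a
                current[j - 1] + 1,      # insert y into a
                previous[j - 1] + cost,  # substitute x with y (or keep it)
            ))
        previous = current
    return previous[-1]


query1 = u'你好世界'
query2 = u'你好'

# Compared as Unicode text: one unit per character.
char_distance = levenshtein(query1, query2)

# Compared as UTF-8 bytes: each of these characters is 3 bytes long.
byte_distance = levenshtein(query1.encode('utf8'), query2.encode('utf8'))

print(char_distance)  # 2
print(byte_distance)  # 6
```

In both cases the shorter query is a prefix of the longer one, so the distance is just the number of units that have to be deleted: 2 characters, or 2 x 3 = 6 bytes. Strokes never enter into it, since each character (or byte) is treated as a single indivisible symbol.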
