
Levenshtein distance on non-English strings

Original post: 2010-02-17 11:00:07 · Tags: java, levenshtein-distance, fuzzy-search

Will the Levenshtein distance algorithm work well for non-English language strings too?

Update: Would this work automatically in a language like Java when comparing Asian characters?

3 answers

Only if the language is letter-based, for example Russian or German. For logographic scripts (Chinese, for example) or syllabic ones (like Lao), no.

Yes. But you have to treat each non-English character as one character, not as several (which is what happens if you compare raw UTF-8 bytes, for example). In Python, for instance, you would use the unicode class to represent the string (and its characters).
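
Since the question asks about Java specifically, here is a minimal sketch of what "one character" means there. It assumes the usual Java model where a String is a sequence of UTF-16 code units; the class and method names (CodePointDemo, codeUnits, codePoints) are made up for illustration. Most common CJK characters fit in a single char, but characters outside the Basic Multilingual Plane take two.

```java
// Sketch only: compares String.length() with the number of Unicode code points.
final class CodePointDemo {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Two common Chinese characters (U+4E2D, U+6587), both in the BMP.
        String bmp = "\u4E2D\u6587";
        // U+20000, a CJK Extension B ideograph, stored as a surrogate pair.
        String supplementary = "\uD840\uDC00";

        System.out.println(codeUnits(bmp) + " units, " + codePoints(bmp) + " code points");
        System.out.println(codeUnits(supplementary) + " units, "
                + codePoints(supplementary) + " code point");
    }
}
```

So a char-by-char comparison is probably fine for everyday Chinese, Japanese, or Korean text, and only miscounts when supplementary characters show up.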

Levenshtein doesn't care about languages; it just tells you how many characters need to be changed (added, removed, exchanged) to get from one string to the other.
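
To make that concrete, here is a rough Java sketch of the standard dynamic-programming Levenshtein distance. It is not taken from any particular library; the class name LevenshteinSketch is hypothetical, and comparing Unicode code points rather than char values is my own assumption about what "character" should mean here.

```java
// Sketch only: two-row dynamic-programming Levenshtein distance over code points.
final class LevenshteinSketch {
    /** Minimum number of insertions, deletions, and substitutions to turn a into b. */
    static int distance(String a, String b) {
        // Work on code points so a surrogate pair counts as one character.
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();

        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];

        // Distance from the empty prefix of s to each prefix of t.
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= s.length; i++) {
            curr[0] = i; // delete all i characters of s
            for (int j = 1; j <= t.length; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length];
    }
}
```

With this, `LevenshteinSketch.distance("kitten", "sitting")` is 3, and two Chinese strings that differ in one character come out at distance 1, the same as for Latin text.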

So: yes, but you'll have to check your charset; some foreign "single" characters may otherwise be treated as two (or more) characters.
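
As an illustration of the charset point, here is a small Java sketch. It assumes the text arrives as UTF-8 bytes; the helper name decodedLength is made up. Decoding the same bytes with the wrong charset turns one character into two, which would inflate any distance computed afterwards.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch only: the same bytes give a different character count per charset.
final class CharsetCheck {
    /** Decodes the bytes with the given charset and returns the resulting String length. */
    static int decodedLength(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8);

        System.out.println("bytes: " + utf8.length);
        System.out.println("as UTF-8: " + decodedLength(utf8, StandardCharsets.UTF_8));
        System.out.println("as ISO-8859-1: " + decodedLength(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

Passing the charset explicitly when reading files or network data avoids depending on the platform default.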