
Find a distance measure of graphical similarity of two strings

I had no luck finding a package for this, ideally in Python. Is there a library that lets one compare two strings graphically, that is, by how similar they look?

It would, for instance, help in fighting spam, where senders obfuscate their strings by using я instead of R or, worse, things like Α (Greek capital alpha, U+0391) instead of A.

The interface to such a package could be something like

distance("Foo", "Bar")  # large distance
distance("Αяe", "Are")  # small distance

Thanks!

I'm not aware of a package that does this. However, you may be able to use tools like the homoglyph attack generator, the Unicode Consortium's confusables, the references on Wikipedia's page about the IDN homograph attack, or other such resources to build your own library of look-alikes and compute a score based on that.

EDIT: It looks as though the Unicode folks have compiled a large database of characters that look alike. It is distributed as the file confusables.txt. If I were you, I'd build a script to read this into a Python dictionary and then scan your string for matches. An excerpt is:

FF4A ;  006A ;  MA  # ( ｊ → j ) FULLWIDTH LATIN SMALL LETTER J → LATIN SMALL LETTER J # →ϳ→
2149 ;  006A ;  MA  # ( ⅉ → j ) DOUBLE-STRUCK ITALIC SMALL J → LATIN SMALL LETTER J # 
1D423 ; 006A ;  MA  # ( 𝐣 → j ) MATHEMATICAL BOLD SMALL J → LATIN SMALL LETTER J  # 
1D457 ; 006A ;  MA  # ( 𝑗 → j ) MATHEMATICAL ITALIC SMALL J → LATIN SMALL LETTER J  # 
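The dictionary approach suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: it embeds the four excerpt lines (with shortened comments) instead of reading the full file, and the names `parse_confusables`, `find_lookalikes`, and `LOOKALIKES` are made up for this example.

```python
# Sketch: parse confusables-style lines into a dict, then scan a string.
# EXCERPT stands in for the real data file; the helper names are hypothetical.
EXCERPT = """\
FF4A ;  006A ;  MA  # FULLWIDTH LATIN SMALL LETTER J
2149 ;  006A ;  MA  # DOUBLE-STRUCK ITALIC SMALL J
1D423 ; 006A ;  MA  # MATHEMATICAL BOLD SMALL J
1D457 ; 006A ;  MA  # MATHEMATICAL ITALIC SMALL J
"""


def parse_confusables(text):
    """Map each source character to its prototype string."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        source = chr(int(fields[0], 16))
        # A target may consist of several space-separated code points.
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def find_lookalikes(s, table):
    """Return (index, character, prototype) for each character found in the table."""
    return [(i, c, table[c]) for i, c in enumerate(s) if c in table]


LOOKALIKES = parse_confusables(EXCERPT)
print(find_lookalikes("\uff4aava", LOOKALIKES))  # fullwidth j followed by "ava"
```

A score could then be derived from the number of matches relative to the string length, though how to weight that is a design choice the data does not settle.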

With the information @Richard supplied in his answer, I came up with this short Python 3 script that implements UTS#39:

"""Implement the simple algorithm laid out in UTS#39, paragraph 4
"""

import csv
import re
import unicodedata

comment_pattern = re.compile(r'\s*#.*$')


def skip_comments(lines):
    """
    A filter which strips the comments and yields the
    rest of the lines, skipping lines that end up empty

    :param lines: any object which we can iterate through such as a file
        object, list, tuple, or generator
    """

    for line in lines:
        line = comment_pattern.sub('', line).strip()
        if line:
            yield line


def normalize(s):
    return unicodedata.normalize("NFD", s)


def to_unicode(code_point):
    # int() accepts bare hex digits in any case and ignores surrounding spaces.
    return chr(int(code_point, 16))


def read_table(file_name):
    d = {}
    # "utf-8-sig" decodes UTF-8 and skips a leading byte order mark, if any.
    with open(file_name, encoding="utf-8-sig") as f:
        reader = csv.reader(skip_comments(f), delimiter=";")
        for row in reader:
            source = to_unicode(row[0])
            prototypes = map(to_unicode, row[1].strip().split())
            d[source] = ''.join(prototypes)
    return d


TABLE = read_table("confusables.txt")


def skeleton(s):
    s = normalize(s)
    s = ''.join(TABLE.get(c, c) for c in s)
    return normalize(s)


def confusable(s1, s2):
    return skeleton(s1) == skeleton(s2)


if __name__ == "__main__":
    for strings in [("Foo", "Bar"), ("Αяe", "Are"), ("ｊ", "j")]:
        print(*strings)
        print("Equal:", strings[0] == strings[1])
        print("Confusable:", confusable(*strings), "\n")

It assumes that the file confusables.txt is in the directory the script is run from. In addition, I had to delete the first byte of that file because it was some odd, non-printable symbol (most likely a UTF-8 byte order mark, which open() also skips when given encoding="utf-8-sig").

It only follows the simple algorithm laid out at the beginning of paragraph 4, not the more complicated cases of whole-script and mixed-script confusables laid out in 4.1 and 4.2. That is left as an exercise for the reader.
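As a very rough first check for mixed scripts (explicitly not the procedure from 4.1 or 4.2), one could guess each letter's script from the first word of its Unicode character name. This is an assumption-laden heuristic: the standard unicodedata module does not expose the Script property, so names such as FULLWIDTH LATIN SMALL LETTER J are labelled "FULLWIDTH" here, and the function names are hypothetical.

```python
import unicodedata


def script_guesses(s):
    """Guess scripts from the first word of each letter's Unicode name.

    Heuristic only: this is not the Unicode Script property.
    """
    guesses = set()
    for c in s:
        if not c.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        name = unicodedata.name(c, "")
        if name:
            guesses.add(name.split()[0])
    return guesses


def looks_mixed_script(s):
    """True if the letters of s appear to come from more than one script."""
    return len(script_guesses(s)) > 1


print(script_guesses("Are"))
print(looks_mixed_script("\u0391\u044fe"))  # Greek alpha, Cyrillic ya, Latin e
```

A string flagged this way is merely suspicious; legitimate text can mix scripts, so the result is better used as one signal among several than as a verdict.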

Note that the Unicode data does not treat "я" and "R" as confusable, so the script returns False for strings that rely on that substitution, such as "Αяe" and "Are".
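To get something closer to the distance(...) interface from the question, and to cover pairs such as я and R that the Unicode data omits, one option is to fold both strings through a look-alike table and then compare the folded forms. The sketch below makes several assumptions: EXTRA_LOOKALIKES is a tiny hand-picked table (not taken from confusables.txt), case is ignored via casefold(), and difflib's similarity ratio stands in for a real visual metric.

```python
import difflib

# Hypothetical, hand-picked look-alikes; a real table would be far larger
# and could be merged with the mapping read from confusables.txt.
EXTRA_LOOKALIKES = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u044f": "R",  # CYRILLIC SMALL LETTER YA, treated here as a mirrored R
}


def fold(s, table=EXTRA_LOOKALIKES):
    """Replace known look-alikes by their prototypes and ignore case."""
    return "".join(table.get(c, c) for c in s).casefold()


def distance(s1, s2, table=EXTRA_LOOKALIKES):
    """Return a value in [0, 1]: 0 for equal folded strings, 1 for no overlap."""
    ratio = difflib.SequenceMatcher(None, fold(s1, table), fold(s2, table)).ratio()
    return 1.0 - ratio


print(distance("Foo", "Bar"))  # large distance
print(distance("\u0391\u044fe", "Are"))  # small distance
```

Folding case is a deliberate simplification so that я (mapped to R) matches the lowercase r in "Are"; whether that is acceptable depends on the use case.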
