A distance metric for the visual similarity of two strings

Question

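I'm having trouble finding a package for this, preferably in Python. Is there a library that lets you compare two strings graphically, that is, by how similar they look?

This would help in fighting spam, for example, when people obfuscate a string by writing я instead of R or, even worse, Α (capital alpha, 0x0391) instead of A.

The interface of such a package could look something like this: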

distance("Foo", "Bar")  # large distance
distance("Αяe", "Are")  # small distance

Thanks!

Answer 1

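I'm not aware of a package that does this. However, you may be able to use resources such as the homoglyph attack generator, the Unicode Consortium's confusables, the references on the Wikipedia page about IDN homograph attacks, or similar sources to build your own library of look-alikes and base a score on that.

EDIT: It looks as though the Unicode folks have already compiled a large database of look-alike characters (the confusables.txt data file). If I were you, I'd write a script that reads it into a Python dictionary and then scans your strings for matches. An excerpt: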

FF4A ;  006A ;  MA  # ( ｊ → j ) FULLWIDTH LATIN SMALL LETTER J → LATIN SMALL LETTER J # →ϳ→
2149 ;  006A ;  MA  # ( ⅉ → j ) DOUBLE-STRUCK ITALIC SMALL J → LATIN SMALL LETTER J # 
1D423 ; 006A ;  MA  # ( 𝐣 → j ) MATHEMATICAL BOLD SMALL J → LATIN SMALL LETTER J  # 
1D457 ; 006A ;  MA  # ( 𝑗 → j ) MATHEMATICAL ITALIC SMALL J → LATIN SMALL LETTER J  #
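
A minimal sketch of that idea, assuming the four-line excerpt above stands in for the full data file. The names `EXCERPT`, `parse_confusables`, and `find_lookalikes` are made up for illustration and are not part of any library.

```python
# Sketch only: EXCERPT stands in for the full Unicode data file, and the
# function names below are hypothetical.
EXCERPT = """\
FF4A ;  006A ;  MA  # ( ｊ → j ) FULLWIDTH LATIN SMALL LETTER J → LATIN SMALL LETTER J
2149 ;  006A ;  MA  # ( ⅉ → j ) DOUBLE-STRUCK ITALIC SMALL J → LATIN SMALL LETTER J
1D423 ; 006A ;  MA  # ( 𝐣 → j ) MATHEMATICAL BOLD SMALL J → LATIN SMALL LETTER J
1D457 ; 006A ;  MA  # ( 𝑗 → j ) MATHEMATICAL ITALIC SMALL J → LATIN SMALL LETTER J
"""


def parse_confusables(text):
    """Map each look-alike character to the string it imitates."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        source, target = [field.strip() for field in data.split(";")[:2]]
        # The target may be a sequence of several code points.
        table[chr(int(source, 16))] = "".join(
            chr(int(cp, 16)) for cp in target.split()
        )
    return table


def find_lookalikes(s, table):
    """Return (index, character, imitated string) for each match in s."""
    return [(i, c, table[c]) for i, c in enumerate(s) if c in table]


table = parse_confusables(EXCERPT)
print(find_lookalikes("ｊava", table))
```

The number of matches, or the share of characters that match, could then serve as the basis for a score.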

Answer 2

With the information @Richard provided in his answer, I came up with this short Python 3 script that implements UTS#39:

"""Implement the simple algorithm laid out in UTS#39, paragraph 4
"""

import csv
import re
import unicodedata

comment_pattern = re.compile(r'\s*#.*$')


def skip_comments(lines):
    """
    A filter which strips the comments and yields the
    remaining non-empty lines

    :param lines: any object which we can iterate through such as a file
        object, list, tuple, or generator
    """

    for line in lines:
        line = comment_pattern.sub('', line).strip()
        if line:
            yield line


def normalize(s):
    return unicodedata.normalize("NFD", s)


def to_unicode(code_point):
    """Convert a hexadecimal code point such as 'FF4A' to its character."""
    return chr(int(code_point, 16))


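def read_table(file_name):
    """Build a dict mapping each source character to its prototype string."""
    d = {}
    # utf-8-sig decodes the file as UTF-8 and drops a leading byte order
    # mark if there is one, so the file does not need to be edited by hand
    with open(file_name, encoding="utf-8-sig") as f:
        reader = csv.reader(skip_comments(f), delimiter=";")
        for row in reader:
            source = to_unicode(row[0])
            prototypes = map(to_unicode, row[1].strip().split())
            d[source] = ''.join(prototypes)
    return d


TABLE = read_table("confusables.txt")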


def skeleton(s):
    s = normalize(s)
    s = ''.join(TABLE.get(c, c) for c in s)
    return normalize(s)


def confusable(s1, s2):
    return skeleton(s1) == skeleton(s2)


if __name__ == "__main__":
    for strings in [("Foo", "Bar"), ("Αяe", "Are"), ("ｊ", "j")]:
        print(*strings)
        print("Equal:", strings[0] == strings[1])
        print("Confusable:", confusable(*strings), "\n")

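This assumes the file confusables.txt is in the directory the script is run from. I also had to delete the first byte of that file, because it was a strange, non-printable symbol. (That symbol is most likely a UTF-8 byte order mark; opening the file with encoding="utf-8-sig" skips it without modifying the file.)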

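It only follows the simple algorithm laid out at the beginning of paragraph 4, not the more complicated whole-script and mixed-script confusable cases laid out in 4.1 and 4.2. Those are left as an exercise for the reader.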
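
As a rough starting point for the mixed-script case, here is a heuristic sketch. It is not the procedure from UTS#39: it guesses a letter's script from the first word of its Unicode character name, because Python's standard library does not expose the Script property. The function names `rough_scripts` and `looks_mixed_script` are made up for illustration.

```python
import unicodedata


def rough_scripts(s):
    """Guess the scripts used in s from the first word of each letter's
    Unicode name (for example 'LATIN', 'GREEK', 'CYRILLIC').

    Heuristic assumption only: this is not the Script property that
    UTS#39 works with.
    """
    scripts = set()
    for c in s:
        if c.isalpha():
            name = unicodedata.name(c, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed_script(s):
    """True if the letters of s appear to come from more than one script."""
    return len(rough_scripts(s)) > 1


print(rough_scripts("Αяe"))
print(looks_mixed_script("Are"))
```

The heuristic has obvious gaps: a fullwidth or mathematical letter is reported as 'FULLWIDTH' or 'MATHEMATICAL' rather than as Latin, so it is best applied after the skeleton step or replaced with real script data.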

Note that the Unicode group does not consider "я" and "R" confusable, so for those two strings it will return False.
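
The question asked for a distance rather than a yes/no answer, and also wanted "я" to count as close to "R". One possible way to get both, sketched under assumptions: add custom pairs to the table and compare the skeletons with `difflib.SequenceMatcher`. `SAMPLE_TABLE` is a tiny hand-written stand-in for the table read from confusables.txt, and the case folding is my own choice, not something UTS#39 prescribes here.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical stand-in for the table built from confusables.txt, plus one
# hand-added pair ("я" -> "R") that the Unicode data does not contain.
SAMPLE_TABLE = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\uff4a": "j",  # FULLWIDTH LATIN SMALL LETTER J
    "\u044f": "R",  # CYRILLIC SMALL LETTER YA, custom addition
}


def skeleton(s, table=SAMPLE_TABLE):
    """Normalize, replace look-alikes by their prototypes, normalize again."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(c, c) for c in s)
    return unicodedata.normalize("NFD", s)


def distance(s1, s2, table=SAMPLE_TABLE):
    """0.0 for identical skeletons, 1.0 for skeletons with nothing in common."""
    # Case folding is an assumption so that "ARe" and "Are" compare equal.
    a = skeleton(s1, table).casefold()
    b = skeleton(s2, table).casefold()
    return 1.0 - SequenceMatcher(None, a, b).ratio()


print(distance("Foo", "Bar"))  # large distance
print(distance("Αяe", "Are"))  # small distance
```

Any other string similarity measure could take the place of `SequenceMatcher` here; the point is only that it runs on the skeletons instead of the raw strings.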

2 answers

Answer 1
5 2018-02-08 09:06:02

Answer 2
0 2018-02-08 11:01:37
