How do I find what multiple strings have in common in Python?

Question

I want to find out what several strings have in common. For example, I have these strings:

HELLO3456
helf04g
hell0r
h31l0

I want to get the similarities between these strings. In this case, for example, I would like it to tell me something like:

h is always at the start

That example is simple enough that I can work it out in my head, but with something like:

61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj
pHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM
fW9K4luEx65RscfUiPDakiqp15jiK5f6
17xz7MYEBoXLPoi8RdqbgkPwTV2T2H0y
Jvt0B5uZIDPJ5pbCqMo12CqD7pdnMSEd
n7voYT0TVVzZGVSLaQNRnnkkWgVqxA3b

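that is not so simple. I have already looked at and tried a few approaches, but none of them are what I am looking for. They give a value for how similar the strings are, whereas I need to know what it is that they have in common.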

I would like to know whether this is possible and, if so, how to do it. Thanks in advance.

Answer 1

Minimal solution

You are on the right track with the difflib library. I just picked the first two strings from your second example to create a minimal solution.

from difflib import SequenceMatcher


a = "61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj"
b = "pHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM"

matcher = SequenceMatcher(None, a, b)

# ratio() gives a quick check for whether there is any similarity at all.
print(matcher.ratio())
# Each Match(a, b, size) means a[a:a+size] == b[b:b+size].
matches = matcher.get_matching_blocks()
print(matches)

for match in matches:
    idx_a = match.a
    idx_b = match.b

    # Skip the size-0 block that always ends the list.
    if not (idx_a == len(a) or idx_b == len(b)):
        print(30*'-' + 'Found Match' + 30*'-')
        print('found at idx {} of str "a" and at idx {} of str "b" the value {}'.format(idx_a, idx_b, a[idx_a]))

Output:

0.0625
[Match(a=2, b=18, size=1), Match(a=5, b=29, size=1), Match(a=32, b=32, size=0)]
------------------------------Found Match------------------------------
found at idx 2 of str "a" and at idx 18 of str "b" the value T
------------------------------Found Match------------------------------
found at idx 5 of str "a" and at idx 29 of str "b" the value 2

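Explanation

I only use ratio() to see whether there is any similarity at all. The function get_matching_blocks() returns a list containing all matches between the two string sequences. My minimal solution does not care whether a match sits at the same position in both strings, but that should be a simple fix: compare the two indices. Even when ratio() returns 0.0, the matcher does not produce an empty list. The list always contains a final, size-0 match at the end of the sequences. To skip it, I compare the matched indices against the lengths of the sequences. Another solution is to use only matches with size > 0, like this: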

if match.size > 0:
   ...

My example also does not handle matches of size > 1. I think you will find a way to solve that ;)
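
The two loose ends mentioned above (matches longer than one character, and whether a match sits at the same position in both strings) could be handled roughly as follows. This is only a sketch, not part of the original answer: the helper name shared_blocks is made up for illustration, and it uses the size > 0 filter instead of comparing indices against the string lengths.

```python
from difflib import SequenceMatcher


def shared_blocks(a, b):
    """Return (index_in_a, index_in_b, text) for every non-empty matching block."""
    matcher = SequenceMatcher(None, a, b)
    blocks = []
    for match in matcher.get_matching_blocks():
        # The final block is a size-0 sentinel at (len(a), len(b)); skip it.
        if match.size > 0:
            # Slice out the whole block, not just its first character.
            blocks.append((match.a, match.b, a[match.a:match.a + match.size]))
    return blocks


for idx_a, idx_b, text in shared_blocks("hell0r", "helf04g"):
    where = "same position" if idx_a == idx_b else "different positions"
    print(repr(text), "at", idx_a, "in a and", idx_b, "in b:", where)
```

Note that SequenceMatcher still compares only two strings at a time, so for more than two strings this would have to be run pairwise.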

Answer 2

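I think this should be the solution you want. I added "a" at the start of each string, because otherwise the strings you mentioned have nothing in common.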

lst = [
    "A61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj",
    "apHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM",
    "afW9K4luEx65RscfUiPDakiqp15jiK5f6",
    "a17xz7MYEBoXLPoi8RdqbgkPwTV2T2H0y",
    "aJvt0B5uZIDPJ5pbCqMo12CqD7pdnMSEd",
    "an7voYT0TVVzZGVSLaQNRnnkkWgVqxA3b",
]
total_strings = len(lst)
# Only look at positions that exist in every string.
string_length = min(len(s) for s in lst)

# Lowercase everything so the comparison is case-insensitive.
for i in range(total_strings):
    lst[i] = lst[i].lower()

for i in range(string_length):
    # Use the last string's character as the reference for this position.
    lst_char = lst[total_strings-1][i]
    # Report the position only if every other string has the same character there.
    if all(lst[j][i] == lst_char for j in range(total_strings-1)):
        print(lst_char + " is always at position " + str(i))
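
For strings of different lengths, such as the first example in the question, the same position-by-position idea can be written with zip. This is a sketch under my own assumptions rather than part of the original answer: the function name common_positions and its ignore_case parameter are hypothetical.

```python
def common_positions(strings, ignore_case=True):
    """Map each position to the character that every string shares there."""
    if ignore_case:
        strings = [s.lower() for s in strings]
    shared = {}
    # zip(*strings) yields one tuple of characters per position and
    # stops at the end of the shortest string.
    for position, chars in enumerate(zip(*strings)):
        if len(set(chars)) == 1:
            shared[position] = chars[0]
    return shared


samples = ["HELLO3456", "helf04g", "hell0r", "h31l0"]
for position, char in common_positions(samples).items():
    print(char + " is always at position " + str(position))
```

Positions past the end of the shortest string are ignored, since no character can be shared by every string there.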
