
Identify Similar Strings in Python

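I have generated an edited DNA sequencing file with individual reads on separate lines, and I would like to eliminate any read that is within one character of matching another line.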

Input file:

AAAAAAAAAAAA    #Start checking at line 1
TTTTTTTTTTTT    #Diff by >1 char: Keep
AAAAACAAAAAA    #Diff by 1 char: Delete
AAAAACAAACAA    #Diff by 2 char: Keep
AAAAAAAAAAAA    #Diff by <1 char: Delete

Output file:

AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA

What I have so far:

with open(current_file, 'r') as f:
    lineCharsList = []
    outLines = []
    for line in f:
        lineChars = line[:]

        if not (lineChars in lineCharsList):    #exactly matches lines, need partial matching
            lineCharsList.append(lineChars)
            outLines.append(line)
            print(line, end='')    #line keeps its own trailing newline

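Run pip install python-levenshtein and use the function Levenshtein.hamming to compare the strings.

hamming(string1, string2) computes the Hamming distance between two strings.

The Hamming distance is simply the number of positions at which the characters differ. This means the strings must have the same length.

Examples: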

>>> hamming('Hello world!', 'Holly grail!')
7
>>> hamming('Brian', 'Jesus')
5
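
If the package is not installed, the same count can be sketched in plain Python. The helper name `hamming_distance` is made up for illustration and is not part of python-levenshtein. Raising `ValueError` for unequal lengths is a design choice of this sketch, reflecting the same-length requirement described above.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption of this sketch: unequal lengths are treated as an error.
        raise ValueError("strings must have the same length")
    # zip pairs up characters position by position; True counts as 1 in sum().
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("AAAAAAAAAAAA", "AAAAACAAAAAA"))  # 1
print(hamming_distance("AAAAAAAAAAAA", "AAAAACAAACAA"))  # 2
```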

The code that filters the reads is:

import Levenshtein

input_lines = [
    "AAAAAAAAAAAA",
    "TTTTTTTTTTTT",    # Diff by >1 char: Keep
    "AAAAACAAAAAA",    # Diff by 1 char: Delete
    "AAAAACAAACAA",    # Diff by 2 char: Keep
    "AAAAAAAAAAAA",    # Diff by <1 char: Delete
    ]
output_lines = []

for current_line in input_lines:
    for previous_line in output_lines:
        if Levenshtein.hamming(previous_line, current_line) < 2:
            break
    else:
        output_lines.append(current_line)

print('\n'.join(output_lines))

Output:

AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA
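
Since the question works with files rather than a hard-coded list, the same loop could be wrapped in a function that accepts any iterable of lines. This is only a sketch under assumptions: the name `filter_reads` and the `max_distance` parameter are hypothetical, and the mismatch count is done in plain Python so the block runs without extra packages. Like the loop above, it compares each read only against reads that were already kept, so the result depends on the order of the input.

```python
def filter_reads(lines, max_distance=1):
    """Return reads that differ from every kept read by more than max_distance."""
    kept = []
    for line in lines:
        read = line.strip()
        if not read:
            continue  # skip blank lines
        for previous in kept:
            # Assumption: reads of different length are never treated as similar.
            if len(previous) == len(read):
                mismatches = sum(x != y for x, y in zip(previous, read))
                if mismatches <= max_distance:
                    break  # too close to a read we already kept: drop it
        else:
            # The inner loop finished without break: no kept read is close.
            kept.append(read)
    return kept


sample = ["AAAAAAAAAAAA\n", "TTTTTTTTTTTT\n", "AAAAACAAAAAA\n",
          "AAAAACAAACAA\n", "AAAAAAAAAAAA\n"]
print("\n".join(filter_reads(sample)))
```

An open file handle can be passed in place of `sample`, because iterating over a file yields its lines. The returned list can then be joined with newlines and written to the output file.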

You have already got a good answer.

Here is my implementation in plain Python:

with open(current_file, 'r') as f:
    outlines = []
    for line in f:
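        # Count mismatches against each kept line separately; keep the line
        # only if it differs from every kept line in more than one position.
        if all(sum(a != b for a, b in zip(line, kept)) > 1 for kept in outlines):
            outlines.append(line)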

