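Identify Similar Strings in Python
I have generated an edited DNA sequencing file that has one read per line, and I want to eliminate every read that matches another line to within one character.
Input file: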
AAAAAAAAAAAA #Start checking at line 1
TTTTTTTTTTTT #Diff by >1 char: Keep
AAAAACAAAAAA #Diff by 1 char: Delete
AAAAACAAACAA #Diff by 2 char: Keep
AAAAAAAAAAAA #Diff by <1 char: Delete
Output file:
AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA
What I have so far:
with open(current_file, 'r') as f:
    lineCharsList = []
    outLines = []
    for line in f:
        lineChars = line[:]
        if lineChars not in lineCharsList:  # exactly matches lines, need partial matching
            lineCharsList.append(lineChars)
            outLines.append(line)
            print(line)
You can install the python-Levenshtein package:
pip install python-levenshtein
and use the function Levenshtein.hamming to compare the strings.
hamming(string1, string2)
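Compute the Hamming distance of two strings. The Hamming distance is simply the number of differing characters. That means the strings must have the same length.
Examples:
>>> hamming('Hello world!', 'Holly grail!')
7
>>> hamming('Brian', 'Jesus')
5
The full code is: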
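If installing the package is not an option, the same distance can be computed with the standard library alone. The following is a minimal sketch, not the library's implementation; the name `hamming_distance` is made up for illustration, and raising `ValueError` on unequal lengths is an assumption chosen to mirror the "same length" requirement described above.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption: reject unequal lengths, as the definition requires.
        raise ValueError("strings must have the same length")
    # zip pairs up characters position by position; True counts as 1 in sum().
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming_distance("Hello world!", "Holly grail!"))  # 7
print(hamming_distance("AAAAAAAAAAAA", "AAAAACAAAAAA"))  # 1
```

For the reads in the question, a distance below 2 means "delete" and a distance of 2 or more means "keep".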
import Levenshtein

input_lines = [
    "AAAAAAAAAAAA",
    "TTTTTTTTTTTT",  # Diff by >1 char: Keep
    "AAAAACAAAAAA",  # Diff by 1 char: Delete
    "AAAAACAAACAA",  # Diff by 2 char: Keep
    "AAAAAAAAAAAA",  # Diff by <1 char: Delete
]
output_lines = []
for current_line in input_lines:
    for previous_line in output_lines:
        if Levenshtein.hamming(previous_line, current_line) < 2:
            break  # too close to a line we already kept
    else:
        # the inner loop finished without a break: no kept line is within 1 char
        output_lines.append(current_line)
print('\n'.join(output_lines))
Output:
AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA
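You have already got a good answer.
Here is my implementation in plain Python:
with open(current_file, 'r') as f:
    outlines = []
    for line in f:
        line = line.rstrip('\n')
        # Compare against each kept line separately; zipping all kept lines
        # together would mix matches that come from different lines.
        if all(sum(a != b for a, b in zip(line, kept)) > 1 for kept in outlines):
            outlines.append(line)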
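Since the question reads from a file and wants an output file, the plain-Python approach can also be wrapped into a small function. This is only a sketch under a few assumptions: the names `within_one_mismatch` and `filter_reads` are hypothetical, the input file contains only the reads (no trailing `#` annotations), blank lines are skipped, and reads of different lengths are treated as never similar. The comparison stops as soon as a second mismatch is found, which avoids scanning the rest of a long read.

```python
def within_one_mismatch(a, b):
    """Return True if equal-length strings a and b differ in at most one position."""
    if len(a) != len(b):
        return False  # assumption: reads of different length never count as similar
    mismatches = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            mismatches += 1
            if mismatches > 1:
                return False  # second mismatch found, no need to look further
    return True


def filter_reads(in_path, out_path):
    """Copy reads from in_path to out_path, skipping near-duplicates of kept reads."""
    kept = []
    with open(in_path) as src:
        for raw in src:
            read = raw.strip()
            if not read:
                continue  # skip blank lines
            if not any(within_one_mismatch(read, prev) for prev in kept):
                kept.append(read)
    with open(out_path, "w") as dst:
        dst.writelines(read + "\n" for read in kept)
    return kept
```

Each read is compared with every read kept so far, so the running time grows with the square of the number of kept reads; for the five-line example that is negligible.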