
Searching for the presence of a substring in a larger string with Hamming distance

I have two files, file1 and file2. file1 contains all the 4-mer, 5-mer and 6-mer substrings of the full string "abcdef".

file2 contains longer strings, such as:

ddghtgabcdtttfwe

ddghtgabdatttfwe

hhttaaddsbcdeggd

etc. I want to see whether the strings in file2 have matches among the strings in file1, allowing for some mismatches (maximum Hamming distance 2). For example, ddghtgabcdtttfwe and ddghtgabcdatttfwe are hits for the substrings abcd and abcd, abcde respectively. Can you suggest a good way of doing this in Python?

Partial solution:

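def hamming(s1, s2):
    # Count the positions at which the two strings differ.
    # zip() stops at the shorter string, so pass equal-length inputs.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))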

and then:

def almostIn(s1, s2):
    n = len(s1)
    # Slide a window of length n across s2 and compare it with s1.
    for i in range(len(s2) - n + 1):
        if hamming(s1, s2[i:i+n]) <= 2:
            return True
    return False

The latter function returns True if s1 occurs in s2 with Hamming distance <= 2. It rescans the same characters a number of times, so it is probably not optimal, but it may be good enough for your intended application.
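
One way to cut down on some of the repeated work is to stop comparing a window as soon as it has more mismatches than allowed. The following is only a sketch of that idea, not a tuned implementation. The names hamming_at_most, kmers and find_hits are made up for illustration, and the two lists stand in for the contents of file1 and file2, whose on-disk format the question does not specify.

```python
def hamming_at_most(s1, s2, limit):
    """Return True if s1 and s2 differ in at most `limit` positions.

    Assumes equal-length inputs; zip() would silently truncate otherwise.
    """
    mismatches = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            mismatches += 1
            if mismatches > limit:
                return False  # give up on this window early
    return True


def kmers(s, sizes):
    """Return every substring of s whose length is one of `sizes`."""
    return [s[i:i + k] for k in sizes for i in range(len(s) - k + 1)]


def find_hits(patterns, text, limit=2):
    """Return the patterns found somewhere in text with at most `limit` mismatches."""
    hits = []
    for p in patterns:
        n = len(p)
        # Only full-length windows are compared; a text shorter than p yields none.
        if any(hamming_at_most(p, text[i:i + n], limit)
               for i in range(len(text) - n + 1)):
            hits.append(p)
    return hits


# Stand-ins for the two files (assumed: one string per line in each file).
patterns = kmers("abcdef", sizes=(4, 5, 6))
for line in ["ddghtgabcdtttfwe", "hhttaaddsbcdeggd"]:
    print(line, find_hits(patterns, line))
```

Note that with a 4-character pattern and up to 2 mismatches allowed, only half of the characters have to agree, so short patterns will match in many places.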
