Inverse of Hamming Distance

*This is a brief introduction; the specific question is in bold in the last paragraph.*

I'm trying to generate all strings within a given Hamming distance in order to solve a bioinformatics assignment efficiently.

The idea is: given a string (e.g. 'ACGTTGCATGTCGCATGATGCATGAGAGCT'), the length of the word to search for (e.g. 4) and the number of acceptable mismatches when searching for that word in the string (e.g. 1), return the most frequent words or 'mutated' words.

To be clear, a word of length 4 from the given string can be this (between '[ ]'):

[ACGT]TGCATGTCGCATGATGCATGAGAGCT #ACGT

this

A[CGTT]GCATGTCGCATGATGCATGAGAGCT #CGTT

or this

ACGTTGCATGTCGCATGATGCATGAG[AGCT] #AGCT

What I did (very inefficiently; it is really slow when the words need to have 10 characters) was generate all possible words of the given length:

itertools.imap(''.join, itertools.product('ATCG', repeat=wordSize))

and then, in a loop, compare every generated word against every word of the given string to see whether it (or a mutation of it) appears there:

wordFromString = givenString[i:i+wordSize]
mismatches = sum(ch1 != ch2 for ch1, ch2 in zip(wordFromString, generatedWord))
if mismatches <= d:
    #count that generated word in a list for future use
    #(only need the most repeated)

What I want to do, instead of generating ALL possible words, is generate just the mutations of the words that appear in the given string, with a given number of mismatches. In other words, **given a Hamming distance and a word, return all the possible mutated words with that (or less) distance**, and then use them for searching in the given string.

I hope I was clear. Thank you.

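import itertools

def mutations(word, hamming_distance, charset='ATCG'):
    # choose which positions of the word get overwritten
    for indices in itertools.combinations(range(len(word)), hamming_distance):
        # try every combination of letters at those positions; the original
        # letters are included, so smaller distances are covered as well
        for replacements in itertools.product(charset, repeat=hamming_distance):
            mutation = list(word)
            for index, replacement in zip(indices, replacements):
                mutation[index] = replacement
            yield "".join(mutation)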

This function generates all mutations of a word with a Hamming distance less than or equal to a given number. It is relatively efficient, and does not check invalid words. However, valid mutations can appear more than once. Use a set if you want every element to be unique.
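As a small sketch of the de-duplication mentioned above, the generator can be wrapped in a set. The generator is repeated here so that the snippet runs on its own, and `unique_mutations` is a name made up for this illustration.

```python
import itertools

def mutations(word, hamming_distance, charset='ATCG'):
    # Same generator as above, repeated so this snippet is self-contained.
    for indices in itertools.combinations(range(len(word)), hamming_distance):
        for replacements in itertools.product(charset, repeat=hamming_distance):
            mutation = list(word)
            for index, replacement in zip(indices, replacements):
                mutation[index] = replacement
            yield "".join(mutation)

def unique_mutations(word, hamming_distance, charset='ATCG'):
    # Hypothetical helper: collapse the repeated outputs into a set.
    return set(mutations(word, hamming_distance, charset))

raw = list(mutations('ACGT', 1))
unique = unique_mutations('ACGT', 1)
print(len(raw), len(unique))  # 16 13
```

The 16 raw results contain the original word four times (once per position); the set keeps the original plus the 4 * 3 = 12 true single-letter mutations.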

Let the given Hamming distance be D and let w be the "word" substring. From w, you can generate all words with distance ≤ D by a depth-limited depth-first search:

def dfs(w, D):
    seen = set([w])      # keep track of what we already generated
    stack = [(w, 0)]

    while stack:
        x, d = stack.pop()
        yield x

        if d == D:
            continue

        # generate all mutations at Hamming distance 1 from x, changing only
        # positions that still match w, so that d is the true distance from w
        for i in range(len(x)):
            if x[i] != w[i]:
                continue
            for c in set("ACGT") - set(x[i]):
                y = x[:i] + c + x[i+1:]
                if y not in seen:
                    seen.add(y)
                    stack.append((y, d + 1))

(This will by no means be fast, but it may serve as inspiration.)

If I understand your problem correctly, you want to identify the highest-scoring k-mers in a genome G. A k-mer's score is the number of times it appears in the genome plus the number of times any k-mer within Hamming distance m of it also appears in the genome. Note that this assumes you are only interested in k-mers that appear in your genome (as pointed out by @j_random_hacker).

You can solve this in four basic steps:

  1. Identify all k-mers in the genome G.
  2. Count the number of times each k-mer appears in G.
  3. For each pair (K1, K2) of distinct k-mers whose Hamming distance is at most m, add the original count of K2 to the score of K1 and the original count of K1 to the score of K2.
  4. Find the max k-mer and its count.

Here's example Python code:

from itertools import combinations
from collections import Counter

# Hamming distance function
def hamming_distance(s, t):
    if len(s) != len(t):
        raise ValueError("Hamming distance is undefined for sequences of different lengths.")
    return sum( s[i] != t[i] for i in range(len(s)) )

# Main function
# - k is for k-mer
# - m is max hamming distance
def hamming_kmer(genome, k, m):
    # Enumerate all k-mers
    kmers = [ genome[i:i+k] for i in range(len(genome)-k + 1) ]

    # Compute initial counts
    initcount  = Counter(kmers)
    kmer2count = dict(initcount)

    # Compare all pairs of k-mers
    for kmer1, kmer2 in combinations(set(kmers), 2):
        # Check if the hamming distance is small enough
        if hamming_distance(kmer1, kmer2) <= m:
            # Increase the count by the number of times the other
            # k-mer appeared originally
            kmer2count[kmer1] += initcount[kmer2]
            kmer2count[kmer2] += initcount[kmer1]

    return kmer2count


# Count the occurrences of each mismatched k-mer
genome = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
kmer2count = hamming_kmer(genome, 4, 1)

# Print the max k-mer and its count
print(max(kmer2count.items(), key=lambda kv: kv[1]))
# Output => ('ATGC', 5)

Here's what I think the problem you're trying to solve is: you have a "genome" of length n, and you want to find the k-mer that approximately appears most frequently in this genome, where "approximately appears" means appears with Hamming distance <= d. This k-mer need not actually appear exactly anywhere in the genome (e.g. for genome ACCA, k=3 and d=1, a best k-mer is CCC, approximately appearing twice).

If you generate all k-mers of Hamming distance <= d from some k-mer in the string and then search for each one in the string, as you seem to be currently doing, then you're adding an unnecessary O(n) factor to the search time (unless you search for them all simultaneously using the Aho-Corasick algorithm, but that's overkill here).

You can do better by going through the genome and, at each position i, generating the set of all k-mers that are at distance <= d from the k-mer starting at position i in the genome, and incrementing a counter for each one.
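A minimal sketch of that counting approach, under a few assumptions: the alphabet is ACGT, the genome has at least k characters, and the helper names `neighbors` and `most_frequent_with_mismatches` are invented for this illustration.

```python
from collections import Counter
from itertools import combinations, product

def neighbors(kmer, d, alphabet='ACGT'):
    """Return the set of all words within Hamming distance d of kmer."""
    result = {kmer}
    for dist in range(1, min(d, len(kmer)) + 1):
        for positions in combinations(range(len(kmer)), dist):
            # At each chosen position, allow only letters that differ
            # from the original, so every word is built exactly once.
            choices = [[c for c in alphabet if c != kmer[p]] for p in positions]
            for letters in product(*choices):
                word = list(kmer)
                for p, c in zip(positions, letters):
                    word[p] = c
                result.add(''.join(word))
    return result

def most_frequent_with_mismatches(genome, k, d):
    counts = Counter()
    # One pass over the genome: every window votes for all of its neighbors.
    for i in range(len(genome) - k + 1):
        counts.update(neighbors(genome[i:i + k], d))
    best = max(counts.values())
    return sorted(w for w, c in counts.items() if c == best), best

print(most_frequent_with_mismatches('ACCA', 3, 1))  # (['ACA', 'CCC'], 2)
```

For the ACCA example both CCC and ACA are within distance 1 of the two windows ACC and CCA, so they tie with a count of 2.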

def generate_permutations_close_to(initial="GACT", charset="GACT"):
    for i, c in enumerate(initial):
        for x in charset:
            yield initial[:i] + x + initial[i+1:]

will generate the words within a distance of 1 of the initial word (the output contains repeats, including the initial word itself)

to get the set of all words within a distance of 2, call it again with each of the first results as the initial word
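As a rough sketch of that repeated application, the loop below feeds each round's results back in and uses a set to drop the repeats. The names `close_to` and `within` are made up for this illustration, and the single-step generator is restated so the snippet runs on its own.

```python
def close_to(initial, charset="GACT"):
    # Words at Hamming distance <= 1 from initial (repeats included).
    for i in range(len(initial)):
        for x in charset:
            yield initial[:i] + x + initial[i + 1:]

def within(initial, distance, charset="GACT"):
    # Hypothetical helper: apply the one-step generator `distance` times,
    # collecting everything seen so far in a set.
    found = {initial}
    for _ in range(distance):
        found |= {w for word in found for w in close_to(word, charset)}
    return found

print(len(within("GACT", 1)))  # 13
print(len(within("GACT", 2)))  # 67
```

The counts match 1 + 4*3 = 13 for distance 1 and 13 + 6*9 = 67 for distance 2.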

There are correct answers here which heavily utilize Python, with its magical functions that do almost everything for you. I will try to explain things with math and algorithms, so that you can apply them in any language you want.


You have an alphabet {a_1, a_2, ..., a_a} of cardinality a; in your case it is {'A', 'C', 'G', 'T'} and the cardinality is 4. You have a string of length k and you want to generate all the strings whose Hamming distance from it is less than or equal to d.

First of all, how many of them are there? The answer does not depend on the string you select. For any given string, there are C(k, d)(a-1)^d strings at Hamming distance exactly d from it. So the total number of strings within distance d is:

$$\sum_{i=0}^{d} C(k, i)\,(a-1)^i$$

This number grows very quickly with the parameters (exponentially in d), so any algorithm that has to list all of these words will be slow for large inputs.
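To get a feel for the sizes involved, here is a small sketch that adds up the number of strings at each exact distance from 0 to d, that is C(k, i) * (a-1)^i for every i. The function name `count_within` is made up for this illustration, and `math.comb` needs Python 3.8 or later.

```python
from math import comb

def count_within(k, d, a=4):
    # Sum over exact distances i = 0..d: choose i positions out of k,
    # then pick one of the (a - 1) other letters for each of them.
    return sum(comb(k, i) * (a - 1) ** i for i in range(min(d, k) + 1))

print(count_within(4, 1))    # 13
print(count_within(10, 2))   # 436
print(count_within(10, 10))  # 1048576
```

With d equal to k the sum covers every possible string, which is why the last value is 4^10.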


So how would you derive an algorithm that generates all the strings? Notice that it is easy to generate the strings that are at most one Hamming distance away from your word. You just need to iterate over all characters in the string and, for each character, try each letter in the alphabet. As you will see, some of the resulting words are the same.

Now, to generate all the strings that are at most two Hamming distances away from your string, you can apply the same one-distance function to each word produced in the previous iteration.

So here is the pseudocode:

function generateAllHamming(str string, distance int): 
    alphabet = ['A', ...] // array of letters in your alphabet
    results = {str} // set that will store all the results; str itself is at distance 0
    prev_strings = {str} // set that holds the strings from the previous iteration
    // sets are used to get rid of duplicates

    if distance > len(str) { distance = len(str) } // a distance bigger than the length of the string gives no new strings: the result is already the set of all possible strings

    for d = 0; d < distance; d++ {
        for one_string in prev_strings {
            for pos = 0; pos < len(one_string); pos++ {
                for char in alphabet {
                    new_str = substitute_char_at_pos(one_string, char, pos)
                    add new_str to set results 
                }
            }
        }

        replace prev_strings with a copy of results
    }

    return results
}
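For readers who want something runnable, here is one possible Python rendering of the pseudocode above. It is only a sketch: the function name, the default alphabet and the use of a separate `new_strings` set per round are choices made for this illustration.

```python
def generate_all_hamming(word, distance, alphabet="ACGT"):
    # A distance larger than the word length cannot produce anything new.
    distance = min(distance, len(word))
    results = {word}        # the word itself is at distance 0
    prev_strings = {word}   # strings produced by the previous round
    for _ in range(distance):
        new_strings = set()
        for one_string in prev_strings:
            for pos in range(len(one_string)):
                for char in alphabet:
                    # Substitute one character; sets remove the duplicates.
                    new_strings.add(one_string[:pos] + char + one_string[pos + 1:])
        results |= new_strings
        prev_strings = new_strings
    return results

print(len(generate_all_hamming("ACGT", 1)))                  # 13
print(sorted(generate_all_hamming("AC", 1, alphabet="AC")))  # ['AA', 'AC', 'CC']
```

Each round only expands the strings from the previous round, but everything found so far is accumulated in `results`.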
