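Inverse of Hamming distance

Question

*This is a brief introduction; the specific question is in the last paragraph.

I am trying to generate all strings within a given Hamming distance in order to solve a bioinformatics assignment efficiently.

The idea is: given a string (e.g. 'ACGTTGCATGTCGCATGATGCATGAGAGCT'), the length of the word to search for (e.g. 4), and the number of acceptable mismatches when searching for that word in the string (e.g. 1), return the most frequent words or 'mutated' words.

To be clear, a word of length 4 from the given string can be this (between '[]'):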

[ACGT]TGCATGTCGCATGATGCATGAGAGCT #ACGT

or this

A[CGTT]GCATGTCGCATGATGCATGAGAGCT #CGTT

or this

ACGTTGCATGTCGCATGATGCATGAG[AGCT] #AGCT

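What I did (and it is very inefficient, and really slow when the words need to have 10 characters) was generate all possible words of the given length:

map(''.join, itertools.product('ATCG', repeat=wordSize))

and then, in a loop, compare each generated word against every word in the given string to see whether it (or a mutation of it) appears: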

wordFromString = givenString[i:i+wordSize]
mismatches = sum(ch1 != ch2 for ch1, ch2 in zip(wordFromString, generatedWord))
if mismatches <= d:
    #count that generated word in a list for future use
    #(only need the most repeated)
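
For reference, the two fragments above can be put together into one runnable baseline. This is only a sketch of the brute-force approach as described, written for Python 3; the function name brute_force_counts and its parameter names are hypothetical and not part of the original question.

```python
import itertools
from collections import Counter

def brute_force_counts(given_string, word_size, d, alphabet='ATCG'):
    """Count, for every possible word, the windows it matches with <= d mismatches."""
    counts = Counter()
    # Every word of length word_size that occurs in the given string.
    windows = [given_string[i:i + word_size]
               for i in range(len(given_string) - word_size + 1)]
    # Every possible word over the alphabet (len(alphabet) ** word_size of them).
    for letters in itertools.product(alphabet, repeat=word_size):
        generated_word = ''.join(letters)
        for word_from_string in windows:
            mismatches = sum(ch1 != ch2
                             for ch1, ch2 in zip(word_from_string, generated_word))
            if mismatches <= d:
                counts[generated_word] += 1
    return counts

counts = brute_force_counts('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1)
best = max(counts.values())
# Print every word that reaches the highest count, together with that count.
print(sorted(word for word, n in counts.items() if n == best), best)
```

The outer loop runs over every possible word, which is why this becomes slow as the word size grows.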

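What I want to do, instead of generating ALL possible words, is generate only the mutations of the words that appear in the given string, with the given number of mismatches. In other words, given a Hamming distance and a word, return all possible mutated words within that (or a smaller) distance, and then use them to search the given string.

I hope I was clear. Thanks.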

Answer 1

import itertools

def mutations(word, hamming_distance, charset='ATCG'):
    # choose which positions of the word to rewrite
    for indices in itertools.combinations(range(len(word)), hamming_distance):
        # choose the replacement letters (they may equal the original letters)
        for replacements in itertools.product(charset, repeat=hamming_distance):
            mutation = list(word)
            for index, replacement in zip(indices, replacements):
                mutation[index] = replacement
            yield "".join(mutation)

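This function generates all mutations of a word with a Hamming distance less than or equal to the given number. It is relatively efficient and does not check invalid words. However, valid mutations can appear more than once. Use a set if you want every element to be unique.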
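
As a small illustration of the duplicate issue, the sketch below collects the generator's output into a set. The generator is repeated so the block runs on its own; the wrapper name unique_mutations is hypothetical and only used for this example (Python 3).

```python
import itertools

def mutations(word, hamming_distance, charset='ATCG'):
    # Same generator as in the answer, repeated so this block is self-contained.
    for indices in itertools.combinations(range(len(word)), hamming_distance):
        for replacements in itertools.product(charset, repeat=hamming_distance):
            mutation = list(word)
            for index, replacement in zip(indices, replacements):
                mutation[index] = replacement
            yield "".join(mutation)

def unique_mutations(word, hamming_distance, charset='ATCG'):
    # Collecting into a set drops the repeated mutations.
    return set(mutations(word, hamming_distance, charset))

raw = list(mutations('ACGT', 1))
unique = unique_mutations('ACGT', 1)
print(len(raw), len(unique))  # 16 13
```

With distance 1 the generator yields 16 strings, but the original word is produced once per position, so only 13 of them are distinct.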

Answer 2

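Let the given Hamming distance be D, and let w be the "word" substring. From w, you can generate all words with distance ≤ D by a depth-limited depth-first search: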

def dfs(w, D):
    seen = set([w])      # keep track of what we already generated
    stack = [(w, 0)]

    while stack:
        x, d = stack.pop()
        yield x

        if d == D:
            continue

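        # generate all mutations at Hamming distance 1 from x, changing only
        # positions that still match w, so that d is the true distance from w
        # (otherwise a word can first be reached at depth D and block its children)
        for i in range(len(x)):
            if x[i] != w[i]:
                continue
            for c in set("ACGT") - set(x[i]):
                y = x[:i] + c + x[i+1:]
                if y not in seen:
                    seen.add(y)
                    stack.append((y, d + 1))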

(This will by no means be fast, but it may serve as inspiration.)

Answer 3

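If I understand your problem correctly, you want to identify the highest-scoring k-mers in a genome G. A k-mer's score is the number of times it appears in the genome plus the number of times any k-mer at Hamming distance at most m from it also appears in the genome. Note that this assumes you are only interested in k-mers that appear in the genome (as pointed out by @j_random_hacker).

You can solve this problem in four basic steps:

1. Identify all k-mers in the genome G.
2. Count the number of times each k-mer appears in G.
3. For each pair (K1, K2) of k-mers, if their Hamming distance is at most m, increase the counts of K1 and K2.
4. Find the max k-mer and its count.

Here is example Python code: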

from itertools import combinations
from collections import Counter

# Hamming distance function
def hamming_distance(s, t):
    if len(s) != len(t):
        raise ValueError("Hamming distance is undefined for sequences of different lengths.")
    return sum( s[i] != t[i] for i in range(len(s)) )

# Main function
# - k is for k-mer
# - m is max hamming distance
def hamming_kmer(genome, k, m):
    # Enumerate all k-mers
    kmers = [ genome[i:i+k] for i in range(len(genome)-k + 1) ]

    # Compute initial counts
    initcount  = Counter(kmers)
    kmer2count = dict(initcount)

    # Compare all pairs of k-mers
    for kmer1, kmer2 in combinations(set(kmers), 2):
        # Check if the hamming distance is small enough
        if hamming_distance(kmer1, kmer2) <= m:
            # Increase the count by the number of times the other
            # k-mer appeared originally
            kmer2count[kmer1] += initcount[kmer2]
            kmer2count[kmer2] += initcount[kmer1]

    return kmer2count


# Count the occurrences of each mismatched k-mer
genome = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
kmer2count = hamming_kmer(genome, 4, 1)

# Print the max k-mer and its count
print(max(kmer2count.items(), key=lambda kv: kv[1]))
# Output => ('ATGC', 5)

Answer 4

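Here is what I think the problem you are trying to solve is: you have a "genome" of length n, and you want to find the k-mer that approximately appears most often in this genome, where "approximately appears" means appears with Hamming distance <= d. This k-mer need not actually appear exactly anywhere in the genome (e.g. for the genome ACCA, k = 3 and d = 1, the best k-mer is CCC, appearing twice).

If you generate all k-mers at Hamming distance <= d from some k-mer in your string and then search for each one in the string, as you seem to be doing now, then you are adding an unnecessary O(n) factor to the search time (unless you search for them all simultaneously using the Aho-Corasick algorithm, but that is overkill here).

You can do better by going through the genome and, at each position i, generating the set of all k-mers that are at distance <= d from the k-mer starting at position i in the genome, and incrementing a counter for each one.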
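
A minimal sketch of that counting scheme, assuming Python 3 and a DNA alphabet. The names neighborhood and approximate_counts are hypothetical helpers introduced for illustration, and the neighbourhood is built with itertools rather than any method given in this answer.

```python
from collections import Counter
from itertools import combinations, product

def neighborhood(word, d, alphabet="ACGT"):
    """Return the set of all words within Hamming distance <= d of `word`."""
    d = min(d, len(word))
    result = set()
    # Pick d positions, then try every letter at each of them. Letters may
    # equal the original ones, so smaller distances are covered as well.
    for positions in combinations(range(len(word)), d):
        for letters in product(alphabet, repeat=d):
            mutated = list(word)
            for pos, letter in zip(positions, letters):
                mutated[pos] = letter
            result.add("".join(mutated))
    return result

def approximate_counts(genome, k, d, alphabet="ACGT"):
    """Count, for each k-mer, how many genome positions it matches within d."""
    counts = Counter()
    for i in range(len(genome) - k + 1):
        # Each position contributes at most once to every k-mer near it.
        counts.update(neighborhood(genome[i:i + k], d, alphabet))
    return counts

counts = approximate_counts("ACCA", 3, 1)
print(counts["CCC"])  # 2, matching the ACCA example in the answer
```

Using a set per position keeps a k-mer from being counted twice for the same position; the Counter then only holds k-mers that are close to something in the genome.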

Answer 5

def generate_permutations_close_to(initial="GACT", charset="GACT"):
    # rewrite each position in turn with every letter of the charset
    for i, c in enumerate(initial):
        for x in charset:
            yield initial[:i] + x + initial[i+1:]

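This generates the permutations at a distance of 1 (it will also yield repeats).

To get the set of everything within a distance of 2, call it again with each of the first solutions as the initial guess.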
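
One way this repeated application might look, as a sketch in Python 3. The generator is repeated so the block runs on its own; within_distance is a hypothetical helper name, and using sets to drop the repeats is an assumption of this example rather than part of the answer.

```python
def generate_permutations_close_to(initial="GACT", charset="GACT"):
    # Yield every string obtained by rewriting one position (repeats included).
    for i, c in enumerate(initial):
        for x in charset:
            yield initial[:i] + x + initial[i+1:]

def within_distance(initial, distance, charset="GACT"):
    # Hypothetical helper: apply the distance-1 step `distance` times,
    # feeding only the newly found strings into the next round.
    found = {initial}
    frontier = {initial}
    for _ in range(distance):
        frontier = {m for s in frontier
                    for m in generate_permutations_close_to(s, charset)} - found
        found |= frontier
    return found

print(len(within_distance("GACT", 1)), len(within_distance("GACT", 2)))  # 13 67
```

Expanding only the strings found in the previous round is enough, because every string at distance r + 1 is one substitution away from some string at distance r.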

Answer 6

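There are correct answers here, and they make heavy use of Python's magic features that do almost everything for you. I will try to explain things with math and algorithms so that you can apply them in any language you want.

So, you have an alphabet {a1, a2, ... a_a} of cardinality a (in your case {'A', 'C', 'G', 'T'}, with cardinality 4). You have a string of length k, and you want to generate all strings whose Hamming distance from it is less than or equal to d.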

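First of all, how many are there? The answer does not depend on the string you select. For a chosen string, there are C(k, d)(a-1)^d strings at a Hamming distance of exactly d from it, where C(k, d) is the binomial coefficient: choose the d positions that change, then one of the a-1 other letters for each of them. So the total number of strings is:

\sum_{i=0}^{d} \binom{k}{i} (a-1)^i

It grows exponentially in almost every parameter, so you will not have any kind of fast algorithm to find all the words.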
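
To get a feel for how fast this count grows, the sketch below sums C(k, i)(a-1)^i over i = 0..d for a few values. The function name count_within_distance is hypothetical, and math.comb requires Python 3.8 or newer.

```python
from math import comb

def count_within_distance(k, d, a=4):
    """Number of length-k strings within Hamming distance <= d of a fixed string."""
    d = min(d, k)  # distances beyond the string length add nothing new
    return sum(comb(k, i) * (a - 1) ** i for i in range(d + 1))

for k, d in [(4, 1), (4, 2), (10, 2), (10, 3)]:
    print(k, d, count_within_distance(k, d))
# 4 1 13
# 4 2 67
# 10 2 436
# 10 3 3676
```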

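So how would you derive an algorithm that generates all the strings? Notice that it is easy to generate the strings that are at most one Hamming distance away from your string. You just iterate over all the characters in the string and, for each character, try every letter in the alphabet. As you will see, some of the words will be the same.

Now, to generate all the strings within two Hamming distances of your string, you can apply the same function, generating the one-Hamming-distance words for each word from the previous iteration.

So here is the pseudocode: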

function generateAllHamming(str string, distance int): 
    alphabet = ['A', ...]// array of letters in your alphabet
    results = {str} // set that will store all the results; it starts with the input string itself (distance 0)
    prev_strings = {str} // set that holds the strings from the previous iteration
    // sets are used to get rid of duplicates

    if distance > len(str){ distance = len(str)} // a distance larger than the string length yields no new strings; the result is simply all possible strings

    for d = 0; d < distance; d++ {
        for one_string in prev_strings {
            for pos = 0; pos < len(one_string); pos++ {
                for char in alphabet {
                    new_str = substitute_char_at_pos(one_string, char, pos)
                    add new_str to set results 
                }
            }
        }

        populate prev_strings with elements from results
    }

    return your results
}
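
For readers who want to run it, here is one possible Python 3 rendering of the pseudocode. It is a sketch, not a reference implementation: the snake_case name generate_all_hamming and the default alphabet "ACGT" are assumptions, and the result set is seeded with the input string so that a distance of 0 returns the string itself.

```python
def generate_all_hamming(string, distance, alphabet="ACGT"):
    """Return the set of strings within Hamming distance <= distance of `string`."""
    results = {string}       # every string found so far
    prev_strings = {string}  # strings to expand in the current round
    # A distance above the string length cannot produce anything new.
    distance = min(distance, len(string))
    for _ in range(distance):
        for one_string in prev_strings:
            for pos in range(len(one_string)):
                for char in alphabet:
                    # Substitute the character at position pos.
                    new_str = one_string[:pos] + char + one_string[pos + 1:]
                    results.add(new_str)
        # As in the pseudocode, the next round expands everything found so far.
        prev_strings = set(results)
    return results

print(len(generate_all_hamming("ACGT", 1)), len(generate_all_hamming("ACGT", 2)))  # 13 67
```

Re-expanding every string found so far repeats some work; expanding only the strings added in the last round would give the same result with fewer substitutions.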

6 answers

Answer 1
18 Accepted 2013-11-12 22:54:31

Answer 2
5 2013-11-12 22:32:42

Answer 3
5 2013-11-12 22:38:24

Answer 4
2 2013-11-12 23:19:25

Answer 5
1 2013-11-12 22:31:51

Answer 6
0 2015-12-30 05:18:13
