Inverse of Hamming distance

Question

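*This is a brief introduction; the specific question is in bold in the last paragraph.

I am trying to generate all strings within a given Hamming distance, in order to solve a bioinformatics assignment efficiently.

The idea is: given a string (e.g. 'ACGTTGCATGTCGCATGATGCATGAGAGCT'), the length of the word to search for (e.g. 4), and the number of acceptable mismatches when searching for that word in the string (e.g. 1), return the most frequent words or 'mutated' words.

To be clear, a word of length 4 from the given string can be this (between '[]'):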

[ACGT]TGCATGTCGCATGATGCATGAGAGCT #ACGT

or this

A[CGTT]GCATGTCGCATGATGCATGAGAGCT #CGTT

or this

ACGTTGCATGTCGCATGATGCATGAG[AGCT] #AGCT

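What I was doing (and it is very inefficient, and really slow when the words need to have 10 characters) is generating all possible words of the given length:

map(''.join, itertools.product('ATCG', repeat=wordSize))

and then, in a loop, comparing each generated word against every word in the given string to see whether it (or a mutation of it) appears: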

wordFromString = givenString[i:i+wordSize]
mismatches = sum(ch1 != ch2 for ch1, ch2 in zip(wordFromString, generatedWord))
if mismatches <= d:
    #count that generated word in a list for future use
    #(only need the most repeated)

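What I want to do is, instead of generating ALL possible words, generate only the mutations of the words that actually appear in the given string, with the given number of mismatches. In other words, given a Hamming distance and a word, return all possible mutated words with that (or a smaller) distance, and then use them to search in the given string.

I hope I was clear. Thank you.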
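
For reference, the brute-force approach described in the question can be written out as one complete function. This is only a sketch of the slow baseline: the name `most_frequent_brute_force` and its parameters are made up for illustration, and it assumes Python 3 (built-in `map`).

```python
import itertools
from collections import Counter


def most_frequent_brute_force(given_string, word_size, d, charset="ACGT"):
    """Score every possible word by the number of windows within d mismatches.

    Hypothetical helper for illustration. It enumerates all
    len(charset) ** word_size candidates, which is what makes it slow.
    """
    windows = [given_string[i:i + word_size]
               for i in range(len(given_string) - word_size + 1)]
    counts = Counter()
    for generated_word in map("".join, itertools.product(charset, repeat=word_size)):
        for word_from_string in windows:
            mismatches = sum(ch1 != ch2
                             for ch1, ch2 in zip(word_from_string, generated_word))
            if mismatches <= d:
                counts[generated_word] += 1
    best = max(counts.values())
    # Return every word that reaches the best count, in sorted order.
    return sorted(word for word, count in counts.items() if count == best), best


words, count = most_frequent_brute_force("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1)
print(words, count)
```

The work is proportional to 4 ** word_size times the number of windows, which is why it becomes impractical around 10 characters.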

Answer 1

import itertools

def mutations(word, hamming_distance, charset='ATCG'):
    # Choose which positions may change...
    for indices in itertools.combinations(range(len(word)), hamming_distance):
        # ...and every combination of letters to put there (including the original ones).
        for replacements in itertools.product(charset, repeat=hamming_distance):
            mutation = list(word)
            for index, replacement in zip(indices, replacements):
                mutation[index] = replacement
            yield "".join(mutation)

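This function generates all mutations of a word with a Hamming distance less than or equal to a given number. It is relatively efficient and does not test invalid words. However, valid mutations can appear more than once. Use a set if you want every element to be unique.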
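
To make the remark about duplicates concrete, here is a small sketch that compares the raw output with its deduplicated set. The generator is repeated so the snippet runs on its own; the variable names `raw` and `unique` are just for illustration.

```python
import itertools


def mutations(word, hamming_distance, charset="ATCG"):
    # Same generator as in the answer, repeated so this snippet is self-contained.
    for indices in itertools.combinations(range(len(word)), hamming_distance):
        for replacements in itertools.product(charset, repeat=hamming_distance):
            mutation = list(word)
            for index, replacement in zip(indices, replacements):
                mutation[index] = replacement
            yield "".join(mutation)


raw = list(mutations("ACGT", 1))
unique = set(raw)
print(len(raw), len(unique))  # 16 candidates, 13 distinct words
```

The original word shows up once per chosen position (replacing a letter with itself), which accounts for the three extra candidates.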

Answer 2

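Let the given Hamming distance be D and let w be the "word" substring. From w, you can generate all words with distance ≤ D by a depth-limited depth-first search: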

def dfs(w, D):
    seen = set([w])      # keep track of what we already generated
    stack = [(w, 0)]     # pairs of (word, depth at which it was reached)

    while stack:
        x, d = stack.pop()
        yield x

        if d == D:
            continue     # depth limit reached, do not expand x further

        # generate all mutations at Hamming distance 1
        for i in range(len(x)):
            for c in set("ACGT") - set(x[i]):
                y = x[:i] + c + x[i+1:]
                if y not in seen:
                    seen.add(y)
                    stack.append((y, d + 1))

(This is by no means fast, but it may serve as inspiration.)

Answer 3

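If I understand your problem correctly, you want to identify the highest-scoring k-mers in a genome G. A k-mer's score is the number of times it appears in the genome plus the number of times any k-mer within Hamming distance m of it also appears in the genome. Note that this assumes you are only interested in k-mers that appear in your genome (as pointed out by @j_random_hacker).

You can solve this problem in four basic steps:

1. Identify all k-mers in the genome G.
2. Count the number of times each k-mer appears in G.
3. For each pair (K1, K2) of k-mers, increment the counts of both K1 and K2 if their Hamming distance is at most m.
4. Find the max k-mer and its count.

Here is example Python code: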

from itertools import combinations
from collections import Counter

# Hamming distance function
def hamming_distance(s, t):
    if len(s) != len(t):
        raise ValueError("Hamming distance is undefined for sequences of different lengths.")
    return sum( s[i] != t[i] for i in range(len(s)) )

# Main function
# - k is for k-mer
# - m is max hamming distance
def hamming_kmer(genome, k, m):
    # Enumerate all k-mers
    kmers = [ genome[i:i+k] for i in range(len(genome)-k + 1) ]

    # Compute initial counts
    initcount  = Counter(kmers)
    kmer2count = dict(initcount)

    # Compare all pairs of k-mers
    for kmer1, kmer2 in combinations(set(kmers), 2):
        # Check if the hamming distance is small enough
        if hamming_distance(kmer1, kmer2) <= m:
            # Increase the count by the number of times the other
            # k-mer appeared originally
            kmer2count[kmer1] += initcount[kmer2]
            kmer2count[kmer2] += initcount[kmer1]

    return kmer2count


# Count the occurrences of each mismatched k-mer
genome = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
kmer2count = hamming_kmer(genome, 4, 1)

# Print the max k-mer and its count
print(max(kmer2count.items(), key=lambda kv: kv[1]))
# Output on Python 3.7+ => ('ATGT', 5); 'ATGC' and 'GATG' tie with the same count

Answer 4

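Here is what I think the problem you want to solve is: you have a "genome" of length n, and you want to find the k-mer that approximately-appears most often in this genome, where "approximately-appears" means appears with Hamming distance <= d. This k-mer need not actually appear exactly anywhere in the genome (e.g. for the genome ACCA, k = 3 and d = 1, a best k-mer is CCC, appearing twice).

If you generate all k-mers of Hamming distance <= d from some k-mer in the string and then search for each one in the string, as you seem to be doing currently, then you are adding an unnecessary O(n) factor to the search time (unless you search for them all simultaneously using the Aho-Corasick algorithm, but that is overkill here).

You can do better by going through the genome and, at each position i, generating the set of all k-mers that are at distance <= d from the k-mer starting at position i in the genome, and incrementing a counter for each one.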
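
A minimal sketch of that counting scheme, assuming a DNA alphabet and substitutions only. The names `neighbors` and `approximate_counts` are hypothetical, and the neighbor generation is just one possible way to do it:

```python
import itertools
from collections import Counter


def neighbors(kmer, d, charset="ACGT"):
    """Return the set of all words within Hamming distance <= d of kmer."""
    d = min(d, len(kmer))
    result = set()
    # Pick d positions and try every letter there; putting the original
    # letter back covers the smaller distances as well.
    for indices in itertools.combinations(range(len(kmer)), d):
        for replacements in itertools.product(charset, repeat=d):
            mutated = list(kmer)
            for index, replacement in zip(indices, replacements):
                mutated[index] = replacement
            result.add("".join(mutated))
    return result


def approximate_counts(genome, k, d, charset="ACGT"):
    """For every candidate word, count the genome positions it approximately matches."""
    counts = Counter()
    for i in range(len(genome) - k + 1):
        for candidate in neighbors(genome[i:i + k], d, charset):
            counts[candidate] += 1
    return counts


counts = approximate_counts("ACCA", 3, 1)
best = max(counts.values())
print(sorted(w for w, c in counts.items() if c == best), best)  # ['ACA', 'CCC'] 2
```

Each genome position contributes once to every word in its neighborhood, so no separate search pass over the genome is needed.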

Answer 5

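def generate_permutations_close_to(initial="GACT", charset="GACT"):
    for i, c in enumerate(initial):
        for x in charset:
            yield initial[:i] + x + initial[i+1:]

This will generate the permutations with a dist of 1 (it will also contain duplicates).

To get the set of all within 2, call it again with each of the first solutions as the initial guess.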
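
As a sketch of that second step, the generator can be applied once more to every word from the first pass, with sets absorbing the repeats. The shorter name `generate_close_to` is made up here, and the generator is repeated so the snippet runs on its own.

```python
def generate_close_to(initial="GACT", charset="GACT"):
    # Every single-position substitution, including substituting a letter for itself.
    for i in range(len(initial)):
        for x in charset:
            yield initial[:i] + x + initial[i + 1:]


within_one = set(generate_close_to("GACT"))
within_two = {second
              for first in within_one
              for second in generate_close_to(first)}
print(len(within_one), len(within_two))  # 13 and 67 distinct words
```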

Answer 6

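There are correct answers here, and they make heavy use of Python's magical features that do almost everything for you. I will try to explain things with math and algorithms so that you can apply it to any language you want.

So, you have an alphabet {a1, a2, ... a_a} of cardinality a (in your case {'A', 'C', 'G', 'T'}, with cardinality 4). You have a string of length k and you want to generate all strings whose Hamming distance to it is less than or equal to d.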

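First of all, how many of them are there? The answer does not depend on the string you select. Whatever string you pick, there are C(k, d)(a-1)^d strings at Hamming distance exactly d from it, where C(k, d) is the binomial coefficient: choose the d positions that differ, then one of the a-1 other letters for each. So the total number of strings is:

    \sum_{i=0}^{d} C(k, i) (a-1)^i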

It grows almost exponentially in nearly every parameter, so you will not have any kind of fast algorithm to find all the words.
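
To get a feel for the growth, the count can be evaluated directly as the sum of C(k, i)(a-1)^i for i from 0 to d. This is a small sketch; the function name `count_within` is made up, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def count_within(k, d, a=4):
    """Number of length-k strings over an a-letter alphabet within distance <= d."""
    d = min(d, k)  # a distance beyond the string length adds nothing new
    return sum(comb(k, i) * (a - 1) ** i for i in range(d + 1))


for k, d in [(4, 1), (4, 2), (10, 2), (10, 3)]:
    print(k, d, count_within(k, d))  # 13, 67, 436 and 3676 respectively
```

With d equal to k the sum collapses to a^k, the number of all strings of that length.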

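So how would you derive an algorithm that generates all the strings? Notice that it is easy to generate the strings that are at most one Hamming distance away from your word. All you need to do is iterate over all the characters in the string and, for each character, try every letter in the alphabet. As you will see, some of the words will be the same.

Now, to generate all the strings that are two Hamming distances away from your string, you can apply the same function, generating the Hamming-distance-one words for each word from the previous iteration.

So here is pseudocode: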

function generateAllHamming(str string, distance int) {
    alphabet = ['A', ...] // array of letters in your alphabet
    results = {str} // set that will store all the results (distance 0 is the string itself)
    prev_strings = {str} // set that holds the strings from the previous iteration
    // sets are used to get rid of duplicates

    // A distance bigger than the length of the string gives no new strings:
    // the result is already the set of all possible strings.
    if distance > len(str) { distance = len(str) }

    for d = 0; d < distance; d++ {
        for one_string in prev_strings {
            for pos = 0; pos < len(one_string); pos++ {
                for char in alphabet {
                    new_str = substitute_char_at_pos(one_string, char, pos)
                    add new_str to set results
                }
            }
        }

        populate prev_strings with elements from results
    }

    return results
}
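
For readers who want something runnable, here is one possible Python rendering of the pseudocode. It is a sketch, not the answer author's code: the function name, the default alphabet, and the choice to expand only the strings produced in the previous round are assumptions.

```python
def generate_all_hamming(string, distance, alphabet="ACGT"):
    """Return the set of strings within the given Hamming distance of string."""
    results = {string}
    prev_strings = {string}
    distance = min(distance, len(string))  # larger distances add nothing new

    for _ in range(distance):
        new_strings = set()
        for one_string in prev_strings:
            for pos in range(len(one_string)):
                for char in alphabet:
                    # Substitute one character; the set absorbs duplicates.
                    new_strings.add(one_string[:pos] + char + one_string[pos + 1:])
        results |= new_strings
        prev_strings = new_strings

    return results


print(len(generate_all_hamming("ACGT", 1)), len(generate_all_hamming("ACGT", 2)))  # 13 67
```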

6 answers

Answer 1: 18 (accepted), 2013-11-12 22:54:31
Answer 2: 5, 2013-11-12 22:32:42
Answer 3: 5, 2013-11-12 22:38:24
Answer 4: 2, 2013-11-12 23:19:25
Answer 5: 1, 2013-11-12 22:31:51
Answer 6: 0, 2015-12-30 05:18:13
