
Efficiency of finding mismatched patterns

I'm working on a simple bioinformatics problem. I have a working solution, but it is absurdly inefficient. How can I increase my efficiency?


Problem:

Find patterns of length k in the string g, given that each k-mer can have up to d mismatches.

And these strings and patterns are all genomic, so our set of possible characters is {A, T, C, G}.

I'll call the function FrequentWordsMismatch(g, k, d).

So, here are a few helpful examples:

FrequentWordsMismatch('AAAAAAAAAA', 2, 1) → ['AA', 'CA', 'GA', 'TA', 'AC', 'AG', 'AT']

Here's a much longer example, if you implement this and want to test:

FrequentWordsMismatch('CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC', 10, 2) → ['GCACACAGAC', 'GCGCACACAC']

With my naive solution, that second example could easily take ~60 seconds, though the first one is pretty quick.


Naive solution:

My idea was, for every k-length segment in g, to find every possible "neighbor" (i.e., other k-length segments with up to d mismatches) and add those neighbors as keys to a dictionary. I then count how many times each of those neighbor k-mers shows up in the string g, and record those counts in the dictionary.

Obviously that's a kinda shitty way to do it, since the number of neighbors scales like crazy as k and d increase, and having to scan through the string with each of those neighbors makes this implementation terribly slow. But alas, that's why I'm asking for help.
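To put a number on that scaling: a k-mer has C(k, i) * 3**i neighbors at exactly i mismatches, summed over i ≤ d. A quick sketch (num_neighbors is just an illustrative helper, not part of the solution):

```python
from math import comb

def num_neighbors(k, d):
    # A k-mer has comb(k, i) * 3**i neighbors at exactly i mismatches:
    # choose i positions to change, 3 alternative letters at each.
    return sum(comb(k, i) * 3**i for i in range(d + 1))

print(num_neighbors(2, 1))   # 7, matching the 'AA' example above
print(num_neighbors(10, 2))  # 436 neighbors per 10-mer when d=2
```

So the long example above scans the string once per each of ~436 neighbors of each of its 10-mers, which is where the ~60 seconds goes.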

I'll put my code below. There are definitely a lot of novice mistakes to unpack, so thanks for your time and attention.

def FrequentWordsMismatch(g, k, d):
    '''
    Finds the most frequent k-mer patterns in the string g, given that those 
    patterns can mismatch amongst themselves up to d times

    g (String): Collection of {A, T, C, G} characters
    k (int): Length of desired pattern
    d (int): Number of allowed mismatches
    '''
    counts = {}
    answer = []

    for i in range(len(g) - k + 1):
        kmer = g[i:i+k]
        for neighborkmer in Neighbors(kmer, d):
            counts[neighborkmer] = Count(neighborkmer, g, d)

    maxVal = max(counts.values())

    for key in counts.keys():
        if counts[key] == maxVal:
            answer.append(key)

    return(answer)


def Neighbors(pattern, d):
    '''
    Find all strings with at most d mismatches to the given pattern

    pattern (String): Original pattern of characters
    d (int): Number of allowed mismatches
    '''
    if d == 0:
        return [pattern]

    if len(pattern) == 1:
        return ['A', 'C', 'G', 'T']

    answer = []

    suffixNeighbors = Neighbors(pattern[1:], d)

    for text in suffixNeighbors:
        if HammingDistance(pattern[1:], text) < d:
            for n in ['A', 'C', 'G', 'T']:
                answer.append(n + text)
        else:
            answer.append(pattern[0] + text)

    return(answer)


def HammingDistance(p, q):
    '''
    Find the hamming distance between two strings

    p (String): String to be compared to q
    q (String): String to be compared to p
    '''
    ham = 0 + abs(len(p)-len(q))

    for i in range(min(len(p), len(q))):
        if p[i] != q[i]:
            ham += 1

    return(ham)


def Count(pattern, g, d):
    '''
    Count the number of times that the pattern occurs in the string g, 
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    return len(MatchWithMismatch(pattern, g, d))

def MatchWithMismatch(pattern, g, d):
    '''
    Find the indicies at which the pattern occurs in the string g, 
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    answer = []
    for i in range(len(g) - len(pattern) + 1):
        if(HammingDistance(g[i:i+len(pattern)], pattern) <= d):
            answer.append(i)
    return(answer)

More tests

FrequentWordsMismatch('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1) → ['ATGC', 'ATGT', 'GATG']

FrequentWordsMismatch('AGTCAGTC', 4, 2) → ['TCTC', 'CGGC', 'AAGC', 'TGTG', 'GGCC', 'AGGT', 'ATCC', 'ACTG', 'ACAC', 'AGAG', 'ATTA', 'TGAC', 'AATT', 'CGTT', 'GTTC', 'GGTA', 'AGCA', 'CATC']

FrequentWordsMismatch('AATTAATTGGTAGGTAGGTA', 4, 0) → ["GGTA"]

FrequentWordsMismatch('ATA', 3, 1) → ['GTA', 'ACA', 'AAA', 'ATC', 'ATA', 'AGA', 'ATT', 'CTA', 'TTA', 'ATG']

FrequentWordsMismatch('AAT', 3, 0) → ['AAT']

FrequentWordsMismatch('TAGCG', 2, 1)  → ['GG', 'TG']
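If you want to sanity-check an implementation against the tests above, a brute-force reference that scores every possible k-mer over {A, C, G, T} works for small k (there are 4**k candidates). A sketch, with brute_force_fwm as a hypothetical name:

```python
from itertools import product

def brute_force_fwm(g, k, d):
    # Exhaustive reference: score every possible k-mer by how many
    # substrings of g it matches with at most d mismatches.
    # Only practical for small k, since there are 4**k candidates.
    def count(pattern):
        return sum(
            sum(a != b for a, b in zip(pattern, g[i:i + k])) <= d
            for i in range(len(g) - k + 1)
        )
    scores = {''.join(p): count(''.join(p)) for p in product('ACGT', repeat=k)}
    best = max(scores.values())
    return sorted(p for p, c in scores.items() if c == best)

print(brute_force_fwm('AAT', 3, 0))  # ['AAT']
```

Sort both sides before comparing, since the expected outputs above are in arbitrary order.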

Going on your problem description alone and not your examples (for the reasons I explained in the comment), one approach would be:

s = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGC"\
    "CGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGG"\
    "CCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCG"\
    "GTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACAC"\
    "ACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"

def frequent_words_mismatch(g,k,d):
    def num_misspellings(x,y):
        return sum(xx != yy for (xx,yy) in zip(x,y))

    seen = set()
    for i in range(len(g)-k+1):
        seen.add(g[i:i+k])

    # For each unique sequence, add a (key,bin) pair to the bins dictionary
    #  (The bin is initialized to a list containing only the sequence, for now)
    bins = {seq:[seq,] for seq in seen}
    # Loop again through the unique sequences...
    for seq in seen:
        # Try to fit it in *all* already-existing bins (based on bin key)
        for bk in bins:
        # Don't re-add seq to its own bin
            if bk == seq: continue
            # Test bin keys, try to find all appropriate bins
            if num_misspellings(seq, bk) <= d:
                bins[bk].append(seq)

    # Get a list of the bin keys (one for each unique sequence) sorted in order of the
    #   number of elements in the corresponding bins
    sorted_keys = sorted(bins, key=lambda k: len(bins[k]), reverse=True)

    # largest_bin_key will be the key of the largest bin (there may be ties, so in fact
    #   this is *a* key of *one of the bins with the largest length*).  That is, it'll
    #   be the sequence (found in the string) that the most other sequences (also found
    #   in the string) are at most d-distance from.
    largest_bin_key = sorted_keys[0]

    # You can return this bin, as your question description (but not examples) indicate:
    return bins[largest_bin_key]

largest_bin = frequent_words_mismatch(s,10,2)
print(len(largest_bin))     # 13
print(largest_bin)

This largest bin contains:

['CGGCCGCCGG', 'GGGCCGGCGG', 'CGGCCGGCGC', 'AGGCGGCCGG', 'CAGGCGCCGG',
 'CGGCCGGCCG', 'CGGTAGCCGG', 'CGGCGGCCGC', 'CGGGCGCCGG', 'CCGGCGCCGG',
 'CGGGCCCCGG', 'CCGCCGGCGG', 'GGGCCGCCGG']

It's O(n**2), where n is the number of unique sequences, and it completes on my computer in around 0.1 seconds.

The problem description is ambiguous in several ways, so I'm going by the examples. You seem to want all k-length strings from the alphabet {A, C, G, T} such that the number of matches to contiguous substrings of g is maximal, where "a match" means character-by-character equality with at most d character inequalities.

I'm ignoring that your HammingDistance() function makes something up even when the inputs have different lengths, mostly because that doesn't make much sense to me ;-), but partly because it isn't needed to get the results you want in any of the examples you gave.
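Concretely, that function amounts to counting positional mismatches over the shared prefix and then adding the length difference on top. A compact equivalent, for reference:

```python
def hamming(p, q):
    # Equivalent to the question's HammingDistance: mismatches over the
    # zipped (shared-length) prefix, plus the length difference.
    return abs(len(p) - len(q)) + sum(a != b for a, b in zip(p, q))

print(hamming('ACGT', 'AGGT'))  # 1
print(hamming('ACGT', 'AC'))    # 2: the two missing characters count as mismatches
```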

The code below produces the results you want in all the examples, in the sense of producing permutations of the output lists you gave. If you want canonical outputs, I'd suggest sorting an output list before returning it.

The algorithm is pretty simple, but relies on itertools to do the heavy combinatorial lifting "at C speed". All the examples run in well under a second total.

For each length-k contiguous substring of g, consider all combinations(k, d) sets of d distinct index positions. There are 4**d ways to fill those index positions with letters from {A, C, G, T}, and each such way is "a pattern" that matches the substring with at most d discrepancies. Duplicates are weeded out by remembering the patterns already generated; this is faster than making heroic efforts to generate only unique patterns to begin with.

So, in all, the time requirement is O(len(g) * k**d * 4**d) = O(len(g) * (4*k)**d), where k**d is, for reasonably small values of k and d, an overstated stand-in for the binomial coefficient combinations(k, d). The important thing to note is that, unsurprisingly, it's exponential in d.

def fwm(g, k, d):
    from itertools import product, combinations
    from collections import defaultdict

    all_subs = list(product("ACGT", repeat=d))
    all_ixs = list(combinations(range(k), d))
    patcount = defaultdict(int)

    for starti in range(len(g)):
        base = g[starti : starti + k]
        if len(base) < k:
            break
        patcount[base] += 1
        seen = set([base])
        basea = list(base)
        for ixs in all_ixs:
            saved = [basea[i] for i in ixs]
            for newchars in all_subs:
                for i, newchar in zip(ixs, newchars):
                    basea[i] = newchar
                candidate = "".join(basea)
                if candidate not in seen:
                    seen.add(candidate)
                    patcount[candidate] += 1
            for i, ch in zip(ixs, saved):
                basea[i] = ch

    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]

EDIT: Generating Patterns Uniquely

Rather than weeding out duplicates by keeping a set of those seen so far, it's straightforward enough to prevent generating duplicates to begin with. In fact, the following code is shorter and simpler, although somewhat subtler. In return for less redundant work, there are layers of recursive calls to the inner() function. Which way is faster appears to depend on the specific inputs.

def fwm(g, k, d):
    from collections import defaultdict

    patcount = defaultdict(int)
    alphabet = "ACGT"
    allbut = {ch: tuple(c for c in alphabet if c != ch)
              for ch in alphabet}

    def inner(i, rd):
        if not rd or i == k:
            patcount["".join(base)] += 1
            return
        inner(i+1, rd)
        orig = base[i]
        for base[i] in allbut[orig]:
            inner(i+1, rd-1)
        base[i] = orig

    for i in range(len(g) - k + 1):
        base = list(g[i : i + k])
        inner(0, d)

    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]
