Nested loops to find all possible combinations in Python

Hi all, I have a bioinformatics problem I could do with some help on. It's quite long, but I'll try to break it down into smaller sections. Any help would be wonderful.

I have an RNA sequence of length n, made up of the four letters A, U, C and G, that is imported into Python as a string and can fold to make a loop. A loop is made by matching pairs of letters from the sequence, so that A pairs with U, C pairs with G, and G pairs with U, and the string folds back on itself.

The catch is that there must be three or more adjacent letters forming pairs in a row, and there must also be a gap of at least three letters between the paired sections.

I tried to post a picture, but I don't have enough reputation points :(

In the journal article I'm referencing, the author talks about a nested loop method that finds all possible combinations where this can happen and then stores them in a group to be called upon later.

My problem is writing the nested loops, as I'm new to programming and Python, as well as storing the sequences in a way that makes it possible to identify the pairs and possibly add them together.

Again, any help would be great, and if anything is unclear, please let me know.

Edit:

An example would be seq='aggcuugaguuu', where one of the outputs showed the pairing of seq[0:2] with seq[9:11], meaning the sequence folds into a U-shape.

If you imagine the string as a physical piece of string, hold it at three points, hold it at three other points, and then touch the two sets of points together, the string forms a loop. I'm looking to identify the six points used.

I'm not looking for code to be written for me; I'd just like to know a method for composing the code.

I tried a method where seq1 is the input sequence and seq2 is the reversed input sequence, and moved seq2 along seq1 looking for three neighbouring pairs, but this didn't give me the correct output.

Have you considered using product from itertools? You can then iterate over the results and keep only the ones you want.
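A minimal sketch of that idea, assuming the pairing rules from the question (A-U, C-G and G-U) and a gap of at least three bases between the two positions. The names can_pair and candidate_pairs are made up for this illustration. It only lists single candidate pairs; runs of three or more adjacent pairs still have to be picked out of the result.

```python
from itertools import product


def can_pair(x, y):
    # Pairing rules taken from the question: A-U, C-G and G-U.
    return {x, y} in ({"A", "U"}, {"C", "G"}, {"G", "U"})


def candidate_pairs(seq, min_gap=3):
    # Hypothetical helper: every index pair (i, j) with at least
    # min_gap bases strictly between them whose letters can pair.
    seq = seq.upper()
    n = len(seq)
    return [(i, j)
            for i, j in product(range(n), repeat=2)
            if j - i > min_gap and can_pair(seq[i], seq[j])]


pairs = candidate_pairs("aggcuugaguuu")
# The three pairs from the example in the question are among them.
print((0, 11) in pairs, (1, 10) in pairs, (2, 9) in pairs)  # True True True
```

Since product visits all n * n index pairs, this is only a first filtering step, not a complete solution.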

If your RNA isn't terribly long (a thousand bases is probably OK; hundreds of thousands definitely not), you can get away with a simple O(n^3) algorithm. O(n^3) means that the execution time is, at worst, proportional to the cube of the number of bases. The author's mention of nested loops hints heavily at this simple but rather slow method.

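def find_loops(rna, min_pairs=3, min_loop=3):
    # Returns a list of (start_pos, pair_count, loop_length) tuples.
    n = len(rna)
    result = []
    # loop_start is the index of the first unpaired base of the loop.
    for loop_start in range(min_pairs, n - min_pairs - min_loop + 1):
        # loop_end is the index just past the loop. The bound includes
        # n - min_pairs so that a stem may end at the very last base.
        for loop_end in range(loop_start + min_loop, n - min_pairs + 1):
            # Skip loops whose ends pair and could be zipped tighter.
            if (loop_end - loop_start < min_loop + 2 or
                    not base_pair(rna[loop_start], rna[loop_end - 1])):
                max_pairs = min(loop_start, n - loop_end)
                # Walk outward from the loop while the bases still pair.
                for k in range(max_pairs):
                    if not base_pair(rna[loop_start - k - 1], rna[loop_end + k]):
                        break
                else:
                    # No mismatch found: every available base pairs.
                    k = max_pairs
                if k >= min_pairs:
                    result.append((loop_start - k, k, loop_end - loop_start))
    return result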

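def base_pair(x, y):
    # True for Watson-Crick pairs only (A-U and C-G), uppercase letters;
    # G-U pairs and lowercase input are not recognised.
    return (x == 'A' and y == 'U' or
            x == 'C' and y == 'G' or
            x == 'G' and y == 'C' or
            x == 'U' and y == 'A')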

This iterates over all possible beginnings and ends of the RNA loop, and then walks away from the ends of the potential loop, in both directions, as long as the bases still pair. When it reaches a pair of mismatched bases, it stops and checks that it has found at least the minimum number of pairs. If it has, it adds the loop to the list of results.

The first if is there to avoid listing loops that could be "zipped" even tighter. As the condition reads, a loop cannot be zipped tighter if it is either too short (fewer than five bases) or its ends do not pair.

The result is a list of tuples, one for each possible loop, of the form (start_pos, pair_count, loop_length). That means that a sequence of pair_count bases, starting from base number start_pos, is followed by a loop of loop_length bases, followed by the complementary sequence in reverse. The antisense copy of the sequence starts at base start_pos + pair_count + loop_length. The first base is number 0, not 1 (we're programmers here).
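To make the tuple layout concrete, here is a small sketch that cuts the three parts out of the string with slices. The helper name split_hairpin is made up for this illustration and is not part of the function above.

```python
def split_hairpin(rna, loop):
    # Hypothetical helper: split one (start_pos, pair_count, loop_length)
    # tuple into the paired stretch, the loop, and the antisense copy.
    start_pos, pair_count, loop_length = loop
    loop_begin = start_pos + pair_count        # first base of the loop
    antisense = loop_begin + loop_length       # first base after the loop
    return (rna[start_pos:loop_begin],
            rna[loop_begin:antisense],
            rna[antisense:antisense + pair_count])


rna = 'GGGGAUUACAGCGUGUAAUCAAUA'
print(split_hairpin(rna, (3, 7, 3)))   # ('GAUUACA', 'GCG', 'UGUAAUC')
print(split_hairpin(rna, (4, 3, 13)))  # ('AUU', 'ACAGCGUGUAAUC', 'AAU')
```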

An example might make this clearer: print(find_loops('GGGGAUUACAGCGUGUAAUCAAUA')) prints [(4, 3, 13), (3, 7, 3)], that is, it finds two loops:

  • At position 4, three bases, AUU, enclose a loop of 13 bases, and bind to the AAU at position 20;
  • At position 3, seven bases, GAUUACA, enclose a loop of three bases, and bind to the UGUAAUC at position 13.
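One way to see which positions pair up is to draw each result in dot-bracket notation, a common text format for RNA secondary structure in which "(" and ")" mark paired bases and "." marks unpaired ones. This is a sketch added for illustration; the helper name dot_bracket is an assumption, not something the function above provides.

```python
def dot_bracket(n, loop):
    # Hypothetical helper: render one (start_pos, pair_count, loop_length)
    # tuple as a dot-bracket string for a sequence of n bases.
    start_pos, pair_count, loop_length = loop
    antisense = start_pos + pair_count + loop_length
    marks = ["."] * n
    for k in range(pair_count):
        marks[start_pos + k] = "("   # 5' side of the stem
        marks[antisense + k] = ")"   # 3' side of the stem
    return "".join(marks)


rna = 'GGGGAUUACAGCGUGUAAUCAAUA'
print(rna)
print(dot_bracket(len(rna), (4, 3, 13)))  # ....(((.............))).
print(dot_bracket(len(rna), (3, 7, 3)))   # ...(((((((...)))))))....
```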

Without the first if, the function would also return loops like (3, 6, 5) (i.e. GAUUAC at position 3 encloses a loop of five bases and binds to the GUAAUC at position 14), which is the same loop as (3, 7, 3) above, only not zipped as tightly as it could be.

Hope this helps. If you need a faster algorithm, I'm sure there's a dynamic programming solution that works with longer strings. Let me know and I'll think about it. It won't be nearly as easy to understand, though...
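For what it's worth, one possible shape of such a dynamic programming approach is sketched below. It is an assumption about what could work, not a solution given in this answer: a table first records, for every pair of positions, how many consecutive pairs run outward from it, so the innermost walk becomes a lookup and the total work drops to O(n^2). The name find_loops_dp and the table layout are made up for this sketch, and the table still needs O(n^2) memory, so it does not help with hundreds of thousands of bases.

```python
def base_pair(x, y):
    # Watson-Crick pairs only, uppercase letters.
    return (x, y) in {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def find_loops_dp(rna, min_pairs=3, min_loop=3):
    n = len(rna)
    # stem[i + 1][j] = number of consecutive pairs (i, j), (i-1, j+1), ...
    # Row 0 and column n stay 0 and act as padding for the lookups.
    stem = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, n):
            if base_pair(rna[i], rna[j]):
                stem[i + 1][j] = stem[i][j + 1] + 1

    result = []
    for loop_start in range(min_pairs, n - min_pairs - min_loop + 1):
        for loop_end in range(loop_start + min_loop, n - min_pairs + 1):
            # Same "cannot be zipped tighter" test as in find_loops.
            if (loop_end - loop_start < min_loop + 2 or
                    not base_pair(rna[loop_start], rna[loop_end - 1])):
                # Run of pairs starting at (loop_start - 1, loop_end).
                k = stem[loop_start][loop_end]
                if k >= min_pairs:
                    result.append((loop_start - k, k, loop_end - loop_start))
    return result


print(find_loops_dp('GGGGAUUACAGCGUGUAAUCAAUA'))  # [(4, 3, 13), (3, 7, 3)]
```

The output tuples have the same (start_pos, pair_count, loop_length) meaning as before.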
