Python regex: how to repeat a repetition of a pattern?

Question

I am working with a long strand of DNA nucleotides and am looking for sequences that begin with the start code 'AAA' and end with the stop code 'CCC'. Since nucleotides come in triplets, the number of nucleotides between the start and end of every sequence I find must be a multiple of three.

For example, 'AAAGGGCCC' is a valid sequence but 'AAAGCCC' is not.

In addition, for every stop code, I want the longest strand I can find within a particular reading frame.

For example, if the DNA were 'AAAGGGAAACCC', then both 'AAAGGGAAACCC' and 'AAACCC' would technically be valid, but since they share the same instance of the stop code, I only want the longest strand, 'AAAGGGAAACCC'. Also, if my strand were 'AAAAGGCCCCC', I must return 'AAAAGGCCC' AND 'AAAGGCCCC' because they are in different reading frames (one starts at an offset of 0 mod 3, the other at 1 mod 3).

While I think I have the code to search for strings that fulfill the multiple-of-3 requirement and don't overlap, I am not sure how to implement the second criterion of keeping the same reading frame. My code below would just return the longest strings that don't overlap, but it does not distinguish between reading frames, so in the above example it would catch 'AAAAGGCCC' but not 'AAAGGCCCC':

# 'dna' (assumed name) is the input string; requires "import math, re"
match = re.finditer(r"AAA(?:\w{3}){%d,}CCC" % math.ceil((minNucleotide - 6) / 3), dna)

Sorry for being long-winded and thank you for taking a look!

Answer 1

Use a positive lookahead assertion. This lets the regex be reapplied at each character in the string, which makes it possible to find all overlapping matches, because a lookahead assertion doesn't consume any characters the way a normal match would. Since you still need to match some actual text, you can use a capturing group for that.

Since re.findall() returns the contents of the capturing groups instead of the full regex matches (which would all be ''), you can use:

>>> import re
>>> re.findall(r"(?=(AAA(?:\w{3})*?CCC))", "AAAAGGCCCC")
['AAAAGGCCC', 'AAAGGCCCC']
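
To see what the lookahead buys you, here is a small comparison, a sketch using the same input string. The variable names dna, consuming and overlapping are only for illustration. Without the lookahead, the first match consumes the characters that the second match would need:

```python
import re

dna = "AAAAGGCCCC"

# Ordinary match: characters are consumed, so the scan resumes after the
# end of the first match and the overlapping candidate is never tried.
consuming = re.findall(r"AAA(?:\w{3})*?CCC", dna)

# Lookahead with a capturing group: nothing is consumed, so the pattern is
# retried at every position in the string.
overlapping = re.findall(r"(?=(AAA(?:\w{3})*?CCC))", dna)

print(consuming)    # ['AAAAGGCCC']
print(overlapping)  # ['AAAAGGCCC', 'AAAGGCCCC']
```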

As a commented Python function:

def find_overlapping(sequence):
    return re.findall(
    """(?=        # Assert that the following regex could be matched here:
     (            # Start of capturing group number 1.
      AAA         # Match AAA.
      (?:         # Start of non-capturing group, matching...
       [AGCT]{3}  # a DNA triplet
      )*?         # repeated any number of times, as few as possible.
      CCC         # Match CCC.
     )            # End of capturing group number 1. 
    )             # End of lookahead assertion.""", 
    sequence, re.VERBOSE)
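
Note that for 'AAAGGGAAACCC' the function above returns both 'AAAGGGAAACCC' and 'AAACCC', since both end at the same stop code. If you only want the longest strand per stop code, one possible follow-up is to group the matches by where they end and keep the earliest start. This is a rough sketch, not the only way to do it; longest_per_stop and CANDIDATE are names made up for this example. Two matches that end at the same position are automatically in the same reading frame, because each match length is a multiple of three.

```python
import re

# Same idea as above: the lookahead keeps overlapping candidates, and the lazy
# triplet group stops at the first in-frame CCC after each AAA.
CANDIDATE = re.compile(r"(?=(AAA(?:[AGCT]{3})*?CCC))")

def longest_per_stop(sequence):
    """Return, for each stop code reached, the longest strand ending there."""
    best = {}  # end index of the stop code -> earliest start index seen
    for m in CANDIDATE.finditer(sequence):
        start, end = m.span(1)
        if end not in best or start < best[end]:
            best[end] = start
    # Report the surviving strands in order of their start position.
    return [sequence[s:e] for e, s in sorted(best.items(), key=lambda kv: kv[1])]

print(longest_per_stop("AAAGGGAAACCC"))  # ['AAAGGGAAACCC']
print(longest_per_stop("AAAAGGCCCCC"))   # ['AAAAGGCCC', 'AAAGGCCCC']
```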

Answer 2

The simplest pattern that comes to mind is:

'AAA(\w{3})*CCC'
            ^^^ stop code
           ^ zero or more of…
    ^     ^ a group of…
     ^^^^^ three characters
 ^^^ start code

If you have additional requirements on the number of three-character groups, like "at least two such groups", you can now easily replace the star in the regular expression with the quantifier you need.
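
For instance, "at least two groups" could be written with a {2,} quantifier. A quick sketch, where any_groups and two_or_more are just illustrative names and re.fullmatch is used so the whole string has to fit the pattern:

```python
import re

any_groups = re.compile(r"AAA(?:\w{3})*CCC")      # zero or more triplets
two_or_more = re.compile(r"AAA(?:\w{3}){2,}CCC")  # at least two triplets

print(bool(any_groups.fullmatch("AAAGGGCCC")))      # True
print(bool(any_groups.fullmatch("AAAGCCC")))        # False: 1 is not a multiple of 3
print(bool(two_or_more.fullmatch("AAAGGGCCC")))     # False: only one triplet
print(bool(two_or_more.fullmatch("AAAGGGTTTCCC")))  # True
```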

As for the longest match and different frames, I'm not sure. Technically the star is already greedy, that is, it will match the longest string possible, so that should fulfill your requirement. But I fear this feature will interact badly with the requirement not to share substrings within a single frame.

I think the clearest way would be to ask the regex engine for all matches regardless of length and frame (as long as the inner part's length is divisible by 3), then sort them out outside the regular expression.
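
A rough sketch of that idea, under my own assumptions about what "sorting out" should mean: collect every position where a start or stop code begins, then, for each stop code, keep the earliest start code in the same frame. The function name longest_per_stop_by_frame is made up for this example. As with \w{3}, a CCC triplet in the middle of a strand is allowed here.

```python
import re

def longest_per_stop_by_frame(sequence):
    # Zero-width lookaheads report every position, including overlapping ones.
    starts = [m.start() for m in re.finditer(r"(?=AAA)", sequence)]
    stops = [m.start() for m in re.finditer(r"(?=CCC)", sequence)]

    results = []
    for stop in stops:
        # A usable start code ends at or before the stop code and sits a
        # multiple of three characters before it (same reading frame).
        in_frame = [s for s in starts if s + 3 <= stop and (stop - s) % 3 == 0]
        if in_frame:
            results.append(sequence[min(in_frame):stop + 3])
    return results

print(longest_per_stop_by_frame("AAAGGGAAACCC"))  # ['AAAGGGAAACCC']
print(longest_per_stop_by_frame("AAAAGGCCCCC"))   # ['AAAAGGCCC', 'AAAGGCCCC']
```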

If you really want to use the regex engine for that, there's one way I can think of: run a specific regex three times, once for each frame. These regexes would be:

^(?:\w{3})*AAA(\w{3})*CCC
^(?:\w{3})*\wAAA(\w{3})*CCC
^(?:\w{3})*\w\wAAA(\w{3})*CCC

As you can see, each of them first matches 3k, 3k+1 or 3k+2 characters, so that the AAA start code begins in a different frame. To get the matched part you'll need to inspect the returned match object. I really don't know what will happen with overlapping sequences.
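
Here is how running the three regexes might look, as a sketch rather than a tested solution. I made two changes to the patterns above, both my own assumptions: a capturing group around AAA...CCC so the strand can be read from the match object, and a lazy prefix (*?) so the earliest AAA in each frame is tried first (a greedy prefix would prefer the latest one). The name longest_in_each_frame is hypothetical. Each frame yields at most one strand this way.

```python
import re

def longest_in_each_frame(sequence):
    found = []
    for offset in range(3):
        # Skip 3k + offset characters, then capture AAA ... CCC with a
        # greedy triplet group so the last in-frame CCC is used.
        pattern = r"^(?:\w{3})*?\w{%d}(AAA(?:\w{3})*CCC)" % offset
        m = re.match(pattern, sequence)
        if m:
            found.append(m.group(1))
    return found

print(longest_in_each_frame("AAAAGGCCCCC"))   # ['AAAAGGCCC', 'AAAGGCCCC']
print(longest_in_each_frame("AAAGGGAAACCC"))  # ['AAAGGGAAACCC']
```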
