
Python Regular Expressions: How to repeat a repeat of a pattern?

I am looking at a long strand of DNA nucleotides and am looking for sequences that begin with the start code 'AAA' and end with the stop code 'CCC'. Since nucleotides come in triplets, the number of nucleotides between the start and end of every sequence I find must be a multiple of three.

For example, 'AAAGGGCCC' is a valid sequence but 'AAAGCCC' is not.

In addition, before every stop code, I want the longest strand I can find with respect to a particular reading frame.

For example, if the DNA were 'AAAGGGAAACCC', then both 'AAAGGGAAACCC' and 'AAACCC' would technically be valid, but since they share the same instance of the stop code, I only want the longest strand of DNA, 'AAAGGGAAACCC'. Also, if my strand were 'AAAAGGCCCCC', I must return 'AAAAGGCCC' AND 'AAAGGCCCC' because they are in different reading frames (one starts at an offset of 0 mod 3, the other at 1 mod 3).

While I think I have the code to search for strings that fulfill the multiple-of-3 requirement and don't overlap, I am not sure how to implement the second criterion of keeping the same reading frame. My code below would just return the longest strings that don't overlap, but does not distinguish between reading frames, so in the above example it would catch 'AAAAGGCCC' but not 'AAAGGCCCC':

match = re.finditer(r"AAA\w{3}{%d}BBB$"% (minNucleotide-6, math.ceil((minNucleotide-6)/3))

Sorry for being long-winded and thank you for taking a look!

Use a positive lookahead assertion. This lets the regex be reapplied at every character position in the string, which makes it possible to find all overlapping matches, because a lookahead assertion doesn't consume any characters the way a normal match would. Since you still need to get at some actual text, you can put a capturing group inside the lookahead.

Since re.findall() returns the contents of the capturing groups instead of the full regex matches (which would all be ''), you can use:

>>> import re
>>> re.findall(r"(?=(AAA(?:\w{3})*?CCC))", "AAAAGGCCCC")
['AAAAGGCCC', 'AAAGGCCCC']

As a commented Python function:

import re

def find_overlapping(sequence):
    return re.findall(
        r"""(?=          # Assert that the following regex could be matched here:
        (              # Start of capturing group number 1.
          AAA          # Match AAA.
          (?:          # Start of non-capturing group, matching...
            [AGCT]{3}  # a DNA triplet
          )*?          # repeated any number of times, as few as possible.
          CCC          # Match CCC.
        )              # End of capturing group number 1.
        )              # End of lookahead assertion.""",
        sequence, re.VERBOSE)
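
If you also want to know which reading frame each hit belongs to, one option is to switch from re.findall() to re.finditer(), since a match object exposes the start offset of the capturing group. This is only a sketch: the helper name find_with_frames is made up for illustration, and "frame" here simply means the start offset modulo 3.

```python
import re

# Same lookahead pattern as in the function above, compiled once.
PATTERN = re.compile(r"(?=(AAA(?:[AGCT]{3})*?CCC))")

def find_with_frames(sequence):
    """Return (start, frame, strand) for every overlapping hit (hypothetical helper)."""
    return [(m.start(1), m.start(1) % 3, m.group(1))
            for m in PATTERN.finditer(sequence)]

print(find_with_frames("AAAAGGCCCC"))
# [(0, 0, 'AAAAGGCCC'), (1, 1, 'AAAGGCCCC')]
```

Because the inner quantifier is lazy, each start position is paired with the first CCC that is in frame with it.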

The simplest pattern that comes to mind is:

'AAA(\w{3})*CCC'
            ^^^ stop code
           ^ zero or more of…
    ^     ^ a group of…
     ^^^^^ three characters
 ^^^ start code

If you have additional requirements on the number of three-character groups, like “at least two such groups”, you can now easily replace the star character in the regular expression with what you need.
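
For instance, "at least two such groups" could be written by swapping the star for {2,}. This is a minimal sketch; the non-capturing (?:...) form is my choice here, and the test strings are made up.

```python
import re

# "{2,}" replaces "*": at least two three-character groups between start and stop.
at_least_two = re.compile(r"AAA(?:\w{3}){2,}CCC")

print(bool(at_least_two.fullmatch("AAAGGGCCC")))     # False: only one inner triplet
print(bool(at_least_two.fullmatch("AAAGGGTTTCCC")))  # True: two inner triplets
```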

As for the longest match and different frames, I'm not sure. Technically the star quantifier is already greedy, that is, it will match the longest string possible, so that should fulfill your requirements. But I fear this feature will interact badly with the requirement not to share substrings within a single frame.
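
To make the greedy behaviour concrete, here is a small comparison of the greedy star with its lazy counterpart *?. The input string is a made-up example containing two stop codes in the same frame.

```python
import re

dna = "AAAGGGCCCTTTCCC"  # made-up example with two in-frame CCC stop codes

greedy = re.findall(r"AAA(?:\w{3})*CCC", dna)   # runs on to the last in-frame CCC
lazy = re.findall(r"AAA(?:\w{3})*?CCC", dna)    # stops at the first in-frame CCC

print(greedy)  # ['AAAGGGCCCTTTCCC']
print(lazy)    # ['AAAGGGCCC']
```

So "greedy" gives the longest match for a given start, which runs straight through an earlier stop code. That may or may not be what you want.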

I think the clearest way would be to ask the regex engine to provide you with all matches regardless of length and frame (as long as the inner part's length is divisible by 3), then sort out the situation outside regular expressions.
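
A rough sketch of that "filter afterwards" idea: collect one candidate per start position with an overlapping (lookahead) search, then keep only the earliest start for each stop-code position. The helper name longest_per_stop is hypothetical, and I'm assuming each candidate should end at the first in-frame CCC, hence the lazy *? quantifier.

```python
import re

# One candidate per start offset; the lookahead allows overlapping candidates.
CANDIDATE = re.compile(r"(?=(AAA(?:\w{3})*?CCC))")

def longest_per_stop(sequence):
    """Keep only the longest candidate for each stop-code position (hypothetical helper)."""
    best = {}  # end offset of the stop code -> earliest start seen so far
    for m in CANDIDATE.finditer(sequence):
        start, end = m.span(1)
        # Candidates arrive left to right, so the first start recorded for
        # a given end is the earliest one, i.e. the longest strand.
        best.setdefault(end, start)
    return [sequence[start:end] for end, start in sorted(best.items())]

print(longest_per_stop("AAAGGGAAACCC"))  # ['AAAGGGAAACCC']
print(longest_per_stop("AAAAGGCCCCC"))   # ['AAAAGGCCC', 'AAAGGCCCC']
```

Two candidates that end on the same CCC are necessarily in the same frame, because their lengths are both multiples of three. Grouping by end offset therefore handles the reading-frame requirement without tracking frames explicitly.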

If you really want to use the regex engine to do that, there's one way I can think of: run a specific regex three times, once for each frame. These regexes would be:

^(?:\w{3})*AAA(\w{3})*CCC
^(?:\w{3})*\wAAA(\w{3})*CCC
^(?:\w{3})*\w\wAAA(\w{3})*CCC

As you can see, each of them first matches 3k, 3k+1 or 3k+2 characters, so that the AAA start code begins in a different frame each time. To get the matched part you'll need to inspect the returned match object. And I really don't know what will happen with overlapping sequences.
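
Here is one way the three-pass idea might look in practice. It is a variant of the patterns above rather than a literal copy: I've made the prefix and the inner repeat lazy so that the earliest start and the first in-frame stop are picked, and wrapped the strand in a capturing group so it can be read from the match object. The helper name first_strand_per_frame is made up, and it reports at most one strand per frame.

```python
import re

def first_strand_per_frame(sequence):
    """Try one anchored regex per reading frame (hypothetical helper)."""
    found = {}
    for frame in range(3):
        # Lazy prefix of whole triplets plus `frame` extra characters, so that
        # AAA starts at offset 3k + frame; group 1 is the strand itself.
        pattern = r"^(?:\w{3})*?\w{%d}(AAA(?:\w{3})*?CCC)" % frame
        m = re.match(pattern, sequence)
        if m:
            found[frame] = m.group(1)
    return found

print(first_strand_per_frame("AAAAGGCCCCC"))
# {0: 'AAAAGGCCC', 1: 'AAAGGCCCC'}
```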
