
regex: matching a repeating sequence

I'm trying to construct a regular expression that will match a repeating DNA sequence of 2 characters. These characters can be the same.

The regex should match a 2-character unit that is repeated at least 3 times. Here are some examples:

regex should match on:

  • ATATAT
  • GAGAGAGA
  • CCCCCC

and should not match on:

  • ACAC
  • ACGTACGT

So far I've come up with the following regular expression:

[ACGT]{2}

This matches any sequence of exactly two characters (A, C, G or T). Now I want to repeat this pattern at least three times, so I tried the following regular expressions:

[ACGT]{2}{3,}
([ACGT]{2}){3,}

Unfortunately, the first one raises a 'multiple repeat' error (Python), while the second one simply matches any run of three or more pairs of A, C, G and T, even when the pairs differ from each other.
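To illustrate the problem with the second attempt, here is a minimal sketch using Python's `re` module. The name `naive` and the use of `fullmatch` are my own choices for the demonstration, not part of the original attempt.

```python
import re

# The group is re-evaluated on every repetition, so each pair may differ.
naive = re.compile(r"([ACGT]{2}){3,}")

for seq in ["ATATAT", "ACGTACGT", "ACAC"]:
    # fullmatch requires the whole string to match the pattern
    print(seq, naive.fullmatch(seq) is not None)
```

ATATAT matches as wanted, but so does ACGTACGT, because nothing forces the later pairs to equal the first one. ACAC fails only because it has fewer than three pairs.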

Is there anyone that can help me out with this regular expression? Thanks in advance.

You could perhaps make use of backreferences.

([ATGC]{2})\1{2,}

\1 is a backreference to the first capture group: it matches the exact text that the group captured, so \1{2,} requires at least two more copies of the same pair.
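As a rough sketch of how this could be used from Python (the helper name `is_dinucleotide_repeat` and the longer test sequence are made up for illustration):

```python
import re

# Capture one pair, then require the same pair at least two more times.
repeat = re.compile(r"([ACGT]{2})\1{2,}")

def is_dinucleotide_repeat(seq):
    """Return True if the whole string is one pair repeated 3+ times."""
    return repeat.fullmatch(seq) is not None

for seq in ["ATATAT", "GAGAGAGA", "CCCCCC", "ACAC", "ACGTACGT"]:
    print(seq, is_dinucleotide_repeat(seq))

# search() finds a repeat embedded in a longer sequence
m = repeat.search("TTGAGAGAGATT")
print(m.group(0), m.group(1))
```

The first three examples from the question match and the last two do not. With `search`, `m.group(0)` is the full repeat found and `m.group(1)` is the repeated pair.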


One:

(AT){3}

Two:

(GA){4}

Three:

C{6}

Combining them!

(C{6}|(GA){4}|(AT){3})
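A quick check of the combined pattern in Python, as a sketch only (the name `hardcoded` and the two extra test strings are my own additions):

```python
import re

# Alternation of the three hard-coded repeats from above
hardcoded = re.compile(r"(C{6}|(GA){4}|(AT){3})")

for seq in ["ATATAT", "GAGAGAGA", "CCCCCC", "TGTGTG", "ATATATAT"]:
    print(seq, hardcoded.fullmatch(seq) is not None)
```

This accepts the three example strings, but since each pair and each repeat count is spelled out, it does not accept other pairs such as TGTGTG, or a different number of repeats such as ATATATAT, when the whole string has to match.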
