Regular expression help

Question

I am trying to create a regex in Python 3 that matches 7 characters (e.g. >AB0012), followed by an unknown number of characters, followed by another 6 characters (e.g. aaabbb or bbbaaa). My input string might look like this:

>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

This is the regex that I have come up with:

matches = re.findall(r'(>.{7})(aaabbb|bbbaaa)', mystring)  
print(matches)

The output I am trying to produce would look like this:

[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]

I read through the Python documentation, but I couldn't find how to match an unknown distance between two portions of a regex. Is there some sort of wildcard character that would allow me to complete my regex? Thanks in advance for the help!

EDIT:
If I use *? in my code like this:

mystring = str(input("Paste promoters here: "))
matches = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', mystring)
print(matches)

My output looks like this:
[('>CD00192', 'aaabbb'), ('', 'bbbaaa'), ('', 'aaabbb')]

The second and third items in the list are missing >CD00192 and >ZP01990, respectively. How can I make the regex include these characters in the list?

Answer 1

Here's an approach that doesn't use regular expressions. Split on ">" (your data starts from the 2nd element onwards). Since you don't care what the first 7 characters of each piece are, check the 8th to the 13th character (i[7:13]) against the two patterns.

>>> string=""" AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa""" 
>>> for i in string.split(">")[1:]:
...   if i[7:13] in ["aaabbb","bbbaaa"]:
...     print(">" + i[:13])
...
>CD00192aaabbb
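
The snippet above only looks at the six characters right after each header, which is why it reports >CD00192aaabbb and nothing else. If the patterns can sit anywhere in a record, a rough sketch of the same split-based idea could scan the whole record instead. The function name find_motifs and its parameters are my own, not from the question, and the short sample string is made up for illustration:

```python
def find_motifs(text, motifs=("aaabbb", "bbbaaa"), header_len=7):
    """Return (header, motif) pairs without using regular expressions.

    Assumes every record starts with '>' followed by header_len characters.
    Motifs are reported left to right and never overlap.
    """
    results = []
    # Anything before the first '>' is ignored, hence the [1:].
    for record in text.split(">")[1:]:
        header = ">" + record[:header_len]
        body = record[header_len:]
        pos = 0
        while pos < len(body):
            for motif in motifs:
                if body.startswith(motif, pos):
                    results.append((header, motif))
                    pos += len(motif)  # skip past the hit
                    break
            else:
                pos += 1  # no motif starts here, move on
    return results


sample = ">AB0012xxxxaaaa>CD00192aaabbbllyybbbaaayy>ZP01990000mmaaabbbaaaa"
print(find_motifs(sample))
```

Because the scan jumps past each hit, the bbbaaa hidden inside aaabbbaaa is not reported, which seems to be what the expected output in the question wants.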

Answer 2

My code also gives the positions.

Here's the simple version of this code:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')

dic = OrderedDict()


# Finding the result
for mat in regx.finditer(ch):
    chunk, head = mat.groups()
    headstart = mat.start()
    # Each entry: (absolute position, offset within the chunk, matched text)
    dic[(headstart, head)] = [(headstart + six.start(), six.start(), six.group())
                              for six in rag.finditer(chunk)]


# Displaying the result
for (headstart, head), li in dic.items():
    print('{:>10} {}'.format(headstart, head))
    for x in li:
        print('{0[0]:>10} {0[1]:>6} {0[2]}'.format(x))

result

>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa 

        24 CD00192
        31      7 aaabbb
        41     17 bbbaaa
        52     28 bbbaaa
        62     38 aaabbb
        69 ZP01990
        95     26 aaabbb
       136 SE45789
       148     12 aaabbb
       172     36 bbbaaa

The same code, written in a functional style using generators:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')

# Lazily yields ((chunk, head), start position) for every matching record
gen = ((mat.groups(), mat.start()) for mat in regx.finditer(ch))


dic = OrderedDict(((headstart, head),
                   [(headstart + six.start(), six.start(), six.group())
                    for six in rag.finditer(chunk)])
                  for (chunk, head), headstart in gen)


print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]:>6} {0[2]}'.format, li))
                for (headstart, head), li in dic.items()))


EDIT

I measured the execution times.

For each version, I timed building the dictionary and displaying it separately.

The version using generators (the second one) displays the result 7.4 times faster (0.020 seconds) than the first one (0.148 seconds).

Surprisingly to me, though, the generator version takes 47 % more time (0.000718 seconds) than the first one (0.000489 seconds) to build the dictionary.
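
For anyone who wants to repeat this kind of comparison, here is a minimal sketch using the standard timeit module. It is not the harness I used for the figures above. The function names build_with_loop and build_with_generator are made up for this example, the input is shortened, and the numbers will differ from machine to machine:

```python
import re
import timeit
from collections import OrderedDict

ch = ('>AB0012xxxxaaaaaaaaaaaa'
      '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')


def build_with_loop(text):
    """Build the dictionary with an explicit for loop."""
    dic = OrderedDict()
    for mat in regx.finditer(text):
        chunk, head = mat.groups()
        headstart = mat.start()
        dic[(headstart, head)] = [(headstart + six.start(), six.start(), six.group())
                                  for six in rag.finditer(chunk)]
    return dic


def build_with_generator(text):
    """Build the same dictionary from a generator expression."""
    gen = ((mat.groups(), mat.start()) for mat in regx.finditer(text))
    return OrderedDict(((headstart, head),
                        [(headstart + six.start(), six.start(), six.group())
                         for six in rag.finditer(chunk)])
                       for (chunk, head), headstart in gen)


for func in (build_with_loop, build_with_generator):
    # Best of 3 runs of 1000 calls each, to reduce noise
    best = min(timeit.repeat(lambda: func(ch), number=1000, repeat=3))
    print('{:<22} {:.6f} s per 1000 calls'.format(func.__name__, best))
```

Taking the minimum of several repeats is the usual way to reduce the influence of whatever else the machine is doing at the time.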


EDIT 2

Another way to do it:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')


regx = re.compile('((?<=>).{7})|(aaabbb|bbbaaa)')

def collect(ch):
    li = []
    dic = OrderedDict()

    gen = ((x.start(), x.group(1), x.group(2)) for x in regx.finditer(ch))
    for st, g1, g2 in gen:
        if g1:
            # New header: store the hits collected for the previous one
            if li:
                dic[(stprec, g1prec)] = li
            li, stprec, g1prec = [], st, g1
        elif g2:
            li.append((st, g2))
    # Store the hits of the last header
    if li:
        dic[(stprec, g1prec)] = li
    return dic


dic = collect(ch)

print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10}   {0[1]}'.format, li))
                for (headstart, head), li in dic.items()))

result

>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa 

        24 CD00192
        31   aaabbb
        41   bbbaaa
        52   bbbaaa
        62   aaabbb
        69 ZP01990
        95   aaabbb
       136 SE45789
       148   aaabbb
       172   bbbaaa

This code computes dic in 0.00040 seconds and displays it in 0.0321 seconds.


EDIT 3

To answer your question: you have no choice but to keep the current header ('CD00192', 'ZP01990', 'SE45789', etc.) under a name. (I don't like to say "in a variable" in Python, because there are no variables in Python, but you can read "under a name" as if I had written "in a variable".)

And for that, you must use finditer().

Here's the code for this solution:

import re

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')

regx = re.compile('(>.{7})|(aaabbb|bbbaaa)')

matches = []
for mat in regx.finditer(ch):
    g1, g2 = mat.groups()
    if g1:
        # A header: remember it for the hits that follow
        head = g1
    else:
        matches.append((head, g2))

print(matches)

result

>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa 

[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]

My earlier versions are more complicated because they also record the positions and gather the 'aaabbb' and 'bbbaaa' hits of each header ('CD00192', 'ZP01990', 'SE45789', etc.) in a list.
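
One thing to watch with the simple loop in this EDIT 3: if 'aaabbb' or 'bbbaaa' ever showed up before the first header, the name head would not be bound yet and the loop would raise a NameError. That never happens with the sample data, but here is a hedged sketch of a guarded version. The function name pair_headers_with_motifs and the test string are made up for the illustration:

```python
import re

regx = re.compile('(>.{7})|(aaabbb|bbbaaa)')


def pair_headers_with_motifs(text):
    """Return (header, motif) pairs, ignoring motifs seen before any header."""
    matches = []
    head = None  # no header seen yet
    for mat in regx.finditer(text):
        g1, g2 = mat.groups()
        if g1:
            head = g1
        elif head is not None:
            matches.append((head, g2))
    return matches


# The leading 'aaabbb' has no header in front of it and is skipped.
text = 'aaabbb>CD00192aaabbbxxbbbaaa>QD1547zzzzzzz'
print(pair_headers_with_motifs(text))
```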

Answer 3

* matches zero or more repetitions of the preceding element, so a* matches "", "a", "aa", etc., and .* matches any run of characters. + matches one or more repetitions.

You will probably also want to make the quantifier (+ or *) lazy by using +? or *?.

See regular-expressions.info for more details.
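
To make the greedy/lazy difference concrete, here is a small sketch on a shortened, made-up record. It also shows that a lazy quantifier on its own yields only the first pattern after each header when used with findall, so it may not be enough for the output asked for in the question:

```python
import re

record = ">CD00192aaabbbllyybbbaaayy"

# Greedy: .* grabs as much as it can, so the LAST pattern is captured.
greedy = re.search(r"(>.{7}).*(aaabbb|bbbaaa)", record).groups()

# Lazy: .*? grabs as little as it can, so the FIRST pattern is captured.
lazy = re.search(r"(>.{7}).*?(aaabbb|bbbaaa)", record).groups()

print(greedy)
print(lazy)

# With findall, each match needs its own header, so the second
# pattern of the first record is skipped.
two_records = record + ">ZP01990000aaabbb"
pairs = re.findall(r"(>.{7}).*?(aaabbb|bbbaaa)", two_records)
print(pairs)
```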

Answer 4

Try this:

>>> r1 = re.findall(r'(>.{7})[^>]*?(aaabbb)', s)  
>>> r2 = re.findall(r'(>.{7})[^>]*?(bbbaaa)', s)  
>>> r1 + r2
[('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]
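
Note that each findall call above reports at most one hit per header, and the combined list is grouped by pattern rather than by position. If every hit is wanted in reading order, one possible sketch is to match each record first and then search its body. The names HEADER_AND_BODY, MOTIF and motifs_per_header are made up for this example, and it assumes every header is '>' plus exactly 7 characters:

```python
import re

HEADER_AND_BODY = re.compile(r"(>.{7})([^>]*)")
MOTIF = re.compile(r"aaabbb|bbbaaa")


def motifs_per_header(text):
    """Return (header, motif) pairs in the order they occur in text."""
    pairs = []
    for record in HEADER_AND_BODY.finditer(text):
        header, body = record.groups()
        # findall returns the non-overlapping hits from left to right
        pairs.extend((header, motif) for motif in MOTIF.findall(body))
    return pairs


s = ">AB0012xxxxaaaa>CD00192aaabbbllyybbbaaayy>ZP01990000mmaaabbbaaaa"
print(motifs_per_header(s))
```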

Regular expression help

Question

4 answers

solution1
5 2011-04-23 14:21:33

solution2
1 ACCEPTED 2011-04-23 16:24:40

EDIT

EDIT 2

EDIT 3

solution3
0 2011-04-23 14:13:04

solution4
0 2011-04-23 16:52:05
