I am trying to create a regex in Python 3 that matches 7 characters (e.g. >AB0012), followed by an unknown number of characters, and then another 6 characters (e.g. aaabbb or bbbaaa). My input string might look like this:
>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
This is the regex that I have come up with:
matches = re.findall(r'(>.{7})(aaabbb|bbbaaa)', mystring)
print(matches)
The output I am trying to produce would look like this:
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
I read through the Python documentation, but I couldn't find how to match an unknown distance between two portions of a regex. Is there some sort of wildcard character that would allow me to complete my regex? Thanks in advance for the help!
EDIT:
If I use *? in my code like this:
mystring = str(input("Paste promoters here: "))
matches = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', mystring)
print(matches)
My output looks like this:
[('>CD00192', 'aaabbb'), ('', 'bbbaaa'), ('', 'aaabbb')]
The second and third items in the list are missing the >CD00192 and >ZP01990, respectively. How can I have the regex include these characters in the list?
Here's a non-regex approach. Split on ">" (your data starts from the second element onwards). Then, since you don't care what the first 7 characters are, check the six characters from the 8th through the 13th.
>>> string=""" AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"""
>>> for i in string.split(">")[1:]:
...     if i[7:13] in ["aaabbb", "bbbaaa"]:
...         print(">" + i[:13])
...
>CD00192aaabbb
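The loop above only inspects the six characters directly after each header, so a motif that appears later in the record is missed. As a minimal Python 3 sketch of the same split-based idea, the function below scans the whole record for either motif. The name `find_motifs`, its parameters, and the shortened sample string are illustrative assumptions, not part of the question.

```python
def find_motifs(text, motifs=("aaabbb", "bbbaaa"), header_len=7):
    """Return (header, motif) pairs in order of appearance, without regex."""
    results = []
    # Everything before the first ">" is not a record, so skip element 0.
    for record in text.split(">")[1:]:
        header = ">" + record[:header_len]
        body = record[header_len:]
        pos = 0
        while pos < len(body):
            # Find the earliest occurrence of any motif at or after pos.
            hits = [(body.find(m, pos), m) for m in motifs]
            hits = [(i, m) for i, m in hits if i != -1]
            if not hits:
                break
            i, m = min(hits)
            results.append((header, m))
            pos = i + len(m)  # continue after this hit (non-overlapping)
    return results


# Shortened, hypothetical sample in the same shape as the question's input.
sample = ">AB0012xxxxaaaa>CD00192aaabbbllyybbbaaayy>ZP01990000mmxxaaabbbaaaa"
print(find_motifs(sample))
# [('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
```

Matches are taken left to right and do not overlap, which mirrors how `re.findall` would treat an `aaabbb|bbbaaa` alternation.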
I have code that also gives the positions.
Here's the simple version of this code:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
dic = OrderedDict()
# Finding the result
for mat in regx.finditer(ch):
    chunk, head = mat.groups()
    headstart = mat.start()
    # Each entry: (position in ch, position within the chunk, matched motif)
    dic[(headstart, head)] = [(headstart + six.start(), six.start(), six.group())
                              for six in rag.finditer(chunk)]
# Displaying the result
for (headstart, head), li in dic.items():
    print('{:>10} {}'.format(headstart, head))
    for x in li:
        print('{0[0]:>10} {0[1]:>6} {0[2]}'.format(x))
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 8 aaabbb
41 18 bbbaaa
52 29 bbbaaa
62 39 aaabbb
69 ZP01990
95 27 aaabbb
136 SE45789
148 13 aaabbb
172 37 bbbaaa
The same code in a functional style, using generators:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')
# Lazily yields ((chunk, head), start position) for each matching record
gen = ((mat.groups(), mat.start()) for mat in regx.finditer(ch))
dic = OrderedDict(((headstart, head),
                   [(headstart + six.start(), six.start(), six.group())
                    for six in rag.finditer(chunk)])
                  for (chunk, head), headstart in gen)
# The built-in map is lazy in Python 3, so it replaces itertools.imap
print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]:>6} {0[2]}'.format, li))
                for (headstart, head), li in dic.items()))
I measured the execution times.
For each version, I timed building the dictionary and displaying it separately.
The generator version (the second) displays the result 7.4 times faster (0.020 seconds) than the first version (0.148 seconds).
Surprisingly, though, the generator version takes 47% more time (0.000718 seconds) than the first (0.000489 seconds) to build the dictionary.
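Timings like these depend on the machine and the interpreter, so they are worth re-measuring locally. Here is a minimal sketch using the standard `timeit` module. `build_dict` is a hypothetical wrapper around the dictionary-building loop, the input is shortened, and the repeat counts are arbitrary.

```python
import re
import timeit

ch = ('>AB0012xxxxaaaaaaaaaaaa'
      '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'
      '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')


def build_dict(text):
    # Hypothetical helper: the loop version's logic, wrapped so it can be timed.
    dic = {}
    for mat in regx.finditer(text):
        chunk, head = mat.groups()
        headstart = mat.start()
        dic[(headstart, head)] = [
            (headstart + six.start(), six.start(), six.group())
            for six in rag.finditer(chunk)
        ]
    return dic


# Take the best of several repeats to reduce noise from other processes.
best = min(timeit.repeat(lambda: build_dict(ch), number=1000, repeat=5))
print('{:.6f} seconds per call'.format(best / 1000))
```

Timing the display step separately works the same way, although printing to a terminal usually dominates and varies a lot between environments.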
Another way to do it:
import re
from collections import OrderedDict
ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile(r'((?<=>).{7})|(aaabbb|bbbaaa)')
def collect(ch):
    li = []
    dic = OrderedDict()
    gen = ((x.start(), x.group(1), x.group(2)) for x in regx.finditer(ch))
    for st, g1, g2 in gen:
        if g1:
            # New header: store the motifs collected for the previous one
            if li:
                dic[(stprec, g1prec)] = li
            li, stprec, g1prec = [], st, g1
        elif g2:
            li.append((st, g2))
    # Store the motifs collected for the last header
    if li:
        dic[(stprec, g1prec)] = li
    return dic
dic = collect(ch)
# The built-in map is lazy in Python 3, so it replaces itertools.imap
print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]}'.format, li))
                for (headstart, head), li in dic.items()))
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
24 CD00192
31 aaabbb
41 bbbaaa
52 bbbaaa
62 aaabbb
69 ZP01990
95 aaabbb
136 SE45789
148 aaabbb
172 bbbaaa
This code computes dic in 0.00040 seconds and displays it in 0.0321 seconds.
To answer your question: you have no choice but to keep the current header ('CD00192', 'ZP01990', 'SE45789', etc.) under a name. (I don't like to say "in a variable" in Python, because there are no variables in Python, but you can read "under a name" as if I had written "in a variable".)
For that, you must use finditer().
Here's the code for this solution:
import re
ch = '>AB0012xxxxaaaaaaaaaaaa'\
'>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
'>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
'>QD1547zzzzzzzzjjjiii'\
'>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'
print(ch, '\n')
regx = re.compile(r'(>.{7})|(aaabbb|bbbaaa)')
matches = []
for mat in regx.finditer(ch):
    g1, g2 = mat.groups()
    if g1:
        head = g1  # remember the most recent header
    else:
        matches.append((head, g2))
print(matches)
result
>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa
[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]
My earlier code samples are more complicated because they also capture the positions and gather the 'aaabbb' and 'bbbaaa' values for each header ('CD00192', 'ZP01990', 'SE45789', etc.) in a list.
Zero or more characters can be matched using *, so a* would match "", "a", "aa", etc. + matches one or more characters.
You will perhaps want to make the quantifier (+ or *) lazy by using +? or *? as well.
See regular-expressions.info for more details.
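To make the greedy versus lazy distinction concrete, here is a small sketch. The sample string is made up for illustration; it contains the motif twice after one header.

```python
import re

# Hypothetical record: header at 0-7, motif at 8-13 and again at 16-21.
text = '>CD00192aaabbbxxaaabbb'

# Greedy: [^>]* grabs as much as it can, so the LAST motif is captured.
greedy = re.search(r'(>.{7})[^>]*(aaabbb)', text)
# Lazy: [^>]*? grabs as little as it can, so the FIRST motif is captured.
lazy = re.search(r'(>.{7})[^>]*?(aaabbb)', text)

print(greedy.start(2))  # 16
print(lazy.start(2))    # 8

# a* also matches the empty string, while a+ needs at least one "a".
print(re.fullmatch(r'a*', '') is not None)  # True
print(re.fullmatch(r'a+', '') is not None)  # False
```

Using `[^>]` instead of `.` for the filler keeps a match from running past the next header.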
Try this:
>>> r1 = re.findall(r'(>.{7})[^>]*?(aaabbb)', s)
>>> r2 = re.findall(r'(>.{7})[^>]*?(bbbaaa)', s)
>>> r1 + r2
[('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]
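Two separate findall calls report only the first occurrence of each motif per header, and concatenating the lists groups the results by motif rather than by position. One possible alternative is to match each record first and then search its body for both motifs. This is a sketch that assumes every record starts with ">" followed by a seven-character ID; the name `pairs_in_order` and the shortened sample string are illustrative.

```python
import re

record_re = re.compile(r'(>.{7})([^>]*)')   # header, then everything up to the next ">"
motif_re = re.compile(r'aaabbb|bbbaaa')


def pairs_in_order(text):
    """Return (header, motif) pairs in the order they occur in text."""
    pairs = []
    for header, body in record_re.findall(text):
        # Non-overlapping motif matches inside this record only.
        for motif in motif_re.findall(body):
            pairs.append((header, motif))
    return pairs


# Shortened, hypothetical sample in the same shape as the question's input.
s = '>AB0012xxxxaaaa>CD00192aaabbbllyybbbaaayy>ZP01990000mmxxaaabbbaaaa'
print(pairs_in_order(s))
# [('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
```

Because the motif search is non-overlapping, the `bbbaaa` that overlaps the `aaabbb` in the last record is not reported, which matches the output requested in the question.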