I'm trying to write a script that reads a file and extracts two values once a certain word is found. In this case, when the string 'exon' is encountered, it should save the two integers that follow it.
I started by creating empty lists:
exon_start = []
exon_end = []
Here is an example of the simplified data I am using:
for line in data:
    print line
>>>
exon 1..35
/gene="CDKN1A"
CDS 73..567
/translation="MSEPAGDVRQNPCGSKACRRLFGPVDSEQLSRDCDALMAGCIQE
ARERWNFDFVTETPLEGDFAWERVRGLGLPKLYLPTGPRRGRDELGGGRRPGTSPALL
QGTAEEDHVDLSLSCTLVPRSGEQAEGSPGGPGDSQGRKRRQTSMTDFYHSKRRLIFS
KRKP"
misc_feature 76..78
/gene="CDKN1A"
exon 518..2106
/gene="CDKN1A"
I tried importing the regular expression module so I could use the re.findall() function:
import re
indx_line = range(0,len(data))
# this pairs each line of the data with its position in the index
I'm having trouble recognizing the 'exon' phrase within each individual line. First I just tried to identify which lines of the text contain 'exon', to see if re.findall() was working, and I put:
for p,line in zip(indx_line,data):
    if re.findall(r'exon',line) is True:
        print p
and I got None.
When I put:
for p,line in zip(indx_line,data):
    exon_test = re.findall(r'exon',line)
    print exon_test
I got a bunch of [] for the lines that did not contain 'exon', and ['exon'] for the lines that did contain it. So I know that I can use re.findall() to find every occurrence of 'exon' within each of the strings.
I just need to find out how to say that when it finds 'exon', it should look in that line until it finds '..' and then append the integers flanking it to their corresponding lists, i.e.:
exon_start = [1,518]
exon_end = [35,2106]
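The problem is in this line:
if re.findall(r'exon',line) is True:
re.findall() does not return True or False. It returns a list of matches, and a list is never the same object as True, so the comparison with is True always evaluates to False. An empty list is falsy and a non-empty list is truthy, so the list can be tested directly. Example: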
>>> mystr = '123 exon'
>>> import re
>>> re.findall(r'exon', mystr)
['exon']
>>> re.findall(r'exon', mystr) is True
False
>>> bool(re.findall(r'exon',mystr))
True
>>> if re.findall(r'exon', mystr):
...     print 'true'
...
true
Changing the original code to the following should make it work:
for p,line in zip(indx_line,data):
    if re.findall(r'exon',line):
        print p
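As a side note, the separate index list is not strictly needed, because enumerate() yields each line together with its position. The following is only a sketch: it assumes data is a list of lines, reuses a few lines from the question as sample input, and lines_containing is a hypothetical helper name, not something from the original code.

```python
import re

# Sample lines copied from the question; in the real script, data would
# come from the file being read.
data = [
    'exon 1..35',
    '/gene="CDKN1A"',
    'CDS 73..567',
    'exon 518..2106',
]

def lines_containing(lines, pattern):
    """Return the indices of the lines in which pattern matches at least once."""
    # re.findall() returns a list; a non-empty list is truthy.
    return [i for i, line in enumerate(lines) if re.findall(pattern, line)]

exon_lines = lines_containing(data, r'exon')
print(exon_lines)
```

With the sample lines above, this reports positions 0 and 3, the two lines that contain 'exon'.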
Edit: As @TimPietzcker pointed out, you don't need to use re at all for this case. And to address your second question of getting the numbers flanking the '..', here is some code that could be helpful:
>>> line = ' exon 1..35'
>>> if 'exon' in line:
...     ranges = line.split()[1].split('..')
...     print ranges
...
['1', '35']
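To get all the way to the exon_start and exon_end lists from the question, the same idea can be applied to every line. This is a minimal sketch, not the only way to do it: it swaps the split() approach for a single regular expression with two capture groups, assumes each exon location is written as two plain integers separated by '..', and uses the hypothetical names EXON_RE and extract_exons. Lines whose location has any other form would simply be skipped.

```python
import re

# Assumed format: the word 'exon', whitespace, then start..end as integers.
EXON_RE = re.compile(r'\bexon\s+(\d+)\.\.(\d+)')

def extract_exons(lines):
    """Collect the start and end positions of every exon line."""
    exon_start, exon_end = [], []
    for line in lines:
        match = EXON_RE.search(line)
        if match:  # search() returns None when the line has no exon range
            exon_start.append(int(match.group(1)))
            exon_end.append(int(match.group(2)))
    return exon_start, exon_end

# Sample lines copied from the question.
data = [
    'exon 1..35',
    '/gene="CDKN1A"',
    'CDS 73..567',
    'misc_feature 76..78',
    '/gene="CDKN1A"',
    'exon 518..2106',
]

exon_start, exon_end = extract_exons(data)
print(exon_start)  # [1, 518]
print(exon_end)    # [35, 2106]
```

The values are converted with int() so the lists hold integers, as in the desired output, rather than the strings that split() returns.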