
In a list of strings, find a phrase within a string and append the two integers (x..y) in that string to a list (Python)

I'm trying to write a script that reads a file and extracts two values once a certain word is found. In this case, when the string 'exon' is encountered, it should save the two integers that follow it.

I started by creating empty lists:

exon_start = []
exon_end = []

Here is an example of the simplified data I am using:

for line in data:
    print line

>>>

 exon            1..35
                 /gene="CDKN1A"

 CDS             73..567
                 /translation="MSEPAGDVRQNPCGSKACRRLFGPVDSEQLSRDCDALMAGCIQE
                 ARERWNFDFVTETPLEGDFAWERVRGLGLPKLYLPTGPRRGRDELGGGRRPGTSPALL
                 QGTAEEDHVDLSLSCTLVPRSGEQAEGSPGGPGDSQGRKRRQTSMTDFYHSKRRLIFS
                 KRKP"

 misc_feature    76..78
                 /gene="CDKN1A"


 exon            518..2106
                 /gene="CDKN1A"

I tried importing the regular expression module so I could use the re.findall() function:

indx_line = range(0,len(data))

# so this relates each line of the data to a specific number in the index

I'm having trouble recognizing the 'exon' phrase within each individual line. First, I just tried to identify which lines of the text contained 'exon', to see whether re.findall() was working, and I put:

for p,line in zip(indx_line,data):

    if re.findall(r'exon',line) is True:
        print p

and I got None

When I put:

for p,line in zip(indx_line,data):

    exon_test = re.findall(r'exon',line)
    print exon_test

I got a bunch of [] for the lines that did not contain 'exon', and the lines that did contain 'exon' gave me 'exon'. So I know that I can use re.findall() to find every occurrence of 'exon' within each of the strings.

I just need to find out how to say that, when it finds 'exon', it should look along that line until it finds '..' and then append the integers flanking it to their corresponding lists, i.e.:

exon_start = [1,518]
exon_end = [35,2106]

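The problem is in the line if re.findall(r'exon',line) is True:, because re.findall() does not return True or False. It returns a list of matches, and a list is never the object True, so the is True test always fails. Example: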

>>> mystr = '123 exon'
>>> import re
>>> re.findall(r'exon', mystr)
['exon']
>>> re.findall(r'exon', mystr) is True
False
>>> bool(re.findall(r'exon',mystr))
True
>>> if re.findall(r'exon', mystr):
...     print 'true'
... 
true

Changing the original code to:

for p,line in zip(indx_line,data):

    if re.findall(r'exon',line):
        print p

should make it work.
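
As a side note, the same check can be written without building the index list by hand. The sketch below is only one possible variant, not the original poster's code: find_matching_indices and sample are hypothetical names introduced for illustration, and print is called as a function on the assumption that you run it under Python 3. It uses enumerate() to number the lines and re.search(), which returns a match object (truthy) or None (falsy).

```python
import re

def find_matching_indices(lines, pattern=r'exon'):
    """Return the indices of the lines in which `pattern` occurs."""
    matches = []
    for index, line in enumerate(lines):
        # re.search gives a match object (truthy) or None (falsy),
        # so it can be used directly as the condition.
        if re.search(pattern, line):
            matches.append(index)
    return matches

# A few lines taken from the example data in the question.
sample = [
    ' exon            1..35',
    '                 /gene="CDKN1A"',
    ' CDS             73..567',
    ' exon            518..2106',
]
print(find_matching_indices(sample))
```

Since only the truth value is needed here, re.search() is enough and avoids building a list of every occurrence the way re.findall() does.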


Edit: As @TimPietzcker pointed out, you don't need to use re at all for this case. To address your second question, getting the numbers flanking '..', here is some code that could be helpful:

>>> line = ' exon            1..35'
>>> if 'exon' in line:
...     ranges = line.split()[1].split('..')
...     print ranges
...
['1', '35']
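
Putting that together for the whole file, a minimal sketch could look like the following. It assumes every exon line has the plain form "exon start..end" as in the sample data; anything else (for example a location that is not two bare integers) is simply skipped. collect_exon_ranges and sample are hypothetical names used for illustration, and the code is written for Python 3.

```python
def collect_exon_ranges(lines):
    """Collect start and end positions from lines of the form 'exon  1..35'."""
    exon_start = []
    exon_end = []
    for line in lines:
        fields = line.split()
        # Only lines whose first field is exactly 'exon' are of interest.
        if len(fields) < 2 or fields[0] != 'exon':
            continue
        # partition splits on the first '..' and yields (before, sep, after).
        start, sep, end = fields[1].partition('..')
        # Assumption: both sides are plain non-negative integers.
        if sep and start.isdigit() and end.isdigit():
            exon_start.append(int(start))
            exon_end.append(int(end))
    return exon_start, exon_end

# Lines taken from the example data in the question.
sample = [
    ' exon            1..35',
    '                 /gene="CDKN1A"',
    ' CDS             73..567',
    ' misc_feature    76..78',
    ' exon            518..2106',
    '                 /gene="CDKN1A"',
]
exon_start, exon_end = collect_exon_ranges(sample)
print(exon_start)  # [1, 518]
print(exon_end)    # [35, 2106]
```

Comparing the first field with 'exon' rather than testing 'exon' in line avoids picking up lines that merely mention the word somewhere else, such as inside a qualifier.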
