Python regex: getting everything (except \n) between two characters in a multiline string

I have a file like this as input:

>X0
CUUGACGAUCA
CGCAUCG
>X55
UACGGCGG
UUCAGC
AUCG
>X300
AAACCCGGGG

and I need to get the concatenation of lines between '>' characters:

CUUGACGAUCACGCAUCG
UACGGCGGUUCAGCAUCG
AAACCCGGGG

My attempt was to use "re.match(r'^>.*\\n(.*)>.*' ,a,re.DOTALL)" and then delete '\\n' from each match, but the regex is not returning anything. Where am I wrong?

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

That being said, why not use plain string processing, which is much easier to follow?

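tmp = []     # lines of the record currently being read
seqs = []    # finished, concatenated sequences
with open('txtfile') as f:
    for line in f:
        if line.startswith('>'):
            # A header line closes the previous record, if there was one.
            if tmp:
                seqs.append(''.join(tmp))
            tmp = []
        else:
            tmp.append(line.strip())
# The last record has no header after it, so flush it here.
if tmp:
    seqs.append(''.join(tmp))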

Alternatively, if you really want to use a regex, you could first strip the newlines and then split on the >X[digits] headers. Note that the first element of the result is an empty string, because the data starts with a header:

re.split(r'>X\d+', re.sub(r'\n', '', data))

But that has the downside that the entire text file has to be loaded into the variable data, which is not ideal for large files (and those are quite common in bioinformatics). So even then, the first approach is preferable memory-wise, since it reads one line at a time and you could process each finished DNA/RNA sequence in turn.
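
To make the memory argument concrete, here is a minimal sketch of the line-by-line approach written as a generator, so that each sequence can be handled as soon as it is complete. The name iter_sequences and the in-memory sample are illustrative assumptions, not part of the original answer; any iterable of lines, such as an open file, should work the same way.

```python
import io


def iter_sequences(lines):
    """Yield one concatenated sequence per '>' record (hypothetical helper)."""
    parts = []
    for line in lines:
        if line.startswith('>'):
            # A header closes the previous record, if there was one.
            if parts:
                yield ''.join(parts)
            parts = []
        else:
            parts.append(line.strip())
    # The last record has no header after it, so flush it here.
    if parts:
        yield ''.join(parts)


# Stand-in for open('txtfile'); lines are consumed one at a time.
sample = io.StringIO(
    ">X0\nCUUGACGAUCA\nCGCAUCG\n>X55\nUACGGCGG\nUUCAGC\nAUCG\n>X300\nAAACCCGGGG\n"
)
sequences = list(iter_sequences(sample))
for seq in sequences:
    print(seq)
```

Calling list() here is only for the demonstration; with a large file you would loop over iter_sequences(f) directly and never hold more than one record in memory.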

I would have simply done:

s = file.read()    # or whatever string you have
for record in s.split('>')[1:]:    # [1:] skips the empty text before the first '>'
    lines = record.split()    # lines[0] is the header, e.g. 'X0'
    print(''.join(lines[1:]))    # the concatenation of lines between '>' characters

A regex will work well for this application, but you need to use a lookahead assertion. This means that the regex looks for, but does not consume, whatever is defined inside the lookahead (?=...), where ... is the pattern you're looking ahead for.

So, incorporating this into a full pattern, you would get this:

>(.+?)(?=>|$)

To break this down: the pattern looks for a > as the starting point and then captures everything up to the point where it sees either another > or the end of the string. The key point is that it doesn't consume the ending >, so that character is still available to start the next match.

You'll also need the DOTALL flag so that . matches newlines as well, and the findall function to return all matches.
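
As a small illustration of why the lookahead matters, the sketch below compares a pattern that consumes the closing > with the lookahead version. The shortened sample data and the variable names are made up for this comparison.

```python
import re

data = ">X0\nCUUGACGAUCA\n>X55\nUACGGCGG\n>X300\nAAACCCGGGG"

# Consuming the closing '>' leaves nothing for the next record to start from.
consuming = re.findall(r'>(.+?)>', data, re.DOTALL)

# The lookahead checks for '>' (or the end of the string) without consuming it.
lookahead = re.findall(r'>(.+?)(?=>|$)', data, re.DOTALL)

print(consuming)   # only the first record is found
print(lookahead)   # all three records are found
```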

So, something like this will work:

#!/usr/bin/env python

import re

string = """>X0
CUUGACGAUCA
CGCAUCG
>X55
UACGGCGG
UUCAGC
AUCG
>X300
AAACCCGGGG"""

res = re.findall(r'>(.+?)(?=>|$)', string, re.DOTALL)

print("results: {0}".format(res))

The output is:

results: ['X0\nCUUGACGAUCA\nCGCAUCG\n', 'X55\nUACGGCGG\nUUCAGC\nAUCG\n', 'X300\nAAACCCGGGG']

See the documentation of Python's re module for more regex details.

If you don't want the newlines in the result, you can then use str.replace to remove them from each item in the list.
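
A minimal sketch of that clean-up step follows. Dropping the header line (X0, X55, ...) is an assumption based on the desired output shown in the question; the answer itself only mentions removing the newlines. The variable names are illustrative.

```python
import re

data = """>X0
CUUGACGAUCA
CGCAUCG
>X55
UACGGCGG
UUCAGC
AUCG
>X300
AAACCCGGGG"""

records = re.findall(r'>(.+?)(?=>|$)', data, re.DOTALL)

sequences = []
for record in records:
    # Split off the header line (e.g. 'X0'), then remove the remaining newlines.
    header, _, body = record.partition('\n')
    sequences.append(body.replace('\n', ''))

print(sequences)
```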

