I'm having a bit of trouble with using regular expressions to extract information from flat files (just text). The files are structured as such:
ID (eg >YAL001C)
Annotations/metadata (short phrases describing origin of ID)
Sequence (very long string of characters, eg KRHDE .... ~500 letters on average)
I am trying to extract only IDs and sequences (skip all the metadata). Unfortunately, list operations alone don't suffice, eg
with open("composition.in","rb") as all_info:
    all_info=all_info.read()
all_info=all_info.split(">")[1:]
because the metadata/annotation part of the text is littered with '>' characters that cause the list that is generated to be incorrectly structured. List comprehensions get very ugly after a certain point, so I am trying the following:
with open("composition.in","rb") as yeast_all:
    yeast_all=yeast_all.read() # convert file to string
## Regular expression to clean up rogue ">" characters
## i.e. "<i>", "<sub>", etc which screw up
## the structure of the eventual list
import re
id_delimeter = r'^>{1}+\w{7,10}+\s'
match=re.search(id_delimeter, yeast_all)
if match:
    print 'found', match.group()
else:
    print 'did not find'
yeast_all=yeast_all.split(id_delimeter)[1:]
All I get is an error message saying "error: multiple repeat".
The IDs are of type:
YAL001C
YGR103W
YKL068W-A
The first character is always ">", followed by capital letters and numbers and sometimes dashes (-). I would like a regular expression that finds all such occurrences, so that I can split the text using it as a delimiter to get the IDs and sequences and leave out the metadata. I am new to regular expressions, so my knowledge of the topic is limited!
Note: Only a single newline between each of the three fields (ID, metadata, sequence)
Try
>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)
You'll find the ID in the group "id" and the sequence in the group "sequence".
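As a minimal sketch of how those two named groups can be read back after a match (the sample record here is made up for illustration and is not taken from the real file):

```python
import re

# The pattern suggested above, written as a raw string.
pattern = re.compile(r'>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)')

# Hypothetical record: ID line, one metadata line containing stray
# '>' characters, then the sequence.
record = ">YAL001C\nsome <i>annotation</i> text\nKRHDE\n"

match = pattern.search(record)
if match:
    seq_id = match.group('id')            # text captured by (?P<id>...)
    sequence = match.group('sequence')    # text captured by (?P<sequence>...)
    print(seq_id)
    # The sequence group may contain newlines, so strip them for display.
    print(sequence.replace('\n', ''))
```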
Explanation:
>                # start with a ">" character
(?P<id>          # capture the ID in group "id"
  [\w-]+         # this matches one or more word characters (A to Z, a to z, digits, and _) or dashes "-"
)
\s               # after the ID, there must be a whitespace character
.*               # consume the metadata part, we have no interest in this
\n               # up to a newline
(?P<sequence>    # finally, capture the sequence data in group "sequence"
  [\w\n]+        # this matches one or more word characters and newlines
)
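As Python code:
import re

text = '''>YKL068W-A
foo
ABCD
>XYZ1234
<><><><>><<<>
LMNOP'''
# Raw string, so the backslashes reach the regex engine unchanged
pattern = r'>(?P<id>[\w-]+)\n.*\n(?P<sequence>\w+)'
# findall returns one (id, sequence) tuple per record
for seq_id, sequence in re.findall(pattern, text):
    print((seq_id, sequence))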
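The snippet above captures the sequence with \w+, so it stops at the first newline. If the sequences in the real file wrap over several lines, one option is to use the [\w\n]+ variant and strip the newlines afterwards. The following is only a sketch under the layout described in the question (ID line, a single metadata line, then the sequence); the function name parse_records and the sample text are invented for illustration.

```python
import re

# Same idea as the first pattern above; [\w\n]+ lets the sequence span lines.
RECORD = re.compile(r'>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)')


def parse_records(text):
    """Return a dict mapping each ID to its sequence with newlines removed."""
    records = {}
    for match in RECORD.finditer(text):
        records[match.group('id')] = match.group('sequence').replace('\n', '')
    return records


# Made-up sample: the metadata lines contain stray '>' characters and the
# first sequence is wrapped over two lines.
sample = (
    ">YAL001C\n"
    "origin <i>S. cerevisiae</i> chr <sub>I</sub>\n"
    "KRHDE\n"
    "AAKL\n"
    ">YKL068W-A\n"
    "<><><><>><<<>\n"
    "LMNOP\n"
)

print(parse_records(sample))
```

Two side notes on the code in the question. The call yeast_all.split(id_delimeter) splits on the literal text of the pattern, because str.split does not interpret regular expressions; re.split is the pattern-aware counterpart. And when reading the file in Python 3, opening it in text mode rather than "rb" makes read() return a str, which is what a str pattern needs.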