Regular expression for parsing sequence IDs

Question

I'm having a bit of trouble using regular expressions to extract information from flat files (just text). The files are structured as follows:

#

ID (e.g. >YAL001C)

Annotations/metadata (short phrases describing origin of ID)

Sequence (very long string of characters, e.g. KRHDE..., ~500 letters on average)

#

I am trying to extract only IDs and sequences (skip all the metadata). Unfortunately, list operations alone don't suffice, e.g.

with open("composition.in","rb") as all_info:
    all_info=all_info.read() 
    all_info=all_info.split(">")[1:]

because the metadata/annotation part of the text is littered with '>' characters that cause the generated list to be incorrectly structured. List comprehensions get very ugly after a certain point, so I am trying the following:

with open("composition.in","rb") as yeast_all:
    yeast_all=yeast_all.read() # convert file to string

## Regular expression to clean up rogue ">" characters
## i.e. "<i>", "<sub>", etc which screw up
## the structure of the eventual list
import re
id_delimeter = r'^>{1}+\w{7,10}+\s' 
match=re.search(id_delimeter, yeast_all)
if match:
    print 'found', match.group()
else:
    print 'did not find'        
yeast_all=yeast_all.split(id_delimeter)[1:]

I get only an error message saying "error: multiple repeat".

The IDs are of type:

YAL001C

YGR103W

YKL068W-A

The first character is always ">", followed by capital letters and numbers and sometimes dashes (-). I would like a RE that could be used to find all such occurrences and split the text using the RE as a delimiter, in order to get IDs and sequences and leave out metadata. I am new to regular expressions, so I have limited knowledge of the topic!

Note: There is only a single newline between each of the three fields (ID, metadata, sequence).

Answer 1

Try

>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)

You'll find the ID in the group "id" and the sequence in the group "sequence".

Explanation:

> # start with a ">" character
(?P<id> # capture the ID in group "id"
    [\w-]+ # this matches one or more word characters (A to Z, a to z, digits, and _) or dashes "-"
)
\s # after the ID, there must be a whitespace character
.* # consume the metadata part, we have no interest in this
\n # up to a newline
(?P<sequence> # finally, capture the sequence data in group "sequence"
    [\w\n]+ # this matches one or more word characters and newlines
)
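
The annotated layout above is close to what Python's re.VERBOSE flag accepts, so the pattern can be compiled with its comments left in place. The following is a minimal sketch under that assumption; the sample string and the names record, sample and pairs are hypothetical and only illustrate the three-field layout described in the question.

```python
import re

# Same pattern as above, written in verbose mode: whitespace is ignored
# and "#" starts a comment that runs to the end of the line.
record = re.compile(r"""
    >                       # a record starts with ">"
    (?P<id>[\w-]+)          # the ID: word characters or dashes
    \s                      # one whitespace character after the ID
    .*                      # the metadata, up to the end of its line
    \n
    (?P<sequence>[\w\n]+)   # the sequence, possibly wrapped over lines
""", re.VERBOSE)

# Hypothetical sample: ID line, metadata line with stray ">", sequence.
sample = ">YAL001C\nfrom <i>yeast</i>\nKRH\nDE\n>YGR103W\nnote > here\nMKV\n"

# The sequence group may contain newlines, so they are removed here.
pairs = [
    (m.group("id"), m.group("sequence").replace("\n", ""))
    for m in record.finditer(sample)
]
print(pairs)
```

Because the metadata line is consumed as part of each match, the ">" characters inside it are never considered as the start of a new record.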

As Python code:

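import re

text = '''>YKL068W-A
foo
ABCD

>XYZ1234
<><><><>><<<>
LMNOP'''

# Raw string so the backslashes reach the regex engine unchanged
pattern = r'>(?P<id>[\w-]+)\n.*\n(?P<sequence>\w+)'

# findall returns one (id, sequence) tuple per record
for seq_id, sequence in re.findall(pattern, text):
    print((seq_id, sequence))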
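
The question also asked about splitting on the ID lines. The last line of the attempt passes the pattern to str.split, which treats its argument as a literal string rather than a regular expression; re.split is the regex-aware counterpart. The "multiple repeat" message most likely comes from `{1}+` and `{7,10}+`, where one quantifier directly follows another. Below is a rough sketch of the split-based approach, assuming each ID sits alone on its own line as the question's note suggests. The sample data and the names id_line, parts and records are hypothetical.

```python
import re

# Hypothetical sample in the question's layout: ID, metadata, sequence.
sample = (
    ">YAL001C\n"
    "origin <i>S. cerevisiae</i> -> chr I\n"
    "KRHDE\n"
    ">YKL068W-A\n"
    "putative <sub>x</sub> protein\n"
    "MKVLA\n"
)

# An ID line: ">" at the start of a line, then capitals, digits or dashes,
# optional trailing blanks, then the newline. With re.MULTILINE, "^" matches
# at every line start. The capturing group makes re.split keep the ID.
id_line = re.compile(r"^>([A-Z0-9-]+)[ \t]*\n", re.MULTILINE)

# parts looks like ['', id1, body1, id2, body2, ...],
# where each body is "metadata\nsequence\n".
parts = id_line.split(sample)

records = {}
for seq_id, body in zip(parts[1::2], parts[2::2]):
    # Drop the first line (metadata) and keep the rest as the sequence.
    _metadata, _, sequence = body.partition("\n")
    records[seq_id] = sequence.replace("\n", "")

print(records)
```

Anchoring on the start of a line is what keeps a ">" in the middle of a metadata line from being treated as a delimiter; a metadata line that itself begins with ">" followed only by capitals, digits or dashes would still be misread as an ID.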

Answer 1: accepted, 2014-10-02 19:47:50
