Regular expression for parsing sequence IDs

Question

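I am having some trouble using regular expressions to extract information from a flat (text-only) file. The file is structured as follows:

#

ID (e.g. >YAL001C)

Annotation/metadata (a short phrase describing where the ID comes from)

Sequence (a very long string, e.g. KRHDE...., ~500 letters on average)

#

I am trying to extract only the IDs and the sequences (skipping all the metadata). Unfortunately, list operations alone are not enough, e.g.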

with open("composition.in","rb") as all_info:
    all_info=all_info.read() 
    all_info=all_info.split(">")[1:]

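because the metadata/annotation part of the text is littered with '>' characters, which leaves the resulting list incorrectly structured. List comprehensions get very ugly beyond a certain point, so I tried the following: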

with open("composition.in","rb") as yeast_all:
    yeast_all=yeast_all.read() # convert file to string

## Regular expression to clean up rogue ">" characters
## i.e. "<i>", "<sub>", etc which screw up
## the structure of the eventual list
import re
id_delimeter = r'^>{1}+\w{7,10}+\s' 
match=re.search(id_delimeter, yeast_all)
if match:
    print 'found', match.group()
else:
    print 'did not find'        
yeast_all=yeast_all.split(id_delimeter)[1:]

All I get is an error message saying "error: multiple repeat".

The IDs look like this:

YAL001C

YGR103W

YKL068W-A

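The first character is always ">", followed by uppercase letters and digits, and sometimes a dash (-). I would like an RE that finds all such occurrences, and I want to use that RE as a delimiter to split the text, so that I get the IDs and sequences and ignore the metadata. I am new to regular expressions, so my knowledge of the subject is limited!

Note: there is only a single newline between the three fields (ID, metadata, sequence).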

Answer 1

Try

>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)

You will find the ID in the group "id" and the sequence in the group "sequence".

Demo

Explanation:

> # start with a ">" character
(?P<id> # capture the ID in group "id"
    [\w-]+ # this matches one or more word characters (A to Z, a to z, digits, and _) or dashes "-"
)
\s+ # after the ID, there must be at least one whitespace character
.* # consume the metadata part, we have no interest in this
\n # up to a newline
(?P<sequence> # finally, capture the sequence data in group "sequence"
    [\w\n]+ # this matches one or more word characters and newlines.
)
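
The annotated pattern above can be used almost verbatim by compiling it with re.VERBOSE, which makes the regex engine ignore whitespace and "#" comments inside the pattern. A minimal sketch; the sample record is made up for illustration:

```python
import re

# The annotated pattern, compiled with re.VERBOSE so that the layout
# whitespace and the "#" comments are not treated as part of the regex.
pattern = re.compile(r'''
    >                      # start with a ">" character
    (?P<id>[\w-]+)         # capture the ID in group "id"
    \s+                    # at least one whitespace character after the ID
    .*                     # consume the metadata part
    \n                     # up to a newline
    (?P<sequence>[\w\n]+)  # capture the sequence data in group "sequence"
''', re.VERBOSE)

# Hypothetical record: ID line, metadata line, sequence line.
match = pattern.search('>YGR103W\nsome metadata\nKRHDE\n')
print(match.group('id'))
print(repr(match.group('sequence')))  # repr shows the captured trailing newline
```

Because the sequence group also accepts newlines, the captured text can end with a newline; strip it if that matters.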

As Python code:

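import re

text= '''>YKL068W-A
foo
ABCD

>XYZ1234
<><><><>><<<>
LMNOP'''

# raw string, so the backslashes reach the regex engine unchanged
pattern= r'>(?P<id>[\w-]+)\n.*\n(?P<sequence>\w+)'

# findall returns one (id, sequence) tuple per record
for seq_id, sequence in re.findall(pattern, text):
    print((seq_id, sequence))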
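
Regarding the original attempt: on the Python version used in the question, `{1}+` stacks two quantifiers, which is what triggers "multiple repeat", and str.split treats its argument as a literal string, not as a regex. If you do want to split on the ID line, re.split accepts a pattern. A minimal sketch, assuming every record starts with ">" at the beginning of a line and that a sequence may wrap over several lines; the sample text and variable names are made up for illustration:

```python
import re

# Hypothetical sample in the layout described in the question.
sample = '''>YAL001C
protein <i>of</i> unknown function
KRHDE
LMNOP
>YKL068W-A
another <sub>note</sub>
ABCD
'''

# With re.MULTILINE, "^" matches at the start of every line, so a ">" in the
# middle of a metadata line is not mistaken for a new record.
# The delimiter is the ID line plus the metadata line; the ID is captured,
# so re.split keeps it in the result.
delimiter = re.compile(r'^>([\w-]+)\s.*\n', re.MULTILINE)

parts = delimiter.split(sample)
# parts is ['', id1, body1, id2, body2, ...]; pair IDs with their bodies.
ids = parts[1::2]
bodies = parts[2::2]

# Join wrapped sequence lines by removing all whitespace.
records = {seq_id: ''.join(body.split()) for seq_id, body in zip(ids, bodies)}
print(records)
```

The dictionary maps each ID to its sequence and the metadata never appears in it.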


Solution 1
0 Accepted 2014-10-02 19:47:50
