Regex error during sequence search

Question

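Thanks in advance. I am trying to extract the last few digits of a code from NCBI, which are the taxon ID. I want the digits after the underscore in this tag, but their length and value vary:

Tag: URS0000D94775_60169

Code: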

import re
taxID = ()
#strip accession numbers into string
mount = open ('mount.txrt', 'r')
accessions = (re.findall ("URS\S{6}", mount))
     for i in accessions:
     taxID.append (i)
 #parse taxa id's from string        
 taxas = ()
 taxas.append (re.findall ('\_?\d+', taxID)) 
 print ( mount)

Answer 1

Use re.findall with the regular expression below. The pattern \d+$ matches the run of digits at the very end of the string:

import re
tag = 'URS0000D94775_60169'
tax_id = re.findall(r'\d+$', tag)[0]
print(tax_id)
# 60169
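Applied to a whole file rather than a single tag, the same idea might look like the sketch below. It assumes each tag has the form URS, then letters or digits, then an underscore, then the taxon ID; the helper name extract_tax_ids and the second sample tag are made up for illustration. Note that re.findall needs a string, so the file contents have to be read first instead of passing the file object.

```python
import re

# Assumed tag shape: "URS" + letters/digits + "_" + digits (the taxon ID).
TAG_PATTERN = re.compile(r"\bURS[0-9A-Za-z]+_(\d+)\b")


def extract_tax_ids(text):
    """Return the digits after the underscore for every URS tag in text."""
    return TAG_PATTERN.findall(text)


# In practice the text would come from a file, for example:
#     with open(path) as handle:
#         text = handle.read()
# A small inline sample keeps this sketch self-contained.
sample = (
    "URS0000D94775_60169 some description\n"
    "URS00001A2B3C_9606 another line\n"
)

print(extract_tax_ids(sample))
# ['60169', '9606']
```

Because the pattern has exactly one capture group, findall returns just the captured taxon IDs instead of the full tags.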

Answer 2

Since your pattern has an optional _ and you want to match the digits after URS, you can use

URS(?:.*?\D)?(\d+)$

import re
s = "URS0000D94775_60169"
print(re.findall(r"URS(?:.*?\D)?(\d+)$", s))

Output

['60169']

If the underscore must be present:

URS[^_]*_(\d+)$

import re
s = "URS0000D94775_60169"
print(re.findall(r"URS[^_]*_(\d+)$", s))

Output

['60169']
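To see how the patterns from both answers differ, a quick comparison on a few inputs can help. This is only a sketch: the second and third tags are invented to show what happens when the underscore is replaced by another character or is missing entirely.

```python
import re

# Only the first tag comes from the question; the other two are made up.
tags = ["URS0000D94775_60169", "URS0000D94775X60169", "URS60169"]

patterns = {
    "end digits": r"\d+$",
    "optional separator": r"URS(?:.*?\D)?(\d+)$",
    "required underscore": r"URS[^_]*_(\d+)$",
}

# For each pattern, collect the findall result for every tag in order.
results = {
    name: [re.findall(pattern, tag) for tag in tags]
    for name, pattern in patterns.items()
}

for name, matches in results.items():
    print(name, matches)
# end digits [['60169'], ['60169'], ['60169']]
# optional separator [['60169'], ['60169'], ['60169']]
# required underscore [['60169'], [], []]
```

Only the last pattern rejects tags without an underscore, which is useful when a missing separator should count as malformed input.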
