Regular Expression Error During Sequence Search
Thanks in advance. I am trying to extract the last few digits of a code, which is a taxon ID from NCBI. I want the digits after the underscore in this tag (60169 here); however, they vary in length and value:
tag: URS0000D94775_60169
code:
import re
taxID = ()
#strip accession numbers into string
mount = open ('mount.txrt', 'r')
accessions = (re.findall ("URS\S{6}", mount))
for i in accessions:
    taxID.append (i)
#parse taxa id's from string
taxas = ()
taxas.append (re.findall ('\_?\d+', taxID))
print ( mount)
Use re.findall with the regex below:
import re
tag = 'URS0000D94775_60169'
tax_id = re.findall(r'\d+$', tag)[0]
print(tax_id)
# 60169
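Separately from the regex, the script in the question fails for two other reasons: `taxID = ()` creates a tuple, which has no `append` method, and `re.findall` is handed the file object rather than the text read from it. A minimal sketch of the whole flow follows, assuming the text holds tags of the form `URS..._<digits>` separated by whitespace. The function name `extract_tax_ids`, the pattern, and the sample text are illustrative assumptions, not part of the original post.

```python
import re

# Assumed tag shape: "URS", any run of characters that are neither "_" nor
# whitespace, one underscore, then the taxon ID digits.
TAG_PATTERN = re.compile(r"\bURS[^_\s]*_(\d+)\b")

def extract_tax_ids(text):
    """Return the taxon ID digits of every URS tag found in text."""
    return TAG_PATTERN.findall(text)

# Hypothetical sample standing in for the contents of the input file.
sample = "URS0000D94775_60169\nURS00001A2B3C_9606 some other text\n"
tax_ids = extract_tax_ids(sample)
print(tax_ids)
```

With a real file, read its contents into a string first (for example with `fh.read()` inside a `with open(...)` block) and pass that string to the function.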
Since your pattern has an optional _ and you want to match the digits after URS, you can use
URS(?:.*?\D)?(\d+)$
import re
s = "URS0000D94775_60169"
print(re.findall(r"URS(?:.*?\D)?(\d+)$", s))
Output
['60169']
If there has to be an underscore present:
URS[^_]*_(\d+)$
import re
s = "URS0000D94775_60169"
print(re.findall(r"URS[^_]*_(\d+)$", s))
Output
['60169']
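The two patterns only differ on tags that lack an underscore. The sketch below compares them side by side, using `re.search` so that a single string (or `None`) comes back instead of a list. The names `LOOSE`, `STRICT`, and `trailing_digits` are hypothetical helpers introduced for illustration.

```python
import re

LOOSE = re.compile(r"URS(?:.*?\D)?(\d+)$")  # underscore optional
STRICT = re.compile(r"URS[^_]*_(\d+)$")     # underscore required

def trailing_digits(pattern, tag):
    """Return the captured trailing digits, or None if the tag does not match."""
    match = pattern.search(tag)
    return match.group(1) if match else None

# The second tag is a made-up variant with the underscore part removed.
for tag in ("URS0000D94775_60169", "URS0000D94775"):
    print(tag, trailing_digits(LOOSE, tag), trailing_digits(STRICT, tag))
```

For the tag without an underscore, the loose pattern still returns the final run of digits, while the strict pattern returns `None`, which may be preferable if a missing taxon ID should be treated as an error.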