
Regular Expression Error During Sequence Search

Thanks in advance. I am trying to extract the last few digits of a tag, which are a taxon ID from NCBI. I want the digits after the underscore in this tag; they vary in both length and value:

tag: URS0000D94775_60169

code:

import re
taxID = ()
# strip accession numbers into string
mount = open('mount.txrt', 'r')
accessions = re.findall("URS\S{6}", mount)
for i in accessions:
    taxID.append(i)
# parse taxa IDs from string
taxas = ()
taxas.append(re.findall('\_?\d+', taxID))
print(mount)

Use re.findall with the regex below:

import re
tag = 'URS0000D94775_60169'
tax_id = re.findall(r'\d+$', tag)[0]
print(tax_id)
# 60169
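
To apply the same idea to many tags at once, as in the question, two things need to change: re.findall expects a string rather than a file object, and results should be collected in a list, since tuples have no append method. A minimal sketch follows, assuming the tags are separated by whitespace and that the accession part consists of digits and uppercase letters; the helper name extract_tax_ids and the sample text are made up for illustration.

```python
import re

# Assumed tag shape: "URS", an accession of digits/uppercase letters,
# an underscore, then the taxon ID as trailing digits.
TAG_PATTERN = re.compile(r"\bURS[0-9A-Z]+_(\d+)\b")

def extract_tax_ids(text):
    """Return the digits after the underscore for every URS tag in text."""
    return TAG_PATTERN.findall(text)

# Illustrative input; with a real file, read its contents first, e.g.
#   with open(path) as fh: text = fh.read()
sample = "URS0000D94775_60169 URS0000A1B2C3_9606\nURS00001234AB_10090"
tax_ids = extract_tax_ids(sample)
print(tax_ids)
```

Only the first tag in the sample comes from the question; the other two are hypothetical.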

Since the underscore in your pattern is optional and you want the digits at the end of a tag that starts with URS, you can use:

URS(?:.*?\D)?(\d+)$

import re
s = "URS0000D94775_60169"
print(re.findall(r"URS(?:.*?\D)?(\d+)$", s))

Output

['60169']
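
To see what the optional group does, the sketch below runs the same pattern on a few tags. Only the first tag comes from the question; the others are made-up inputs, and the helper name trailing_digits is illustrative.

```python
import re

PATTERN = re.compile(r"URS(?:.*?\D)?(\d+)$")

def trailing_digits(tag):
    """Return the digits at the end of the tag, or None if there are none."""
    match = PATTERN.search(tag)
    return match.group(1) if match else None

for tag in ("URS0000D94775_60169", "URS60169", "URS0000D94775_", "URS0000D94775"):
    print(tag, "->", trailing_digits(tag))
```

Note the last case: a tag with no underscore still matches, and the captured digits (94775) are the tail of the accession itself rather than a taxon ID.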

If there has to be an underscore present:

URS[^_]*_(\d+)$

import re
s = "URS0000D94775_60169"
print(re.findall(r"URS[^_]*_(\d+)$", s))

Output

['60169']
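
When checking one tag at a time, re.search makes the no-match case explicit instead of returning an empty list. A small sketch using the underscore-required pattern; the function name tax_id_or_none and the second and third inputs are illustrative only.

```python
import re

REQUIRED_UNDERSCORE = re.compile(r"URS[^_]*_(\d+)$")

def tax_id_or_none(tag):
    """Return the digits after the underscore, or None if the tag does not fit."""
    match = REQUIRED_UNDERSCORE.search(tag)
    return match.group(1) if match else None

print(tax_id_or_none("URS0000D94775_60169"))   # tag from the question
print(tax_id_or_none("URS0000D94775"))         # no underscore
print(tax_id_or_none("URS0000D94775_60169x"))  # trailing non-digit
```

Returning None lets the caller decide how to handle malformed tags, rather than failing on an index into an empty list.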

