
Extracting strings using regular expressions

I have the following strings:

  1. LOW QUALITY PROTEIN: cysteine proteinase 5-like [Solanum pennellii]
  2. PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
  3. XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
  4. RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
  5. hypothetical protein VITISV_035070 [Vitis vinifera]

How can I extract the following substrings from the strings above?

  1. cysteine proteinase 5-like
  2. uncharacterized protein LOC107059219
  3. peroxidase 40-like
  4. Retrovirus-related Pol polyprotein from transposon TNT 1-94
  5. hypothetical protein VITISV_035070

import re

s = '''LOW QUALITY PROTEIN: cysteine proteinase 5-like  [Solanum pennellii]
PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
hypothetical protein VITISV_035070 [Vitis vinifera]'''

# Group 1: an optional colon; group 2: the description; group 3: the bracketed species name.
# A raw string keeps \s and \[ from being treated as string escapes.
rgx = r'(:?)\s([\w\s-]+)\s(\[.+\])'

list1 = []
for m in re.findall(rgx, s):
    # With three groups, findall yields 3-tuples; index 1 is the description.
    list1.append(m[1])

print(list1)

Output

['cysteine proteinase 5-like ',
 'uncharacterized protein LOC107059219',
 'peroxidase 40-like',
 'Retrovirus-related Pol polyprotein from transposon TNT 1-94',
 'hypothetical protein VITISV_035070']

See https://regex101.com/r/HATKMa/1 for a detailed explanation of the pattern.
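For readers who cannot open the link, here is a rough sketch of the same pattern written with re.VERBOSE so that each piece carries a comment. The names `annotated` and `descriptions` are made up for this illustration, and the `.strip()` call is an addition that removes the trailing space seen in the first result.

```python
import re

s = '''LOW QUALITY PROTEIN: cysteine proteinase 5-like  [Solanum pennellii]
PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
hypothetical protein VITISV_035070 [Vitis vinifera]'''

# Same pattern as above, spread over several lines (hypothetical name: annotated).
annotated = re.compile(r"""
    (:?)          # group 1: an optional colon
    \s            # one whitespace character (a space, or the newline before a line)
    ([\w\s-]+)    # group 2: the description (word characters, whitespace, hyphens)
    \s            # one whitespace character before the bracket
    (\[.+\])      # group 3: the bracketed species name
""", re.VERBOSE)

# strip() is an addition here: it drops the stray space left by the double space in line 1.
descriptions = [m.group(2).strip() for m in annotated.finditer(s)]
print(descriptions)

# Caveat: the pattern needs a whitespace character before the description.
# On a lone string with no colon and no preceding newline, the first word is skipped.
lone = annotated.search("hypothetical protein VITISV_035070 [Vitis vinifera]")
print(lone.group(2))
```

The last two lines show a limitation worth knowing about: in the multi-line string the newline before "hypothetical" satisfies the leading `\s`, but when that string is matched alone the result is only "protein VITISV_035070".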

I don't think this problem needs a regex. I would prefer the following solution because it is easier to understand:

st = "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]"
# Keep the text after the last colon, drop the bracketed species name, and trim whitespace.
print(st.split(":")[-1].split("[")[0].strip())
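One thing to watch with the split-based approach: string 4 has no colon, so the accession ID "RVW92024.1" stays in the result. A minimal sketch of one way to handle that without a regex follows. The helper names `looks_like_accession` and `extract_description` are hypothetical, and the rule that an accession is a first token ending in a dot plus version digits is an assumption based only on the five sample strings.

```python
def looks_like_accession(token):
    # Assumption: accession IDs such as "RVW92024.1" end in ".<version digits>".
    head, dot, version = token.rpartition(".")
    return bool(dot) and head != "" and version.isdigit()


def extract_description(line):
    # Same steps as the one-liner: text after the last colon, before the first "[".
    text = line.split(":")[-1].split("[")[0].strip()
    # If the first token looks like an accession ID, drop it.
    first, _, rest = text.partition(" ")
    if rest and looks_like_accession(first):
        text = rest.strip()
    return text


lines = [
    "LOW QUALITY PROTEIN: cysteine proteinase 5-like [Solanum pennellii]",
    "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]",
    "XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]",
    "RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]",
    "hypothetical protein VITISV_035070 [Vitis vinifera]",
]

results = [extract_description(line) for line in lines]
for description in results:
    print(description)
```

This keeps the readability of the split version while covering all five inputs; a description that itself begins with something like "1.2" would be misread as an accession, so the check may need tightening for real data.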

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address.

 