Extract strings using regular expressions

Question

I have the following strings:

LOW QUALITY PROTEIN: cysteine proteinase 5-like [Solanum pennellii]
PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
hypothetical protein VITISV_035070 [Vitis vinifera]

How can I extract the strings below from the strings above?

cysteine proteinase 5-like
uncharacterized protein LOC107059219
peroxidase 40-like
Retrovirus-related Pol polyprotein from transposon TNT 1-94
hypothetical protein VITISV_035070

Answer 1

s = '''LOW QUALITY PROTEIN: cysteine proteinase 5-like  [Solanum pennellii]
PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
hypothetical protein VITISV_035070 [Vitis vinifera]'''

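import re

# Raw string so that \s, \w and \[ reach the regex engine unchanged.
# Group 1: optional colon; group 2: the description; group 3: the [species] part.
rgx = r'(:?)\s([\w\s-]+)\s(\[.+\])'

# findall returns one tuple per match; index 1 is the description group.
list1 = [m[1] for m in re.findall(rgx, s)]

print(list1)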

Output

['cysteine proteinase 5-like ',
 'uncharacterized protein LOC107059219',
 'peroxidase 40-like',
 'Retrovirus-related Pol polyprotein from transposon TNT 1-94',
 'hypothetical protein VITISV_035070']

See https://regex101.com/r/HATKMa/1 for a detailed explanation.
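
For readers who prefer to process one line at a time, here is a minimal sketch of an anchored, commented pattern. It is not the pattern used in this answer. The helper name `extract_names` is hypothetical, and the idea that an accession is a leading token ending in a dot and digits (as in `XP_019244624.1`) is an assumption inferred only from the five sample lines.

```python
import re

SAMPLE = '''LOW QUALITY PROTEIN: cysteine proteinase 5-like  [Solanum pennellii]
PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]
XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]
RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]
hypothetical protein VITISV_035070 [Vitis vinifera]'''

# Assumed line layout: [accession] [UPPER-CASE TAG:]* description [Species name]
PATTERN = re.compile(
    r"""
    ^(?:\S+\.\d+\s+)?      # optional accession, e.g. XP_019244624.1 (assumed format)
    (?:[A-Z ]+:\s*)*       # zero or more upper-case tags that end with a colon
    (?P<name>.+?)          # the description, matched lazily
    \s*\[[^\]]+\]\s*$      # the trailing [Species name]
    """,
    re.VERBOSE,
)

def extract_names(text):
    """Return the description from every line that fits the assumed layout."""
    names = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:  # lines that do not fit the layout are skipped
            names.append(match.group("name"))
    return names

print(extract_names(SAMPLE))
```

Because the pattern is anchored to a single line and the description group is lazy, surrounding whitespace (such as the double space in the first sample line) stays out of the captured name.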

Answer 2

I don't think this problem needs a regex. I would prefer the following solution because it is easy to understand:

st = "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]"
st.split(":")[-1].split("[")[0].strip()
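
Applied to the other sample lines, the same split chain keeps the accession on the `RVW92024.1 Retrovirus-related ...` line, because that line contains no colon. The sketch below wraps the chain in a function and drops a leading accession. The function name `extract_description` and the rule "a first word containing a dot is an accession" are assumptions made for illustration, not part of the original answer.

```python
def extract_description(line):
    """Split-based extraction, following the chain shown in this answer."""
    # Keep the text after the last colon, then cut off the "[Species]" part.
    text = line.split(":")[-1].split("[")[0].strip()
    # Assumption: a first word containing a dot (e.g. "RVW92024.1")
    # is an accession number rather than part of the description.
    first, _, rest = text.partition(" ")
    if "." in first and rest:
        text = rest.strip()
    return text

lines = [
    "LOW QUALITY PROTEIN: cysteine proteinase 5-like [Solanum pennellii]",
    "PREDICTED: LOW QUALITY PROTEIN: uncharacterized protein LOC107059219 [Solanum pennellii]",
    "XP_019244624.1 PREDICTED: peroxidase 40-like [Nicotiana attenuata]",
    "RVW92024.1 Retrovirus-related Pol polyprotein from transposon TNT 1-94 [Vitis vinifera]",
    "hypothetical protein VITISV_035070 [Vitis vinifera]",
]

for line in lines:
    print(extract_description(line))
```

This stays regex-free, but it relies on descriptions never containing a colon or an opening square bracket, which holds for the five samples and may not hold in general.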


Answer 1
0 2019-09-20 01:09:50

Answer 2
0 2019-09-20 01:46:23
