Regex to extract substring from string in Python
I have the following values in two columns. I only want to get selected values from the "name" column.
gene
BCR-ABL (translocation) [HSA:25] [KO:K06619] MLL-AF4 (translocation) [HSA:4297 4299] [KO:K09186 K15184] E2A-PBX1 (translocation) [HSA:6929 5087] [KO:K09063 K09355] TEL-AML1 (translocation) [HSA:861] [KO:K08367] c-MYC (rearrangement) [HSA:4609] [KO:K04377] CRLF2 (rearrangement) [HSA:64109] [KO:K05078] PAX5 (rearrangement) [HSA:5079] [KO:K09383]
(GALAC1) GALT [HSA:2592] [KO:K00965] (GALAC2) GALK1 [HSA:2584] [KO:K00849] (GALAC3) GALE [HSA:2582] [KO:K01784] (GALAC4) GALM [HSA:130589] [KO:K01785]
I am using the following regular expression in Python to extract them, and I get the output below. dict['GENE'] holds these values.
pattern1 = re.compile(r'^(.*) \(.* \[HSA')
for gene in re.findall(pattern1, dict['GENE']):
    re.sub(r"\(.*?\)|\[.*?\]\s+", ' | ', gene)
1 BCR-ABL | ||MLL-AF4 | ||E2A-PBX1 | ||TEL-AML1 | ||c-MYC | ||CRLF2 | ||PAX5
2 | GALT ||| GALK1 ||| GALE ||
The desired output is:
1 BCR-ABL | MLL-AF4 | E2A-PBX1 | TEL-AML1 | c-MYC | CRLF2 | PAX5
2 GALT | GALK1 | GALE | GALM
A clunky approach, but it returns the output you want:
import re
s = '''BCR-ABL (translocation) [HSA:25] [KO:K06619] MLL-AF4 (translocation) [HSA:4297 4299] [KO:K09186 K15184] E2A-PBX1 (translocation) [HSA:6929 5087] [KO:K09063 K09355] TEL-AML1 (translocation) [HSA:861] [KO:K08367] c-MYC (rearrangement) [HSA:4609] [KO:K04377] CRLF2 (rearrangement) [HSA:64109] [KO:K05078] PAX5 (rearrangement) [HSA:5079] [KO:K09383]
(GALAC1) GALT [HSA:2592] [KO:K00965] (GALAC2) GALK1 [HSA:2584] [KO:K00849] (GALAC3) GALE [HSA:2582] [KO:K01784] (GALAC4) GALM [HSA:130589] [KO:K01785]'''
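s = s.split('\n')
for line in s:
    # remove every "(...)" group
    line = re.sub(r'\([^\)]+\)', '', line)
    # remove every "[...]" group
    line = re.sub(r'\[[^\]]+\]', '', line)
    # the gaps left behind are runs of 2+ spaces; turn each into a separator
    r = re.sub(r'\s{2,}', ' | ', line)
    # trim stray spaces and separators from both ends
    print(r.strip(' |'))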
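The stacked separators in the question's output appear because every bracketed group is replaced by its own ' | '. A minimal sketch of an alternative, assuming the groups are never nested and that gene names do not themselves begin or end with '|': match a whole run of adjacent groups at once and replace the run with a single separator. The names GROUP_RUN and join_names are made up for this illustration.

```python
import re

# Optional whitespace, then one or more "(...)" or "[...]" groups,
# each followed by optional whitespace: the whole run is one match.
GROUP_RUN = re.compile(r'\s*(?:(?:\([^)]*\)|\[[^\]]*\])\s*)+')

def join_names(line):
    """Replace each run of bracketed groups with one ' | ' separator."""
    # Runs at the start or end of the line leave a dangling separator,
    # so trim spaces and pipes from both ends afterwards.
    return GROUP_RUN.sub(' | ', line).strip(' |')

line1 = 'BCR-ABL (translocation) [HSA:25] [KO:K06619] MLL-AF4 (translocation) [HSA:4297 4299] [KO:K09186 K15184]'
line2 = '(GALAC1) GALT [HSA:2592] [KO:K00965] (GALAC2) GALK1 [HSA:2584] [KO:K00849]'
print(join_names(line1))
print(join_names(line2))
```

This does the job in one substitution instead of three, at the cost of a denser pattern.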
It looks like you mainly want to get rid of the text between the brackets:
>>> nobrackets = re.sub(r'(\[|\().*?(\]|\))', '', txt)
>>> print(nobrackets)
gene
BCR-ABL MLL-AF4 E2A-PBX1 TEL-AML1 c-MYC CRLF2 PAX5
GALT GALK1 GALE GALM
The regular expression is quite simple:
(
    \[    # a literal [
    |     # or
    \(    # a literal (
)
.*?       # anything (non-greedy¹)
(
    \]    # a literal ]
    |     # or
    \)    # a literal )
)
1: https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy
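The commented layout above can be compiled almost as written with re.VERBOSE, which ignores whitespace and # comments inside the pattern. The sketch below also compiles a greedy variant to show why the ? after .* matters. The names BRACKETED and GREEDY and the sample string are made up for this illustration.

```python
import re

BRACKETED = re.compile(r'''
    (\[|\()   # a literal [ or (
    .*?       # anything, as few characters as possible
    (\]|\))   # a literal ] or )
''', re.VERBOSE)

# Same pattern without the "?": .* grabs as much as it can.
GREEDY = re.compile(r'(\[|\().*(\]|\))')

sample = 'A (x) B [y] C'
# Non-greedy: each group is removed on its own, so B survives.
print(BRACKETED.sub('', sample).split())
# Greedy: one match runs from the first "(" to the last "]", taking B with it.
print(GREEDY.sub('', sample).split())
```

Note that the pattern does not check that the brackets pair up, so a "(" closed by a "]" would also match. That does not matter for the data in the question.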
After that, it is just a matter of cleanup and formatting:
>>> lines = [' | '.join(filter(None, re.split(r'\s+', line))) for line in nobrackets.split('\n')]
>>> for i, line in enumerate(lines):
...     print(f'{i} {line}')
...
0 gene
1 BCR-ABL | MLL-AF4 | E2A-PBX1 | TEL-AML1 | c-MYC | CRLF2 | PAX5
2 GALT | GALK1 | GALE | GALM
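For reuse, the two steps can be folded into one function. This is only a sketch: format_rows is a made-up name, the sample is a shortened version of the question's data, and it assumes names contain no whitespace.

```python
import re

def format_rows(txt):
    """Strip bracketed groups, then number each line and join its names."""
    # [\[(] and [\])] are character-class spellings of (\[|\() and (\]|\)).
    no_brackets = re.sub(r'[\[(].*?[\])]', '', txt)
    rows = []
    for i, line in enumerate(no_brackets.split('\n')):
        # str.split() with no argument splits on whitespace runs and never
        # yields empty strings, so no filter step is needed.
        names = ' | '.join(line.split())
        rows.append(f'{i} {names}')
    return rows

# Shortened sample in the same layout as the question's data.
sample = 'gene\n(GALAC1) GALT [HSA:2592] [KO:K00965] (GALAC4) GALM [HSA:130589] [KO:K01785]'
for row in format_rows(sample):
    print(row)
```

Because . does not match a newline by default, a bracketed group can never swallow text across two rows.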