
Regex to extract substring from string in Python

I have the following values in two columns. I only want to extract selected values from the "Name" column.

gene 
BCR-ABL (translocation) [HSA:25] [KO:K06619]            MLL-AF4 (translocation) [HSA:4297   4299] [KO:K09186 K15184]            E2A-PBX1 (translocation) [HSA:6929 5087] [KO:K09063 K09355]            TEL-AML1 (translocation) [HSA:861] [KO:K08367]            c-MYC (rearrangement) [HSA:4609] [KO:K04377]            CRLF2 (rearrangement) [HSA:64109] [KO:K05078]            PAX5 (rearrangement) [HSA:5079] [KO:K09383]
(GALAC1) GALT [HSA:2592] [KO:K00965]            (GALAC2) GALK1 [HSA:2584] [KO:K00849]            (GALAC3) GALE [HSA:2582] [KO:K01784]            (GALAC4) GALM [HSA:130589] [KO:K01785]

I am using the following regex in Python to extract them and get the output below. dict['GENE'] holds these values.

pattern1= re.compile('^(.*) \(.* \[HSA')
for gene in re.findall(pattern1, dict['GENE']):
    re.sub("\(.*?\)|\[.*?\]\s+", ' | ', gene)
1 BCR-ABL | ||MLL-AF4 | ||E2A-PBX1 | ||TEL-AML1 | ||c-MYC | ||CRLF2 | ||PAX5
2 | GALT ||| GALK1 ||| GALE ||

The desired output is:

1 BCR-ABL | MLL-AF4 | E2A-PBX1 | TEL-AML1 | c-MYC | CRLF2 | PAX5
2 GALT | GALK1 | GALE | GALM

A clunky approach, but it returns the output you want:

import re

s = '''BCR-ABL (translocation) [HSA:25] [KO:K06619]            MLL-AF4 (translocation) [HSA:4297   4299] [KO:K09186 K15184]            E2A-PBX1 (translocation) [HSA:6929 5087] [KO:K09063 K09355]            TEL-AML1 (translocation) [HSA:861] [KO:K08367]            c-MYC (rearrangement) [HSA:4609] [KO:K04377]            CRLF2 (rearrangement) [HSA:64109] [KO:K05078]            PAX5 (rearrangement) [HSA:5079] [KO:K09383]
(GALAC1) GALT [HSA:2592] [KO:K00965]            (GALAC2) GALK1 [HSA:2584] [KO:K00849]            (GALAC3) GALE [HSA:2582] [KO:K01784]            (GALAC4) GALM [HSA:130589] [KO:K01785]'''
s = s.split('\n')

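for line in s:
    line = re.sub(r'\([^\)]+\)', '', line)   # drop "(...)" groups
    line = re.sub(r'\[[^\]]+\]', '', line)   # drop "[...]" groups
    r = re.sub(r'\s{2,}', ' | ', line)       # turn whitespace runs into separators
    print(r.strip(' |'))                     # trim leftover spaces and pipes at both ends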
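
The same three steps can also be wrapped in a small function that returns a list of names instead of printing. This is only a sketch: the function name `gene_names` and the shortened sample row are assumptions made for illustration, not part of the original answer.

```python
import re

def gene_names(line):
    """Return the gene names in one row, dropping (...) and [...] annotations."""
    # Remove parenthesised and bracketed groups, as in the loop above.
    line = re.sub(r'\([^)]*\)', '', line)
    line = re.sub(r'\[[^\]]*\]', '', line)
    # What remains is names separated by runs of whitespace. Splitting on runs
    # of two or more keeps a name that contains a single space in one piece.
    return [name for name in re.split(r'\s{2,}', line.strip()) if name]

# Shortened sample row (hypothetical excerpt of the data in the question).
row = '(GALAC1) GALT [HSA:2592] [KO:K00965]            (GALAC2) GALK1 [HSA:2584] [KO:K00849]'
print(' | '.join(gene_names(row)))
```

Returning a list leaves the choice of separator and line numbering to the caller.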

It looks like you mainly want to remove the text between brackets:

>>> nobrackets = re.sub(r'(\[|\().*?(\]|\))', '', txt)
>>> print(nobrackets)
gene 
BCR-ABL               MLL-AF4               E2A-PBX1               TEL-AML1               c-MYC               CRLF2               PAX5   
 GALT               GALK1               GALE               GALM  

The regex is quite simple:

(
  \[    # a literal [
  |     # or
  \(    # a literal (
)
.*?     # anything (ungreedy¹)
(
  \]    # a literal ]
  |     # or
  \)    # a literal )
)    

1: https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy
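
To see why the non-greedy `.*?` matters here, the following sketch compares it with a greedy `.*` on a short made-up row. The `sample` string is an assumption for illustration, not data from the question.

```python
import re

# Two entries on one line (hypothetical sample).
sample = 'GALT [HSA:2592] GALK1 [HSA:2584]'

# Greedy: .* runs to the LAST closing bracket, so GALK1 is swallowed too.
greedy = re.sub(r'(\[|\().*(\]|\))', '', sample)

# Non-greedy: .*? stops at the FIRST closing bracket, so each group is
# removed on its own and the names in between survive.
lazy = re.sub(r'(\[|\().*?(\]|\))', '', sample)

print(repr(greedy))  # 'GALT '
print(repr(lazy))    # 'GALT  GALK1 '
```

With the greedy form, everything from the first opening bracket to the last closing bracket on the line is deleted, which is why only the first name would remain.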

Then it's just a matter of cleaning up and formatting:

>>> lines = [ ' | '.join(filter(None, re.split(r'\s+', line))) for line in nobrackets.split('\n') ]
>>> for i, line in enumerate(lines):
...   print(f'{i} {line}')
... 
0 gene
1 BCR-ABL | MLL-AF4 | E2A-PBX1 | TEL-AML1 | c-MYC | CRLF2 | PAX5
2 GALT | GALK1 | GALE | GALM
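
As an alternative to deleting the annotations, the names can also be captured directly, which is closer to what the pattern in the question was attempting. This is a rough sketch under one loud assumption: every name is a single run of non-space characters that is followed, after an optional "(...)" note, by " [HSA". The variable names and the shortened `txt` are illustrative only.

```python
import re

# Shortened sample (hypothetical excerpt of the data in the question).
txt = '''BCR-ABL (translocation) [HSA:25] [KO:K06619]            MLL-AF4 (translocation) [HSA:4297   4299] [KO:K09186 K15184]
(GALAC1) GALT [HSA:2592] [KO:K00965]            (GALAC2) GALK1 [HSA:2584] [KO:K00849]'''

# Assumption: a name is a run of non-space characters, optionally followed
# by a "(...)" note, and then " [HSA". Only the name itself is captured.
name_before_hsa = re.compile(r'(\S+)(?: \([^)]*\))? \[HSA')

rows = [name_before_hsa.findall(line) for line in txt.split('\n')]
for i, names in enumerate(rows, start=1):
    print(i, ' | '.join(names))
```

If some entries lack an `[HSA:...]` tag, this pattern would silently skip them, so the bracket-removal approach above is the safer choice for data that is less regular.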

