Extract categories from text using regular expression

I'm new to using regular expressions in Python. I'm having trouble figuring out how to do the following:

I have a bunch of text descriptions, stored as strings, that look like this:

FX0XST001ALF89  OLIGO: Bacillus_cand1=ATGCGGTTCAAAATGTTATC      
FILE:/home/AAFC-AAC/fungs/biodiversity/pipelines/454PipelineOutput/v7_newest_testrun_full/rs75/plate1/FX0XST001.MID13/FX0XST001.MID13.sff.trim.fasta    
Project: SAGES  SFF: FX0XST001  SFF.MID: FX0XST001.MID13    
Plate: 1.1     MID_all: MID13   MID: 13 Sample: BK104   
Collector: BK   Year: 2008  Week:   Year_Week:  
Location: Ottawa_ON     City: Ottawa    Province: ON    Crop:   
Treatment:    Substrate_all: Air    Substrate: Air  Target: Bacteria    
Forward Primer: Bac16S27F   Reverse Primer: Bac16S690R  Taq: T

I'd like to be able to extract the categories in this big string and store them in a database, for example:

Year: 2008
Sample: BK104
Collector: BK

etc...

How can I use regular expressions in Python to accomplish this?

I was thinking of using search:

match = re.search(r'Sample:\w\w\w\w\w', theTextDescription)

The problem is that the text in each "field" has a different length, and I don't know how to account for that.

Something like this: you can use \w+ to match a run of word characters of any length (at least one):

In [37]: strs
Out[37]: 'FX0XST001ALF89  OLIGO: Bacillus_cand1=ATGCGGTTCAAAATGTTATC      \nFILE:/home/AAFC-AAC/fungs/biodiversity/pipelines/454PipelineOutput/v7_newest_testrun_full/rs75/plate1/FX0XST001.MID13/FX0XST001.MID13.sff.trim.fasta    \nProject: SAGES  SFF: FX0XST001  SFF.MID: FX0XST001.MID13    \nPlate: 1.1     MID_all: MID13   MID: 13 Sample: BK104   \nCollector: BK   Year: 2008  Week:   Year_Week:  \nLocation: Ottawa_ON     City: Ottawa    Province: ON    Crop:   \nTreatment:    Substrate_all: Air    Substrate: Air  Target: Bacteria    \nForward Primer: Bac16S27F   Reverse Primer: Bac16S690R  Taq: T'

In [38]: re.findall(r"\w+:\s\w+",strs)
Out[38]: 
['OLIGO: Bacillus_cand1',
 'Project: SAGES',
 'SFF: FX0XST001',
 'MID: FX0XST001',
 'Plate: 1',
 'MID_all: MID13',
 'MID: 13',
 'Sample: BK104',
 'Collector: BK',
 'Year: 2008',
 'Location: Ottawa_ON',
 'City: Ottawa',
 'Province: ON',
 'Substrate_all: Air',
 'Substrate: Air',
 'Target: Bacteria',
 'Primer: Bac16S27F',
 'Primer: Bac16S690R',
 'Taq: T']
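
If the key and the value are needed separately, one option is to add capturing groups, in which case `re.findall` returns tuples instead of whole matches. The sketch below assumes a shortened stand-in for the description string; it is only an illustration of the same pattern with groups added.

```python
import re

# Shortened stand-in for the description string (assumed for illustration).
strs = ("Plate: 1.1     MID_all: MID13   MID: 13 Sample: BK104   \n"
        "Collector: BK   Year: 2008  Week:   Year_Week:  ")

# Two capturing groups make findall return (key, value) tuples
# rather than the full matched text.
pairs = re.findall(r"(\w+):\s(\w+)", strs)

for key, value in pairs:
    print(key, "=", value)
```

Because the whitespace after the colon sits outside both groups, the values come back without a leading space. Note that `\w` does not match `.`, so `Plate: 1.1` still yields only `1`.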

Or you can store the result in a dictionary:

In [39]: dict(x.split(":") for x in  re.findall(r"\w+:\s\w+",strs))
Out[39]: 
{'City': ' Ottawa',
 'Collector': ' BK',
 'Location': ' Ottawa_ON',
 'MID': ' 13',
 'MID_all': ' MID13',
 'OLIGO': ' Bacillus_cand1',
 'Plate': ' 1',
 'Primer': ' Bac16S690R',
 'Project': ' SAGES',
 'Province': ' ON',
 'SFF': ' FX0XST001',
 'Sample': ' BK104',
 'Substrate': ' Air',
 'Substrate_all': ' Air',
 'Taq': ' T',
 'Target': ' Bacteria',
 'Year': ' 2008'}
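
The dictionary above has a few rough edges: every value keeps a leading space, `Plate` is cut to `1`, the `SFF.MID` value is lost because its key collapses to `MID`, and `Forward Primer` and `Reverse Primer` both collapse to `Primer`. A possible refinement, offered as a sketch rather than a definitive parser, is to allow dots and single spaces inside keys and to take any run of non-whitespace as the value. The pattern and the name `FIELD` are assumptions made for this example, and the pattern relies on empty fields being followed by at least two blanks, as they are in this sample.

```python
import re

strs = (
    "FX0XST001ALF89  OLIGO: Bacillus_cand1=ATGCGGTTCAAAATGTTATC      \n"
    "FILE:/home/AAFC-AAC/fungs/biodiversity/pipelines/454PipelineOutput/"
    "v7_newest_testrun_full/rs75/plate1/FX0XST001.MID13/"
    "FX0XST001.MID13.sff.trim.fasta    \n"
    "Project: SAGES  SFF: FX0XST001  SFF.MID: FX0XST001.MID13    \n"
    "Plate: 1.1     MID_all: MID13   MID: 13 Sample: BK104   \n"
    "Collector: BK   Year: 2008  Week:   Year_Week:  \n"
    "Location: Ottawa_ON     City: Ottawa    Province: ON    Crop:   \n"
    "Treatment:    Substrate_all: Air    Substrate: Air  Target: Bacteria    \n"
    "Forward Primer: Bac16S27F   Reverse Primer: Bac16S690R  Taq: T"
)

# Key:   words made of word characters or dots, joined by single spaces.
# Value: an optional single space after the colon, then non-whitespace.
# Empty fields ("Week:", "Crop:") are followed by several blanks here,
# so they do not match and are simply left out.
FIELD = re.compile(r"([\w.]+(?: [\w.]+)*):[ ]?(\S+)")

fields = dict(FIELD.findall(strs))

print(fields["Plate"])           # 1.1
print(fields["Forward Primer"])  # Bac16S27F
print(fields["SFF.MID"])         # FX0XST001.MID13
```

The resulting dictionary can then be passed to whatever database layer is in use, one key per column or one row per key/value pair.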

Make use of the quantifiers of the regular expression language:

? = 0 or 1

* = 0 or more

+ = 1 or more

match = re.search(r'Sample:\s\w+', theTextDescription)
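
To show how the choice of quantifier changes the result, here is a minimal sketch that wraps the value in a capturing group. The variable name `the_text_description` and the shortened sample string are assumptions made for the example.

```python
import re

# Shortened stand-in for one description string (assumed for illustration).
the_text_description = (
    "MID: 13 Sample: BK104   \n"
    "Collector: BK   Year: 2008  Week:   Year_Week:  "
)

# \w+ means one or more word characters, so the value may have any length.
match = re.search(r"Sample:\s(\w+)", the_text_description)
sample = match.group(1) if match else None

year = re.search(r"Year:\s(\w+)", the_text_description).group(1)

# \s* (zero or more) is too permissive for empty fields: it skips the
# blanks after "Week:" and captures the next label instead of a value.
greedy = re.search(r"Week:\s*(\w+)", the_text_description).group(1)

# Exactly one \s leaves the empty field unmatched, so search returns None.
strict = re.search(r"Week:\s(\w+)", the_text_description)

print(sample, year, greedy, strict)  # BK104 2008 Year_Week None
```

Checking the result of `re.search` for `None` before calling `.group()` avoids an `AttributeError` when a field is missing or empty.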
