Filtering information from a txt file with a regular expression

Question

I have a file with information; this is what it looks like:

****ALIGNMENT****
Sequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]
Length:  201
E-value:  2.66576e-82
KYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...
+YLAMKTD+ +   +I +D+ E+  A  +L+ DA+ LG  G GT  LKW+A  AAIYLLILDRTNW+TNMLT+LL...
EYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL...

Now I want to extract some of this information and store it in variables. I think I should use a regular expression for this, but I don't know how to do that when a single line, such as the second one, holds several pieces of information.

I need the hitsid, protein, organism, and evalue.

The corresponding data:

hitsid = 86755972
protein = cold acclimation protein COR413-PM1
organism = Chimonanthus praecox
evalue = 2.66576e-82

So when I ask for the hitsid, I want Python to print '86755972'.

Could anyone help me with this? Thanks!

Answer 1

Use a regex like this:

^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)

See the regex demo.
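To make the pattern easier to follow, here is a sketch of the same regex written with re.VERBOSE so each piece can carry a comment. The name PATTERN and the comments are my own additions, and the closing bracket is written as `\]` instead of a bare `]`, which matches the same character.

```python
import re

PATTERN = re.compile(r"""
    ^Sequence:[^|]*\|            # start of line, "Sequence:", skip to the first pipe
    (?P<hitsid>[^|]*)\|          # hitsid: text between the first and second pipe
    \S*\s*                       # skip the rest of the identifier ("gb|ABD15130.1|")
    (?P<protein>[^][]*?)\s*      # protein: text before the opening bracket, trimmed
    \[(?P<organism>[^][]*)\]     # organism: text inside the square brackets
    [\s\S]*?\nE-value:\s*        # lazily skip ahead to the next "E-value:" line
    (?P<evalue>.*)               # evalue: the rest of that line
""", re.MULTILINE | re.VERBOSE)

sample = (
    "****ALIGNMENT****\n"
    "Sequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\n"
    "Length:  201\n"
    "E-value:  2.66576e-82\n"
)
match = PATTERN.search(sample)
print(match.group("hitsid"))
```

Because whitespace is ignored in verbose mode, this only works as long as the pattern has no literal spaces outside character classes, which is the case here.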

Sample Python code that collects the values of every match into a list of dictionaries:

import re
p = re.compile(r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)', re.MULTILINE)
s = "****ALIGNMENT****\nSequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\nLength:  201\nE-value:  2.66576e-82\nKYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...\n+YLAMKTD+ +   +I +D+ E+  A  +L+ DA+ LG  G GT  LKW+A  AAIYLLILDRTNW+TNMLT+LL...\nEYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL..."
res = [m.groupdict() for m in p.finditer(s)]
for x in res:
    print(x['hitsid'])
    print(x['protein'])
    print(x['organism'])
    print(x['evalue'])
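Since the question is about a text file rather than a string, the same approach could be applied to a file roughly as follows. This is only a sketch: the function name parse_alignments and the file name alignments.txt are hypothetical, and the demo writes the header lines from the question to a temporary file so that it runs on its own.

```python
import re
import tempfile
from pathlib import Path

HIT = re.compile(
    r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*'
    r'\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)',
    re.MULTILINE,
)

def parse_alignments(path):
    """Return one dict per alignment block found in the file at `path`."""
    text = Path(path).read_text()
    return [m.groupdict() for m in HIT.finditer(text)]

# Demo data: the header lines of the alignment block from the question.
sample = (
    "****ALIGNMENT****\n"
    "Sequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\n"
    "Length:  201\n"
    "E-value:  2.66576e-82\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "alignments.txt"  # hypothetical file name
    path.write_text(sample)
    hits = parse_alignments(path)

first = hits[0]
hitsid = first["hitsid"]           # kept as a string
evalue = float(first["evalue"])    # converted so it can be compared numerically
print(hitsid)  # prints 86755972
```

Each dictionary holds the four values as strings; converting evalue to float is optional and only useful if the hits need to be sorted or filtered by it.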
