Filter information from a txt file using regular expressions
I have a file with information; this is what it looks like:
****ALIGNMENT****
Sequence: gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]
Length: 201
E-value: 2.66576e-82
KYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...
+YLAMKTD+ + +I +D+ E+ A +L+ DA+ LG G GT LKW+A AAIYLLILDRTNW+TNMLT+LL...
EYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL...
Now I want to extract some of this information and store it in variables. I think I should use a regular expression for this, but I don't know how to do that when a single line, such as the second one, holds several pieces of information.
I need the hitsid, protein, organism, and evalue.
The corresponding data:
hitsid = 86755972
protein = cold acclimation protein COR413-PM1
organism = Chimonanthus praecox
evalue = 2.66576e-82
So when I ask for the hitsid, I want Python to print '86755972'.
Could anyone help me with this? Thanks!
Use a regex like:
^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)
See the regex demo.
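For readers who want to see what each part of the pattern does, here is a sketch of the same expression written with re.VERBOSE so that every piece can carry a comment. The variable names `pattern` and `line_block` are only illustrative, and the closing bracket is written as `\]` here, which should match the same text as the bare `]` in the one-line version.

```python
import re

# Same pattern as above, split across lines with re.VERBOSE so it can be commented.
pattern = re.compile(r"""
    ^Sequence:            # a line starting with the Sequence label
    [^|]*\|               # skip everything up to and including the first pipe
    (?P<hitsid>[^|]*)\|   # text between the first and second pipe
    \S*\s*                # skip the rest of the identifier and the following spaces
    (?P<protein>[^][]*?)  # description, lazily, up to the opening bracket
    \s*\[
    (?P<organism>[^][]*)  # text inside the square brackets
    \]
    [\s\S]*?              # lazily skip any lines in between
    \nE-value:\s*
    (?P<evalue>.*)        # the rest of the E-value line
""", re.MULTILINE | re.VERBOSE)

line_block = (
    "Sequence: gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\n"
    "Length: 201\n"
    "E-value: 2.66576e-82\n"
)

m = pattern.search(line_block)
print(m.group('hitsid'), m.group('organism'))
```

With re.MULTILINE, `^` matches at the start of every line, which is what lets the pattern find each Sequence line inside a larger file.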
Sample Python code that collects the values of every match into a list of dictionaries:
import re
p = re.compile(r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)', re.MULTILINE)
s = "****ALIGNMENT****\nSequence: gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\nLength: 201\nE-value: 2.66576e-82\nKYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...\n+YLAMKTD+ + +I +D+ E+ A +L+ DA+ LG G GT LKW+A AAIYLLILDRTNW+TNMLT+LL...\nEYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL..."
res = [m.groupdict() for m in p.finditer(s)]
# Each x is a dict keyed by the named groups of the pattern
for x in res:
    print(x['hitsid'])
    print(x['protein'])
    print(x['organism'])
    print(x['evalue'])
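Since the question asks for the values as variables, the following is a minimal sketch of one way to wrap the pattern in a function and convert the E-value to a number. The names `RECORD`, `parse_alignments` and `sample` are hypothetical and not part of the original answer; for a real file, the text would come from something like `open(path).read()` instead of the inline sample.

```python
import re

# Same pattern as in the answer, split over two string literals for readability.
RECORD = re.compile(
    r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*'
    r'\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)',
    re.MULTILINE,
)

def parse_alignments(text):
    """Return one dict per matched record, with evalue converted to float."""
    records = []
    for m in RECORD.finditer(text):
        rec = m.groupdict()
        rec['evalue'] = float(rec['evalue'])  # e.g. '2.66576e-82' becomes a float
        records.append(rec)
    return records

# Inline sample standing in for the file contents (assumption for this sketch).
sample = (
    "****ALIGNMENT****\n"
    "Sequence: gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\n"
    "Length: 201\n"
    "E-value: 2.66576e-82\n"
)

records = parse_alignments(sample)
first = records[0]
hitsid = first['hitsid']
print(hitsid)
```

Keeping the E-value as a float makes it possible to filter hits by threshold, for example `[r for r in records if r['evalue'] < 1e-10]`.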