Python regex - capture text between two words as string, then append to list
This is the structure of the txt file (repeated units of CDS-text-ORIGIN):
CDS 311..>428
/gene="PNR"
/codon_start=1
/product="photoreceptor-specific nuclear receptor"
/protein_id="AAD28302.1"
/db_xref="GI:4726077"
/translation="METRPTALMSSTVAAAAPAAGAASRKESPGRWGLGEDPT"
ORIGIN
I want to pull out the text from 311..>428 to GEDPT" as a string. The regex I have so far is:
compiler = re.compile(r"^\s+CDS\s+(.+)ORIGIN.+", re.DOTALL|re.MULTILINE)
I then use a loop to add each string to a list:
for line in file:
    match = compiler.match(line)
    if match:
        list.append(str(match.group(1)))
But I keep getting an empty list! Any ideas why?
Help would be much appreciated, I'm new to this!
I am assuming that file is a file pointer, such as file = open('filename.txt'). If that is the case, then using:
for line in file:
will yield the file one line at a time, split on the newline character. So the first three lines will be:
1: ' CDS 311..>428\n'
2: ' /gene="PNR"\n'
3: ' /codon_start=1\n'
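A minimal sketch of this behaviour, using the pattern from the question. The `sample` string and the `io.StringIO` stand-in for the real file are assumptions made for illustration, as are the names `fake_file` and `matches`:

```python
import io
import re

# The pattern from the question, unchanged.
compiler = re.compile(r"^\s+CDS\s+(.+)ORIGIN.+", re.DOTALL | re.MULTILINE)

# Hypothetical stand-in for the real file; io.StringIO iterates line by line
# just like an object returned by open().
sample = (
    '     CDS             311..>428\n'
    '                     /gene="PNR"\n'
    '                     /codon_start=1\n'
    'ORIGIN\n'
)
fake_file = io.StringIO(sample)

matches = []
for line in fake_file:
    # Each `line` holds a single line, so "CDS" and "ORIGIN" never appear
    # together in the string being matched.
    match = compiler.match(line)
    if match:
        matches.append(match.group(1))

print(matches)  # []
```

No single line contains both CDS and ORIGIN, so `compiler.match` returns None on every iteration and nothing is appended.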
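Because each line is separate, you will not match the multiline pattern unless you combine the lines. You may want to consider reading the whole file into one string and making the capture group non-greedy ((.+?)), so that each match stops at the nearest ORIGIN rather than the last one in the file: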
import re

compiler = re.compile(r"^\s+CDS\s+(.+?)ORIGIN", re.DOTALL|re.MULTILINE)
with open('filename.txt') as fp:
    all_text = fp.read()  # reads all the text as one string, without splitting on newlines
matches = compiler.findall(all_text)  # a list of the captured text of every match
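As a rough check of the approach above without needing a file on disk, the sketch below runs the same pattern over an in-memory string. The first record is taken from the question; the second record (gene "ABC"), the sequence line after ORIGIN, the indentation, and the names `all_text` and `records` are invented for illustration:

```python
import re

compiler = re.compile(r"^\s+CDS\s+(.+?)ORIGIN", re.DOTALL | re.MULTILINE)

# Two CDS...ORIGIN units. The second one is made-up data, not from the real file.
all_text = (
    '     CDS             311..>428\n'
    '                     /gene="PNR"\n'
    '                     /translation="METRPTALMSSTVAAAAPAAGAASRKESPGRWGLGEDPT"\n'
    'ORIGIN\n'
    '        1 atggagacca\n'
    '     CDS             10..>50\n'
    '                     /gene="ABC"\n'
    '                     /translation="MKV"\n'
    'ORIGIN\n'
)

# findall returns the contents of the single capture group for each match;
# strip() drops the trailing newline left in front of "ORIGIN".
records = [m.strip() for m in compiler.findall(all_text)]

print(len(records))
print(records[0])
```

Each element of `records` is one string running from the location (for example 311..>428) to the closing quote of the translation, which is what the question asks for.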