Python regex: capture the text between two words as a string, then append it to a list

Question

This is the structure of the txt file (repeated units of CDS-text-ORIGIN):

     CDS             311..>428
                     /gene="PNR"
                     /codon_start=1
                     /product="photoreceptor-specific nuclear receptor"
                     /protein_id="AAD28302.1"
                     /db_xref="GI:4726077"
                     /translation="METRPTALMSSTVAAAAPAAGAASRKESPGRWGLGEDPT"
 ORIGIN

I want to pull out the text from 311..>428 to GEDPT" as a string. The regex I have so far is:

compiler = re.compile(r"^\s+CDS\s+(.+)ORIGIN.+", re.DOTALL|re.MULTILINE)

I then use a loop to add each string to a list:

for line in file:
    match = compiler.match(line)
    if match:
        list.append(str(match.group(1)))

But I keep getting an empty list! Any ideas why?

Help would be much appreciated, I'm new to this!

Answer 1

I am assuming that file is a file object, such as file = open('filename.txt'). If that is the case, then using:

for line in file:

will yield the file one line at a time, split on the newline character. So the first three lines will be:

1: '     CDS             311..>428\n'
2: '                     /gene="PNR"\n'
3: '                     /codon_start=1\n'
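
This can be checked with a minimal sketch that applies the question's pattern to each line in turn. The sample string below is an abbreviated stand-in for the real file, and the variable names are illustrative assumptions:

```python
import re

# The pattern from the question, unchanged.
compiler = re.compile(r"^\s+CDS\s+(.+)ORIGIN.+", re.DOTALL | re.MULTILINE)

# Abbreviated stand-in for the file contents (assumption for illustration).
sample = (
    '     CDS             311..>428\n'
    '                     /gene="PNR"\n'
    ' ORIGIN\n'
)

# Iterating over a file object yields lines like these, newline included.
lines = sample.splitlines(keepends=True)

# No single line contains both "CDS" and "ORIGIN", so every match is None.
per_line_matches = [compiler.match(line) for line in lines]
print(per_line_matches)
```

Since none of the lines matches on its own, the if match: branch in the question's loop never runs and the list stays empty.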

Because each line is separate, you will not match the multiline pattern unless you combine the lines. You may want to consider using:

import re

compiler = re.compile(r"^\s+CDS\s+(.+?)ORIGIN", re.DOTALL | re.MULTILINE)
with open('filename.txt') as fp:
    all_text = fp.read()               # read the whole file as one string, newlines included
matches = compiler.findall(all_text)   # list of the captured text from every CDS...ORIGIN unit
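
As a rough, self-contained illustration of the same approach, the sketch below runs the pattern over an in-memory string instead of a file. The second CDS unit (gene "XYZ") and the shortened translation are made up for the example, and the strip() call is an optional extra that is not part of the answer above:

```python
import re

# Stand-in for fp.read(): two CDS...ORIGIN units, abbreviated.
# The second unit (gene "XYZ") is hypothetical, added only to show repetition.
all_text = (
    '     CDS             311..>428\n'
    '                     /gene="PNR"\n'
    '                     /translation="METRPT"\n'
    ' ORIGIN\n'
    '        1 acgt\n'
    '     CDS             10..50\n'
    '                     /gene="XYZ"\n'
    ' ORIGIN\n'
)

compiler = re.compile(r"^\s+CDS\s+(.+?)ORIGIN", re.DOTALL | re.MULTILINE)

# The non-greedy (.+?) stops at the first ORIGIN after each CDS, so each
# unit is captured on its own. strip() removes the trailing newline and
# the single space that precedes ORIGIN.
records = [text.strip() for text in compiler.findall(all_text)]

for record in records:
    print(repr(record))
```

With a greedy (.+) the single match would instead run from the first CDS to the last ORIGIN in the text, which is why the answer switches to (.+?).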

Answer 1 (accepted), 2017-03-10 01:08:03
