Extract lines containing a specific word

Question

Input:

ID   aa
AA   Homo sapiens
DR   ac
BB   ad
FT   ae
//
ID   ba
AA   mouse
DR   bc
BB   bd
FT   be
//
ID   ca
AA   Homo sapiens
DR   cc
BB   cd
FT   ce
//

Expected output:

DR   ac
FT   ae
//
DR   cc
FT   ce
//

Code:

word = 'Homo sapiens'
with open(input_file, 'r') as txtin, open(output_file, 'w') as txtout:
    
    for block in txtin.read().split('//\n'):   # reading a file in blocks
        if word in block:   # extracted block containing the word, 'Homo sapiens'
            extracted_block = block + '//\n'

            for line in extracted_block.strip().split('\n'):   # divide each block into lines
                if line.startswith('DR   '):
                    dr = line 

                elif line.startswith('FT   '):
                    ft = line

I read input_file in blocks delimited by '//'. If a block contains the word 'Homo sapiens', I extract that block. Within the block, the line starting with 'DR   ' is stored as dr and the line starting with 'FT   ' is stored as ft. How should I write the output using dr and ft to get the expected output?

Answer 1

You can write a simple parser with a flag. When you reach an AA line that contains the word, set the flag to True so that the following fields of interest are kept. When you reach the end of a block, reset the flag.

word = 'Homo sapiens'

with open(input_file, 'r') as txtin, open(output_file, 'w') as txtout:
    keep = False
    for line in txtin:
        if keep and line.startswith(('DR', 'FT', '//')):
            txtout.write(line)
        if line.startswith('//'):
            keep = False # reset the flag on record end
        elif line.startswith('AA') and word in line:
            keep = True

Output:

DR   ac
FT   ae
//
DR   cc
FT   ce
//

NB. This requires the AA line to come before the fields you want to save. If it does not, you have to parse block by block (keeping the data in memory) with similar logic.
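The block-by-block variant mentioned in the note could look roughly like the sketch below. It buffers each record until its '//' terminator and only then decides whether to keep the DR/FT lines, so the position of the AA line within the record no longer matters. The function name extract_fields, its parameters, and the sample string are illustrative assumptions, not part of the original answer.

```python
def extract_fields(text, word="Homo sapiens", prefixes=("DR", "FT")):
    """Return the DR/FT lines of each '//'-terminated record whose AA line contains `word`.

    Hypothetical helper: the name and parameters are assumptions for illustration.
    """
    out = []
    record = []  # lines of the record currently being read
    for line in text.splitlines():
        if line.startswith("//"):
            # End of record: every line is known, so AA may appear anywhere.
            if any(l.startswith("AA") and word in l for l in record):
                out.extend(l for l in record if l.startswith(prefixes))
                out.append("//")
            record = []
        else:
            record.append(line)
    return "\n".join(out) + "\n" if out else ""


# Sample where AA comes after DR, a layout the streaming flag approach would miss.
sample = (
    "ID   aa\nDR   ac\nAA   Homo sapiens\nBB   ad\nFT   ae\n//\n"
    "ID   ba\nDR   bc\nAA   mouse\nBB   bd\nFT   be\n//\n"
)
result = extract_fields(sample)
print(result, end="")
```

Only one record is held in memory at a time, so the same loop could be run over an open file object instead of a string if the input is large.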

Answer 2

If you are open to a regex-based solution, one option is to read the entire file into a string and then use re.findall:

import re

with open(input_file, 'r') as file:
    inp = file.read()

matches = re.findall(r'(?<=//\n)ID.*?\nAA\s+Homo sapiens\n(DR\s+\w+\n)BB\s+\w+\n(FT\s+\w+\n//\n?)', '//\n' + inp)
for match in matches:
    for line in match:
        print(line, end='')

This prints:

DR   ac
FT   ae
//
DR   cc
FT   ce
//

Here is a demo showing that the pattern correctly identifies the matching blocks and the DR/FT lines within each of them.
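Note that the single pattern above relies on the exact field order ID, AA, DR, BB, FT and on single-word values. If that layout cannot be guaranteed, a looser sketch is to split the text into records first and apply smaller patterns to each record. The function name extract_with_regex and the sample string are assumptions made for illustration.

```python
import re


def extract_with_regex(text, word="Homo sapiens"):
    """Hypothetical helper: keep the DR/FT lines of records whose AA line contains `word`."""
    out = []
    # Split into records on lines that consist solely of '//'.
    for record in re.split(r"^//$\n?", text, flags=re.MULTILINE):
        # The word must appear on the AA line itself, not elsewhere in the record.
        if re.search(rf"^AA[ \t]+.*{re.escape(word)}", record, flags=re.MULTILINE):
            out.extend(re.findall(r"^(?:DR|FT)[ \t].*$", record, flags=re.MULTILINE))
            out.append("//")
    return "\n".join(out) + "\n" if out else ""


sample = (
    "ID   aa\nAA   Homo sapiens\nDR   ac\nBB   ad\nFT   ae\n//\n"
    "ID   ba\nAA   mouse\nDR   bc\nBB   bd\nFT   be\n//\n"
    "ID   ca\nAA   Homo sapiens\nDR   cc\nBB   cd\nFT   ce\n//\n"
)
extracted = extract_with_regex(sample)
print(extracted, end="")
```

This trades the single-pass regex for three simpler ones, which tolerates extra or reordered fields within a record.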

Answer 1
1 Accepted 2021-12-21 02:11:53

Answer 2
0 2021-12-21 01:54:58
