Regex to match a multi-line repeating pattern

Question

I have a file in which each header line (marked with '>') is followed by a line of text. I need to capture the groups of records whose headers contain the same number. In the example text below, I would like to print the first four lines (both headers contain '4471') to one file and the last four lines (both headers contain '4527') to a different file.

>VUSY-4471
AAAGTAATTCAGGATGAAGAGAGACTGCT
>XFJG-4471
AATGTTATTCAAGATGAAGATAGGTTGCTGGCTGCA
>Ambtr-4527
GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGG
>Arath-4527
GAAGAGAGAGTGAATGTTCTTGTA

The following regex successfully captures the groups of text when tested in a text editor (see screenshot), but I can't get it to work in a Python script. Any help would be greatly appreciated!

>.+?-(\d+)[\S\s]+>.+-\1\n.+

Example of captured text

Answer 1

Rather than trying to solve the entire problem with one regular expression, you can probably save yourself some time by breaking down what you're trying to do: read two lines, decide which file they belong in based on the number in the first line, then move on to the next pair until the entire file has been parsed. That way, all you need is a very simple regex to get the number from the first line: ^>.+?-(\d+)$ or even just >.+-(\d+) if you're processing one line at a time.
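The pair-at-a-time approach described above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: every header line is immediately followed by exactly one sequence line, and the names `HEADER`, `group_records` and `write_groups` are made up for this example, as is the choice of naming each output file after its number.

```python
import re
from collections import defaultdict

# Matches a header such as ">VUSY-4471" and captures the trailing number.
HEADER = re.compile(r'^>.+-(\d+)$')


def group_records(lines):
    """Group (header, sequence) pairs by the number in the header.

    Assumes each header line is followed by exactly one sequence line.
    """
    groups = defaultdict(list)
    it = iter(lines)
    for header in it:
        header = header.rstrip('\r\n')
        match = HEADER.match(header)
        if match is None:
            continue  # not a header line, so skip it
        # Take the line after the header as its sequence ('' if the file ends early).
        sequence = next(it, '').rstrip('\r\n')
        groups[match.group(1)].append((header, sequence))
    return dict(groups)


def write_groups(groups, suffix='.txt'):
    """Write each group to its own file, e.g. 4471.txt (hypothetical naming)."""
    for number, records in groups.items():
        with open(number + suffix, 'w') as out:
            for header, sequence in records:
                out.write(header + '\n' + sequence + '\n')
```

To use it, something like `with open('infile.txt') as f: write_groups(group_records(f))` should work, where `infile.txt` stands in for the real input path. Collecting the groups in a dictionary first means each output file is opened once rather than once per record.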

Answer 2

That regex seems a little over-complicated for just extracting a string of digits. Here's a solution with a simpler regex:

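import re

# Capture the first run of digits on a header line, e.g. "4471" in ">VUSY-4471".
pat = re.compile(r'(\d+)')

with open('infile.txt') as infile:
    for line in infile:
        num = pat.findall(line)[0]
        # Append the header and its sequence line to a file named after the number.
        with open(num + ".txt", "a+") as f:
            f.write(line)
            f.write(next(infile))  # This assumes an even number of lines in the input file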
