Regular expression matching repeating pattern across multiple lines

Question

I have a file with a header (indicated with '>') followed by text on the next line. I need to capture the groups that contain identical numbers in the header. In the example text below, I would like to print the first four lines (both headers contain '4471') to one file and the last four lines (headers contain '4527') to a different file.

>VUSY-4471
AAAGTAATTCAGGATGAAGAGAGACTGCT
>XFJG-4471
AATGTTATTCAAGATGAAGATAGGTTGCTGGCTGCA
>Ambtr-4527
GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGG
>Arath-4527
GAAGAGAGAGTGAATGTTCTTGTA

The following regex successfully captures the groups of text when tested in a text editor (see screenshot), but I can't seem to make it work in a python script. Any help would be greatly appreciated!!

>.+?-(\d+)[\S\s]+>.+-\1\n.+

Example of captured text

Answer 1

You can probably save yourself some time by not trying to solve the entire problem with one regular expression. Break down what you're trying to do: read two lines, decide which file they need to go to based on the number in the first line, then move on to the next pair until the entire file has been parsed. That way, all you need is a very simple regex to get the number from the first line: ^>.+?-(\d+)$ or even just >.+-(\d+) if you're doing it a line at a time.
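The pair-at-a-time approach described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the helper names `group_records` and `write_groups` are made up for this example, every header is assumed to be followed by exactly one sequence line, and output files are assumed to be named after the number (for example `4471.txt`).

```python
import re
from collections import defaultdict

# Simple per-line pattern: capture the digits after the last hyphen in a header
HEADER = re.compile(r'^>.+-(\d+)$')

def group_records(lines):
    """Group (header, sequence) pairs by the number at the end of each header."""
    groups = defaultdict(list)
    it = iter(lines)
    for header in it:
        header = header.rstrip('\n')
        if not header:
            continue  # skip blank lines between records
        match = HEADER.match(header)
        if match is None:
            raise ValueError(f"unexpected header line: {header!r}")
        # Assumption: the line right after a header is its sequence
        sequence = next(it, '').rstrip('\n')
        groups[match.group(1)].append((header, sequence))
    return dict(groups)

def write_groups(groups, suffix='.txt'):
    """Write each group to its own file, e.g. 4471.txt (hypothetical naming)."""
    for number, records in groups.items():
        with open(number + suffix, 'w') as out:
            for header, sequence in records:
                out.write(f"{header}\n{sequence}\n")

sample = """>VUSY-4471
AAAGTAATTCAGGATGAAGAGAGACTGCT
>XFJG-4471
AATGTTATTCAAGATGAAGATAGGTTGCTGGCTGCA
>Ambtr-4527
GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGG
>Arath-4527
GAAGAGAGAGTGAATGTTCTTGTA
"""

groups = group_records(sample.splitlines())
```

Calling `write_groups(groups)` would then create one file per number. Collecting the records first means each output file is opened only once, and headers with the same number do not have to be adjacent in the input.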

Answer 2

That regex seems a little over-complicated for just extracting a string of digits. Here's a solution with a simpler regex:

import re

pat = re.compile(r'(\d+)')

with open('infile.txt') as infile:
    for line in infile:
        # The first run of digits on the header line names the output file
        num = pat.findall(line)[0]
        with open(num + ".txt", "a+") as f:
            f.write(line)
            f.write(next(infile))  # This assumes an even number of lines in the input file
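For comparison, the multi-line pattern from the question can also be used from Python, provided it is applied to the whole file contents rather than to one line at a time. The thread does not show the failing script, so this is only a guess at the cause. The sketch below assumes the text has been read into a single string (for example with `infile.read()`) and uses the sample data from the question.

```python
import re

# Pattern from the question, unchanged. [\S\s]+ can cross line breaks,
# so it only works when the whole text is searched as one string.
BLOCK = re.compile(r'>.+?-(\d+)[\S\s]+>.+-\1\n.+')

text = (
    ">VUSY-4471\n"
    "AAAGTAATTCAGGATGAAGAGAGACTGCT\n"
    ">XFJG-4471\n"
    "AATGTTATTCAAGATGAAGATAGGTTGCTGGCTGCA\n"
    ">Ambtr-4527\n"
    "GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGG\n"
    ">Arath-4527\n"
    "GAAGAGAGAGTGAATGTTCTTGTA\n"
)

# Map each captured number to the block of lines that matched for it
blocks = {m.group(1): m.group(0) for m in BLOCK.finditer(text)}
```

Note that this pattern relies on records with the same number sitting next to each other, and on every number appearing in at least two headers, so the line-by-line approaches are likely more robust.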

2 answers

solution1
0 2019-02-08 03:08:40

solution2
0 ACCEPTED 2019-02-08 03:09:03
