
Regular expression matching repeating pattern across multiple lines

I have a file with a header (indicated with '>') followed by text on the next line. I need to capture the groups that contain identical numbers in the header. In the example text below, I would like to print the first four lines (both headers contain '4471') to one file and the last four lines (headers contain '4527') to a different file.

>VUSY-4471
AAAGTAATTCAGGATGAAGAGAGACTGCT
>XFJG-4471
AATGTTATTCAAGATGAAGATAGGTTGCTGGCTGCA
>Ambtr-4527
GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGG
>Arath-4527
GAAGAGAGAGTGAATGTTCTTGTA

The following regex successfully captures the groups of text when tested in a text editor (see screenshot), but I can't seem to make it work in a Python script. Any help would be greatly appreciated!

>.+?-(\d+)[\S\s]+>.+-\1\n.+

[Screenshot: example of captured text]

You can probably save yourself some time by not trying to solve the entire problem with a single regular expression. Break down what you're trying to do: read two lines, decide which file they belong in based on the number in the first line, then move on to the next pair until the entire file has been parsed. That way, all you need is a very simple regex to get the number from the first line: ^>.+?-(\d+)$ or even just >.+-(\d+) if you're processing the file one line at a time.
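
The pair-at-a-time idea described above could be sketched roughly as follows. This is only an illustration: the names `HEADER`, `group_records` and `write_groups` are made up for this example, and it assumes every header line is immediately followed by exactly one sequence line.

```python
import re
from collections import defaultdict

# Matches a header such as ">VUSY-4471" and captures the trailing number.
HEADER = re.compile(r"^>.+?-(\d+)$")


def group_records(lines):
    """Group (header, sequence) pairs by the number at the end of the header."""
    groups = defaultdict(list)
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        match = HEADER.match(header)
        if match is None:
            continue  # skip blank or unexpected lines
        # Assumption: the sequence is the single line right after the header.
        sequence = next(it, "").rstrip("\n")
        groups[match.group(1)].append((header, sequence))
    return dict(groups)


def write_groups(groups, suffix=".txt"):
    """Write each group to its own file, e.g. 4471.txt and 4527.txt."""
    for number, records in groups.items():
        with open(number + suffix, "w") as out:
            for header, sequence in records:
                out.write(f"{header}\n{sequence}\n")


sample = [
    ">VUSY-4471",
    "AAAGTAATTCAGGATGAAGAGAGACTGCT",
    ">XFJG-4471",
    "AATGTTATTCAAGATGAAGATAGGTTGCTGGCTGCA",
    ">Ambtr-4527",
    "GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGG",
    ">Arath-4527",
    "GAAGAGAGAGTGAATGTTCTTGTA",
]

groups = group_records(sample)
for number, records in groups.items():
    print(number, len(records))
```

To use it on a real file, pass the open file object to `group_records` and then call `write_groups(groups)`. Collecting the records in a dictionary first means each output file is opened only once, and headers with the same number do not have to be adjacent in the input.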

That regex seems a little over-complicated for just extracting a string of digits. Here's a solution with a simpler regex:

import re

pat = re.compile(r'(\d+)')

with open('infile.txt') as infile:
    for line in infile:
        num = pat.findall(line)[0]  # first run of digits in the header line
        with open(num + ".txt", "a") as f:  # append to the file named after that number
            f.write(line)
            f.write(next(infile))  # This assumes an even number of lines in the input file
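
If you would still like to use the multi-line pattern from the question, here is a hedged sketch of how it might be run in Python. The question doesn't show the Python attempt, so this is only a guess at what went wrong: `re.findall` returns just the captured group when a pattern contains one group, whereas `re.finditer` gives access to the whole match through `group(0)`. The names `PATTERN`, `text`, `numbers` and `blocks` are chosen for this example.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r">.+?-(\d+)[\S\s]+>.+-\1\n.+")

text = (
    ">VUSY-4471\n"
    "AAAGTAATTCAGGATGAAGAGAGACTGCT\n"
    ">XFJG-4471\n"
    "AATGTTATTCAAGATGAAGATAGGTTGCTGGCTGCA\n"
    ">Ambtr-4527\n"
    "GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGG\n"
    ">Arath-4527\n"
    "GAAGAGAGAGTGAATGTTCTTGTA\n"
)

# With a single capture group, findall yields only that group (the numbers).
numbers = PATTERN.findall(text)

# finditer yields match objects; group(0) is the full matched block of lines.
blocks = {m.group(1): m.group(0) for m in PATTERN.finditer(text)}

for number, block in blocks.items():
    print(f"--- {number} ---")
    print(block)
```

Note that this requires reading the whole file into one string, and because `[\S\s]+` is greedy, a number that reappears later in the file would make a single match swallow everything in between. The line-by-line approach avoids both problems.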
