How to split a text file into smaller files based on regex pattern?

Question

I have a file like the following:

SCN DD1251       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
        DD1271      C           DD1271    R                                     
        DD1351      D           DD1351    B                                     
                    E                                                           
                                                                                
SCN DD1271       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
        DD1301      T           DD1301    A                                     
        DD1251      R           DD1251    C                                     
                                                                                
SCN DD1301       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
        DD1271      A           DD1271    T                                     
                    B                                                           
                    C                                                           
                    D                                                           
                                                                                
SCN DD1351       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    A           DD1251    D                                     
        DD1251      B                                                           
                    C                                                           
                                                                                
SCN DD1451       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    A                                                           
                    B                                                           
                    C                                                           
                                                                                
SCN DD1601       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    A                                                           
                    B                                                           
                    C                                                           
                    D                                                           
                                                                                
SCN GA0101       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    B           GC4251    D                                     
        GC420A      C           GA127A    S                                     
        GA127A      T                                                           
                                                                                
SCN GA0151       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    C           GA0401    R                   G                 
        GA0201      D           GC0051    E                   H                 
        GA0401      B           GA0201    W                                     
        GC0051      A

The gap between consecutive records is a newline character followed by 81 spaces.

Using regex101.com, I have created the following regular expression, which seems to match the gaps between records:

\s{81}\n

Combined with the short loop below to open the file and then write each section to a new file:

delimiter_pattern = re.compile(r"\s{81}\n")

with open("Junctions.txt", "r") as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) == False:
            output = open('%d.txt' % i,'w')
            output.write(line)
        else:
            i+=1

However, instead of outputting, say 2.txt as expected below:

SCN DD1271
            UPSTREAM               DOWNSTREAM               FILTER
          NODE     LINK          NODE    LINK                LINK
        DD1301      T           DD1301    A
        DD1251      R           DD1251    C

It instead seems to return nothing at all. I have tried modifying the code like so:

with open("Clean-Junction-Links1.txt", "r") as f:
    i = 1
    output = open('%d.txt' % i,'w')
    for line in f:
        if delimiter_pattern.match(line) == False:
            output.write(line)
        else:
            i+=1

But this instead returns several hundred blank text files.

What is the issue with my code, and how could I modify it to make it work? Failing that, is there a simpler way to split the file on the blank lines without using regex?

Answer 1

You don't need to use a regex to do this because you can detect the gap between blocks easily by using the string strip() method.

input_file = 'Clean-Junction-Links1.txt'

with open(input_file, 'r') as file:
    i = 0
    output = None

    for line in file:
        if not line.strip():  # Blank line?
            if output:
                output.close()
            output = None
        else:
            if output is None:
                i += 1
                print(f'Creating file "{i}.txt"')
                output = open(f'{i}.txt','w')
            output.write(line)

    if output:
        output.close()

print('-fini-')

Another, cleaner and more modular, way to implement it would be to divide the processing up into two independent tasks that logically have very little to do with each other:

Reading the file and grouping the lines of each record together.
Writing each group of lines to a separate file.

The first can be implemented as a generator function which iteratively collects and yields groups of lines comprising a record. It's the one named extract_records() below.

input_file = 'Clean-Junction-Links1.txt'

def extract_records(filename):
    with open(filename, 'r') as file:
        lines = []
        for line in file:
            if line.strip():  # Not blank?
                lines.append(line)
            elif lines:  # Blank line ends the current record (skips repeated blanks)
                yield lines
                lines = []
        if lines:
            yield lines

for i, record in enumerate(extract_records(input_file), start=1):
    print(f'Creating file {i}.txt')
    with open(f'{i}.txt', 'w') as output:
        output.write(''.join(record))

print('-fini-')
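
As a side note, the grouping step can also be sketched with itertools.groupby from the standard library. This is an alternative technique, not the approach used above, and the name group_records and the sample lines are made up for illustration. It accepts any iterable of lines, so an open file object works as well as a list.

```python
from itertools import groupby

def group_records(lines):
    """Yield each run of consecutive non-blank lines as a list."""
    # The key is True for whitespace-only lines, False for record lines.
    for is_blank, run in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield list(run)

# Hypothetical sample: two short records separated by a whitespace-only line.
sample = ["SCN A\n", "  x\n", "   \n", "SCN B\n", "  y\n"]
records = list(group_records(sample))
print(len(records))  # 2
```

Because groupby merges consecutive lines that share a key, several blank lines in a row count as a single separator.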

Answer 2

\s matches spaces and newlines, so the 81 whitespace characters are 80 spaces plus one newline. You can't get a second newline when iterating line-by-line with for line in f, unless you add extra logic to account for that. Also, match() returns None when there is no match, not False.

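#!/usr/bin/env python3
import re

# 80 spaces plus the trailing newline make 81 whitespace characters
delimiter_pattern = re.compile(r'\s{81}')

with open('Junctions.txt', 'r') as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) is None:
            # Append to the current record's file, closing it after each write
            with open(f'{i}.txt', 'a') as output:
                output.write(line)
        else:
            i += 1  # Delimiter line: move on to the next output file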
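
The points in this answer can be checked on a small in-memory sample. This is only a sketch: io.StringIO stands in for the real file, and the 80-space width of the delimiter line is an assumption taken from this answer rather than something verified against the actual data.

```python
import io
import re

# Stand-in for the real file: one record line, then a delimiter line
# of 80 spaces (assumed width) ending in a single newline.
sample = io.StringIO("SCN DD1251\n" + " " * 80 + "\n")
lines = list(sample)

blank = lines[1]
print(len(blank), blank.count("\n"))  # 81 characters, exactly one newline

with_newline = re.compile(r"\s{81}\n")   # pattern from the question
without_newline = re.compile(r"\s{81}")  # pattern from this answer

print(with_newline.match(blank))                 # None: no second newline left
print(without_newline.match(blank) is not None)  # True
print(with_newline.match(lines[0]) == False)     # False: None is not False
```

The last line shows why the original `== False` test never succeeds: a failed match is None, and None does not compare equal to False.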

Answer 3

You are getting blank output because the condition delimiter_pattern.match(line) == False is never true: match() returns either a match object or None, so output.write(line) is never reached. You need to instead write each line as it is read, and then jump to a new file when your pattern ( \s{81} ) matches.

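Also, when you use for line in f, each line keeps its trailing \n but never contains a second one. If the blank lines are 80 spaces plus that newline, \s{81} already consumes it, so \s{81}\n will not match; drop the trailing \n from the pattern.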

import re

delimiter_pattern = re.compile(r"\s{81}")

with open("Junctions.txt", "r") as f:
    fileNum = 1
    output = open(f'{fileNum}.txt','w') # f-strings require Python 3.6 but are cleaner
    for line in f:
        if not delimiter_pattern.match(line):
            output.write(line)
        else:
            output.close()
            fileNum += 1
            output = open(f'{fileNum}.txt','w')

    # Close the last file
    if not output.closed:
        output.close()

Answer 4

A few things.

Only a single text file is produced because you do not open a file for writing inside the loop; you open just one before the loop begins.
Based on your desired output, you do not want to match the regular expression against each line; rather, you want to keep reading the file until you have a complete record.

I have put together a working solution:

with open("Junctions.txt", "r") as f:
    # Read the whole file into memory
    text = f.read()

# Split on 80 spaces followed by a newline
sep = " " * 80 + "\n"
chunks = text.split(sep)

# Write each chunk to its own file, numbered from 1.txt
for i, chunk in enumerate(chunks, start=1):
    with open('%d.txt' % i, 'w') as outFile:
        outFile.write(chunk)

This reads the whole file and splits it into the groups you want using a single separator (80 spaces followed by a newline).
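
If the width of the blank lines is not known for certain, a regex-based variant of the same idea is to split on any whitespace-only line. The following is a minimal sketch under that assumption; the names BLANK_LINES and split_records and the sample text are hypothetical, not part of the original code.

```python
import re

# One or more consecutive lines containing only spaces or tabs act as
# the separator, whatever their width.
BLANK_LINES = re.compile(r"(?:^[ \t]*\n)+", re.MULTILINE)

def split_records(text):
    """Split text on whitespace-only lines and drop empty chunks."""
    return [chunk for chunk in BLANK_LINES.split(text) if chunk.strip()]

# Hypothetical sample: two short records separated by an 80-space line.
sample = "SCN A\n  x\n" + " " * 80 + "\n" + "SCN B\n  y\n"
chunks = split_records(sample)
print(len(chunks))  # 2
```

Each chunk can then be written to its own file in the same way as in the loop above.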

solution1
2 ACCEPTED 2021-06-21 20:58:01

solution2
1 2021-06-21 20:51:48

solution3
1 2021-06-21 20:55:30

solution4
1 2021-06-21 20:58:09
