How to split a text file into smaller files based on a regex pattern?

I have a file like the following:

SCN DD1251       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
        DD1271      C           DD1271    R                                     
        DD1351      D           DD1351    B                                     
                    E                                                           
                                                                                
SCN DD1271       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
        DD1301      T           DD1301    A                                     
        DD1251      R           DD1251    C                                     
                                                                                
SCN DD1301       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
        DD1271      A           DD1271    T                                     
                    B                                                           
                    C                                                           
                    D                                                           
                                                                                
SCN DD1351       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    A           DD1251    D                                     
        DD1251      B                                                           
                    C                                                           
                                                                                
SCN DD1451       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    A                                                           
                    B                                                           
                    C                                                           
                                                                                
SCN DD1601       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    A                                                           
                    B                                                           
                    C                                                           
                    D                                                           
                                                                                
SCN GA0101       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    B           GC4251    D                                     
        GC420A      C           GA127A    S                                     
        GA127A      T                                                           
                                                                                
SCN GA0151       
            UPSTREAM               DOWNSTREAM               FILTER              
          NODE     LINK          NODE    LINK                LINK               
                    C           GA0401    R                   G                 
        GA0201      D           GC0051    E                   H                 
        GA0401      B           GA0201    W                                     
        GC0051      A                                                           

The gap between records consists of a newline character followed by 81 spaces.

Using regex101.com, I created the following regular expression, which seems to match the gaps between records:

\s{81}\n

I combined it with the short loop below, which opens the file and is meant to write each section to a new file:

delimiter_pattern = re.compile(r"\s{81}\n")

with open("Junctions.txt", "r") as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) == False:
            output = open('%d.txt' % i,'w')
            output.write(line)
        else:
            i+=1

However, instead of producing, say, 2.txt with the expected content below:

SCN DD1271
            UPSTREAM               DOWNSTREAM               FILTER
          NODE     LINK          NODE    LINK                LINK
        DD1301      T           DD1301    A
        DD1251      R           DD1251    C

It instead seems to return nothing at all. I have tried modifying the code like so:

with open("Clean-Junction-Links1.txt", "r") as f:
    i = 1
    output = open('%d.txt' % i,'w')
    for line in f:
        if delimiter_pattern.match(line) == False:
            output.write(line)
        else:
            i+=1

But this instead returns several hundred blank text files.

What is the issue with my code, and how could I modify it to make it work? Failing that, is there a simpler way to split the file on the blank lines without using regex?

You don't need a regex for this, because you can easily detect the gap between blocks with the string strip() method: a line that contains only whitespace is empty after stripping.

input_file = 'Clean-Junction-Links1.txt'

with open(input_file, 'r') as file:
    i = 0
    output = None

    for line in file:
        if not line.strip():  # Blank line?
            if output:
                output.close()
            output = None
        else:
            if output is None:
                i += 1
                print(f'Creating file "{i}.txt"')
                output = open(f'{i}.txt','w')
            output.write(line)

    if output:
        output.close()

print('-fini-')

Another, cleaner and more modular way to implement it is to divide the processing into two independent tasks that logically have very little to do with each other:

  1. Reading the file and grouping the lines of each record together.
  2. Writing each group of lines to a separate file.

The first can be implemented as a generator function which iteratively collects and yields groups of lines comprising a record. It's the one named extract_records() below.

input_file = 'Clean-Junction-Links1.txt'

def extract_records(filename):
    with open(filename, 'r') as file:
        lines = []
        for line in file:
            if line.strip():  # Not blank?
                lines.append(line)
            elif lines:  # Blank line ends the record; skip repeated blanks
                yield lines
                lines = []
        if lines:
            yield lines

for i, record in enumerate(extract_records(input_file), start=1):
    print(f'Creating file {i}.txt')
    with open(f'{i}.txt', 'w') as output:
        output.write(''.join(record))

print('-fini-')
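
A related sketch, not taken from the answer above: the standard library's itertools.groupby can do the same grouping by splitting the lines into runs of blank and non-blank lines. The function name group_records and the sample text are assumptions made for illustration.

```python
import io
from itertools import groupby

def group_records(lines):
    """Yield each run of consecutive non-blank lines as one list."""
    # groupby() splits the lines into runs that share the same key;
    # the key is True for non-blank lines and False for blank ones.
    for has_text, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            yield list(group)

# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "SCN DD1251\n  NODE LINK\n"
    + " " * 80 + "\n"
    + "SCN DD1271\n  NODE LINK\n"
)
records = list(group_records(sample))
print(len(records))               # 2
print(records[1][0].rstrip())     # SCN DD1271
```

With a real file, the open file object can be passed in place of the StringIO, and the writing loop stays the same as above.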

\s matches spaces and newlines, so {81} is satisfied by 80 spaces plus one newline. You can't get a second newline when iterating line by line with for line in f, unless you add extra logic to account for that. Also, match() returns None, not False, when there is no match.

#! /usr/bin/env python3
import re

delimiter_pattern = re.compile(r'\s{81}')

with open('Junctions.txt', 'r') as f:
    i = 1
    for line in f:
        if delimiter_pattern.match(line) is None:
            # Append, so earlier lines of the same record are kept,
            # and close the file again after each write.
            with open(f'{i}.txt', 'a') as output:
                output.write(line)
        else:
            i += 1
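
The points above can be checked with a small sketch. The sample text is made up for illustration, and it assumes the blank separator line is exactly 80 spaces followed by a newline.

```python
import io
import re

# Hypothetical two-record sample; the separator is 80 spaces plus "\n".
sample = "SCN DD1251\n" + " " * 80 + "\n" + "SCN DD1271\n"

# Iterating over a file object keeps each line's trailing newline.
lines = list(io.StringIO(sample))
blank = lines[1]

with_newline = re.compile(r"\s{81}\n")
without_newline = re.compile(r"\s{81}")

# \s{81} consumes the 80 spaces and the newline, so nothing is left for \n.
print(with_newline.match(blank))             # None
print(without_newline.match(blank) is None)  # False, i.e. it matched

# match() returns None (never False) when there is no match,
# so "== False" is never true; test with "is None" or "not" instead.
print(with_newline.match(lines[0]) == False)  # False
print(with_newline.match(lines[0]) is None)   # True
```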

You are getting blank output because you are checking whether a line matches a bunch of whitespace ( \s{81}\n ) and if there is a match, you are writing only that (blank) line. You need to instead print each line as it is read, and then jump to a new file when your pattern matches.

Also, when you iterate with for line in f, each line ends with exactly one \n. If the blank line is 80 spaces plus that newline, \s{81} already consumes it, so the extra \n in your regex has nothing left to match.

import re

delimiter_pattern = re.compile(r"\s{81}")

with open("Junctions.txt", "r") as f:
    fileNum = 1
    output = open(f'{fileNum}.txt','w') # f-strings require Python 3.6 but are cleaner
    for line in f:
        if not delimiter_pattern.match(line):
            output.write(line)
        else:
            output.close()
            fileNum += 1
            output = open(f'{fileNum}.txt','w')

    # Close last file
    if not output.closed:
        output.close()

A few things.

  1. Only a single text file is produced because you do not open a file for writing inside the loop; you open just one before the loop begins.

  2. Based on your desired output, you do not want to match the regular expression against each line; rather, you want to keep reading the file until you have a complete record.

I have put together a working solution:

with open("Junctions.txt", "r") as f:
        #read file and split on 80 spaces followed by new line
        file = f.read()
        sep = " " * 80 + "\n"
        chunks = file.split(sep)

        #for each chunk of the file write to a txt file
        i = 0
        for chunk in chunks:
            with open('%d.txt' % i, 'w') as outFile:
                outFile.write(chunk)
            i += 1

This reads the whole file and builds a list of all the groups you want by splitting on the separator (80 spaces followed by a newline).
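
If the separator width is not always exactly 80 spaces, the same whole-file approach can be sketched with re.split() instead of str.split(). This is an illustrative variant, not part of the answer above; the function name split_records and the sample text are assumptions.

```python
import re

def split_records(text):
    """Split text on whitespace-only lines and drop empty chunks."""
    # (?m) makes ^ match at the start of every line; [ \t]* allows any
    # number of spaces or tabs, so the separator width does not matter.
    chunks = re.split(r"(?m)^[ \t]*\n", text)
    return [chunk for chunk in chunks if chunk.strip()]

# Hypothetical sample: separators of 80 and 81 spaces.
sample = (
    "SCN DD1251\n  NODE LINK\n"
    + " " * 80 + "\n"
    + "SCN DD1271\n  NODE LINK\n"
    + " " * 81 + "\n"
    + "SCN DD1301\n  NODE LINK\n"
)
for i, chunk in enumerate(split_records(sample), start=1):
    print(i, chunk.splitlines()[0])
```

Each returned chunk can then be written to its own file in the same way as in the loop above.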
