Break a large text file into separate files using a double carriage return

I'm using Python 2.7 with Windows 7. I have a single large text file that I want to break into several smaller files. The format of the file currently looks like this . . .

Double carriage return
Header line
Body (consisting of several lines)
Double carriage return
Header line
Body (consisting of several lines)

I want to create separate text files using the Header line as the file name and the Body as the file content. The Double carriage return identifies the start of a new file.

I've searched Stack Overflow but haven't found what I'm looking for. I'm very new to Python so any help would be much appreciated.

The code I have so far is . . .

fh = open(path/file.txt)
data = fh.read()
doc = re.split(r'[\r\n\r\n]',data)
for para in doc:
    header = re.search('^[1-9].+Chapter', para)
    filename = str(header) + ".txt"
    fwrite = open(filename,"w")
    fwrite.write(para)
    fwrite.close()

I'd like to use the first line as the text file title.

The first line does not open the file properly: the path has to be a quoted string. With that fixed, this should work, assuming everything else exists. It is also good practice to keep the file opening inside a try/except block.

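import re

with open('path/file.txt', 'r') as fh:
    data = fh.read()
# Split on blank lines; a character class like [\r\n\r\n] would split on every single newline
doc = re.split(r'(?:\r?\n){2,}', data.strip())
for para in doc:
    # search() returns a match object (or None), so take the matched text
    header = re.search(r'^[1-9].+Chapter', para)
    if header is None:
        continue
    filename = header.group(0) + ".txt"
    with open(filename, "w") as fwrite:
        fwrite.write(para)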
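
The try/except advice above could look roughly like this. It is a minimal sketch rather than part of the original answer: the helper name read_text and the choice to return None on failure are assumptions made for illustration. io.open is used because it behaves the same on Python 2.7 and Python 3.

```python
import io


def read_text(path):
    """Return the contents of path, or None if it cannot be opened.

    read_text is a hypothetical helper, not something from the question.
    """
    try:
        # io.open behaves the same on Python 2.7 and Python 3
        with io.open(path, 'r') as fh:
            return fh.read()
    except IOError as err:
        # IOError covers a missing file as well as a permission problem
        print('Could not open {}: {}'.format(path, err))
        return None
```

A caller would then check the result before splitting, for example `data = read_text('path/file.txt')` followed by `if data is not None:`.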

The argument to open is a quoted string; you omitted the quotes.

Your code pulls the entire file into memory. That is not a problem with small files, but it needlessly restricts your program. If there is no need to analyze the lines together, it is better to read one line at a time, write it out again, and then forget it.

Your code hard-codes DOS carriage returns, which is not only tasteless...
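
One way to avoid hard-coding a line-ending style is to let Python translate newlines while reading. The sketch below assumes io.open (available in Python 2.7 and Python 3), whose newline=None setting turns both \r\n and \r into \n. The helper name and the sample text are made up for illustration.

```python
import io
import os
import tempfile


def read_lines_normalized(path):
    """Read path with universal newlines, so every line ends in '\\n'."""
    # newline=None translates '\r\n' and '\r' to '\n' on input
    with io.open(path, 'r', newline=None) as fh:
        return fh.readlines()


# Write a small sample with DOS line endings, in binary mode so nothing
# is translated on the way out
sample = os.path.join(tempfile.mkdtemp(), 'dos.txt')
with io.open(sample, 'wb') as fh:
    fh.write(b'1 Chapter One\r\nbody\r\n\r\n')

lines = read_lines_normalized(sample)
# Whatever the file used on disk, a blank line now reads as '\n'
```

With the input normalized this way, a separator test such as line == '\n' works for files written on either Windows or Unix.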

Your code does not enforce the requirement that the first line after the separator has to contain the chapter title. If this is not a hard requirement, the replacement code will need some changes. I figured it's better to alert and abort than pull stuff from further down in the file which just happens to match; but with the refactored code, the latter approach isn't really even feasible.

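# 'rU' asks Python 2 for universal newlines; Python 3 does this by default with 'r'
with open('path/file', 'rU') as infile:
    output = None
    for line in infile:
        if output is None:
            if line.strip() == '':
                # Skip the blank line(s) that precede a header
                continue
            if 'Chapter' in line and line[0:1].isdigit():
                # The header line, minus its newline, becomes the file name
                output = open('.'.join([line.rstrip(), 'txt']), 'w')
            else:
                raise ValueError(
                    'First line in paragraph is not chapter header: '
                    '{}'.format(line.rstrip()))
        elif line == '\n':
            # A blank line ends the current chapter
            output.close()
            output = None
            continue
        output.write(line)
    # Close the last chapter if the file does not end with a blank line
    if output is not None:
        output.close()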
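
If the chapter-title check is not a hard requirement, a looser variant can name each file after whatever line comes first in a block, as the question asks. This is only a sketch under a few assumptions: the function name split_on_blank_lines and the out_dir parameter are invented here, every header line is assumed to be a valid file name on the target system, and the header is used for the name only and not written into the file.

```python
import io
import os


def split_on_blank_lines(src_path, out_dir):
    """Write each blank-line-separated block of src_path to its own file.

    The first line of a block names the file; the remaining lines are its
    content. Returns the list of paths written.
    """
    written = []
    output = None
    with io.open(src_path, 'r') as infile:
        for line in infile:
            if line.strip() == '':
                # A blank line closes the current block, if there is one
                if output is not None:
                    output.close()
                    output = None
                continue
            if output is None:
                # First non-blank line after a separator: use it as the name
                path = os.path.join(out_dir, line.strip() + '.txt')
                output = io.open(path, 'w')
                written.append(path)
                continue
            output.write(line)
    # Close the last block if the file does not end with a blank line
    if output is not None:
        output.close()
    return written
```

It could be called as `split_on_blank_lines('path/file.txt', '.')`. Like the code above, it holds only one line in memory at a time.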
