
Creating multiple txt files from a text file

I am trying to take the Federalist Papers from Project Gutenberg and convert them into individual text documents. Project Gutenberg does not separate the papers: the download reads in as one large text file, so I have to tell Python to create a new text file for each Federalist Paper (each one sits between the phrase "FEDERALIST No. _" and "PUBLIUS").

The code that I have mostly works, but the first text file it creates (named 1.txt, per my code) is wrong. When I open this file, it contains the entire original text scraped from Project Gutenberg, not just the text for Federalist 1. The file 2.txt then contains only Federalist 1, so the text is being cut correctly, but every file is offset by one from the paper it is supposed to hold.

I suspect my issue is somewhere in the for loop, and maybe with how I'm initializing my variables, but I can't see what is causing this error.

# Importing the doc and creating individual txt files for each federalist paper

from urllib import request

url = "https://www.gutenberg.org/files/1404/1404.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

# finding the start and end of the portion of the doc we care about and subsetting
raw.find("FEDERALIST No. 1")
raw.rfind("PUBLIUS")
raw = raw[821:1167459]
# fixing the doc again... yeah this ain't clean but it's right
raw = raw[0:1166638]
# save as txt to work with below
print(raw, file=open("all.txt", "a"))

# looping over the whole text to break it into individual text docs by each
# federalist paper
with open("all.txt") as fo:
    op = ''
    start = 0
    cntr = 1
    paper = 1
    for x in fo.read().split("\n"):  # looping over the text by each line split
        if x == 'FEDERALIST No. ' + str(paper):  # creating new txt if we
                                                 # encounter a new fed paper
            if start == 1:
                with open(str(cntr) + '.txt', 'w') as opf:
                    opf.write(op)
                    opf.close()
                    op = ''
                    cntr += 1
                    paper += 1
            else:
                start = 1
        else:
            if op == '':
                op = x
            else:
                op = op + '\n' + x
    fo.close()

You can use the re module to split the text:

import re
import requests


url = "https://www.gutenberg.org/files/1404/1404.txt"
text = requests.get(url).text

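# Each match starts at a line beginning with "FEDERALIST No." and runs
# lazily up to (but not including) the next line that starts with
# "PUBLIUS" or "FEDERALIST". re.M makes ^ match at every line start;
# re.S lets . match newlines.
r = re.compile(
    r"^(FEDERALIST No\..*?)(?=^PUBLIUS|^FEDERALIST)", flags=re.M | re.S
)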
for i, section in enumerate(r.findall(text), 1):
    with open("{}.txt".format(i), "w") as f_out:
        f_out.write(section)

This will create 85 .txt files, each containing one paper from the text.
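
As for why the original loop is off by one: `paper` is only incremented when a file is written, so after the first heading the loop keeps testing for "FEDERALIST No. 1". Since all.txt is opened in append mode, rerunning the script appears to put a second copy of the text in the file, and it is that second "FEDERALIST No. 1" which triggers the first write, with everything read so far going into 1.txt. If you would rather stay close to the line-by-line approach, here is a minimal sketch. The names `HEADING`, `split_papers` and `write_papers` are hypothetical, the sample string stands in for the real download, and the pattern assumes every heading sits alone on its own line.

```python
import os
import re
import tempfile

# Assumed heading format: "FEDERALIST No. <number>" alone on a line.
HEADING = re.compile(r"FEDERALIST No\. \d+\s*$")


def split_papers(text):
    """Return a list of papers, each starting at its heading line."""
    papers = []
    current = None  # lines of the paper being collected; None before the first heading
    for line in text.splitlines():
        if HEADING.match(line):
            if current is not None:
                papers.append("\n".join(current))  # finish the previous paper
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        papers.append("\n".join(current))  # flush the last paper too
    return papers


def write_papers(papers, out_dir):
    """Write each paper to <out_dir>/<n>.txt, numbering from 1."""
    for number, paper in enumerate(papers, 1):
        path = os.path.join(out_dir, "{}.txt".format(number))
        with open(path, "w", encoding="utf-8") as f_out:
            f_out.write(paper)


# Small stand-in for the Gutenberg text, just to show the behaviour.
sample = (
    "preamble\n"
    "FEDERALIST No. 1\nfirst body\nPUBLIUS\n"
    "FEDERALIST No. 2\nsecond body\nPUBLIUS\n"
)
papers = split_papers(sample)

with tempfile.TemporaryDirectory() as out_dir:
    write_papers(papers, out_dir)
    written = sorted(os.listdir(out_dir))
```

Matching any heading, rather than one specific paper number, removes the counter that drifted out of step, and the flush after the loop makes sure the last paper is written as well. Unlike the regex above, this version keeps the closing "PUBLIUS" line in each paper.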
