
Python code to extract contents between multiple "start" and "end" markers

I am an absolute newbie to Python. I have 2000+ text files, and each file contains multiple news articles. Each news article starts with "Dow Jones Newswires DJDN" and ends with "(END) Dow Jones Newswires".

Here is some code that extracts all the content between multiple "start" and "end" markers:

with open('./news_txt/A_2013.txt') as infile, open('./news_txt/A_2013_a.txt', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "Dow Jones Newswires DJDN":
            copy = True
            continue
        elif line.strip() == "(END) Dow Jones Newswires":
            copy = False
            continue
        elif copy:
            outfile.write(line)

However, this code only handles the case where 1) there is only one txt file and 2) all the extracted content is stored in a single new txt file.

What I want instead is to 1) loop over every txt file in a folder and 2) save each extracted article in its own new txt file.

For example, if a txt file contains 10 news articles, running the code should produce 10 new txt files, one per article.

Cheers!!

import os, os.path
folder_path = './news_txt'

def num_to_letters(n):
    rs=[]
    while n >0:
         n-=1
         n,r = divmod(n,26)
         rs.insert(0,chr(r+ord('a')))
    return ''.join(rs)

for file_name in os.listdir(folder_path):
    if not file_name.lower().endswith('.txt'):
        continue
    in_file_path = os.path.join(folder_path,file_name)
    
    
    with open(in_file_path,'r') as in_file:
        out_file = None
        num_out_files = 0
        for line in in_file:
            if line.strip() == "Dow Jones Newswires DJDN":
                if out_file is None:
                    num_out_files +=1
                    out_file_name = file_name[:-4]+'_'+num_to_letters(num_out_files)+'.txt'
                    out_file_path = os.path.join(folder_path,out_file_name)
                    out_file = open(out_file_path,'w')
                continue
            elif line.strip() == "(END) Dow Jones Newswires":
                if out_file is not None:
                    out_file.close()
                    out_file = None
                continue
            elif out_file is not None:
                out_file.write(line)
        if out_file is not None:
            out_file.close()

Do yourself a favor and use a regular expression instead:

^Dow Jones Newswires DJDN.+?^\(END\) Dow Jones Newswires

Use it with the modifiers m (multiline, so ^ matches at the start of each line) and s (dot also matches newlines); see a demo on regex101.com.
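In Python, the m and s modifiers correspond to re.MULTILINE and re.DOTALL. The following is a minimal sketch of how the pattern could be applied, not a definitive implementation. It assumes each marker sits on its own line, as in the question. The capture group around the article body, the helper names extract_articles and split_file, the numeric file suffix, and the separate output folder are all illustrative choices that do not come from the original posts.

```python
import re
from pathlib import Path

# The pattern from the answer, with a capture group added around the body
# so the marker lines themselves are not written out (an assumption about
# what the asker wants, based on the loop in the question).
ARTICLE_RE = re.compile(
    r"^Dow Jones Newswires DJDN[ \t]*\n(.*?)^\(END\) Dow Jones Newswires",
    re.MULTILINE | re.DOTALL,
)


def extract_articles(text):
    """Return the body of every article found between the two markers."""
    return ARTICLE_RE.findall(text)


def split_file(in_path, out_dir):
    """Write each article in in_path to its own file inside out_dir."""
    in_path, out_dir = Path(in_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, body in enumerate(extract_articles(in_path.read_text()), start=1):
        # Hypothetical naming scheme: A_2013.txt -> A_2013_1.txt, A_2013_2.txt, ...
        out_path = out_dir / f"{in_path.stem}_{i}.txt"
        out_path.write_text(body)
        written.append(out_path)
    return written


# Small in-memory demonstration of the pattern.
sample = (
    "header\n"
    "Dow Jones Newswires DJDN\n"
    "first story\n"
    "(END) Dow Jones Newswires\n"
    "noise\n"
    "Dow Jones Newswires DJDN\n"
    "second story\nline two\n"
    "(END) Dow Jones Newswires\n"
)
print(extract_articles(sample))
# prints ['first story\n', 'second story\nline two\n']
```

To process a whole folder, one could call split_file(path, './news_txt_split') for every path in Path('./news_txt').glob('*.txt'). Writing to a separate folder (the name here is made up) keeps the split files from being picked up as input on a later run. Note that this approach reads each file fully into memory, which is usually fine for text files of modest size.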
