
Separate blocks of text in Python

How can I separate the blocks of text within a single text file? An example is below. I have two items: the first runs from "Channel 9" to the line with "Brief: ...", and the second runs from "Southern ..." to its own "Brief" line. How can I split them into two text files with Python? I reckon the common divider would be "(female 16+)". Many thanks!


Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left 
$1,100 out
hosted by Peter Hitchener
A woman selling her caravan near Bendigo has been left $1,100 out of 
pocket after an elderly couple made the purchase with counterfeit money. 
The wildlife worker tried to use the notes to pay for a house deposit, but an 
agent noticed the notes were missing the Coat of Arms on one side. 


Brief: Radio & TV
Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)


Heathcote Police are warning the residents to be on the 
lookout a
hosted by Jo Hall
Heathcote Police are warning the residents to be on the lookout after a large 
dash of fake $50 note was discovered. Victim Marianne Thomas was given 
counterfeit notes from a caravan. The Heathcote resident tried to pay the 
house deposit and that's when the counterfeit notes were spotted. Thomas 
says the caravan is in town for the Spanish Festival.


Brief: Radio & TV
Demographics: 4,000 (male 16+) • 3,000 (female 16+)

Here's a modified example of something similar I did recently. It goes through your text line by line and copies each line into an output file. The core logic appends to the current file, whose name is reset once a section ends; the first non-blank line of the next section then becomes the new filename.

#!/usr/bin/env python
import re

data = """
Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left $1,100 out hosted by
Peter Hitchener A woman selling her caravan near Bendigo has been left $1,100
out of pocket after an elderly couple made the purchase with counterfeit money.
The wildlife worker tried to use the notes to pay for a house deposit, but an
agent noticed the notes were missing the Coat of Arms on one side.

Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Heathcote Police are warning the residents to be on the lookout a hosted by Jo
Hall Heathcote Police are warning the residents to be on the lookout after a
large dash of fake $50 note was discovered. Victim Marianne Thomas was given
counterfeit notes from a caravan. The Heathcote resident tried to pay the house
deposit and that's when the counterfeit notes were spotted. Thomas says the
caravan is in town for the Spanish Festival.

Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)
"""



current_file = None
for line in data.split('\n'):

    # Set the filename from the first non-blank line of a section
    if current_file is None and line != '':
        current_file = line + '.txt'

    # Skip blank lines between sections (e.g. the one after Brief)
    if current_file is None:
        continue

    # Append, so that each section accumulates in its own file
    with open(current_file, "a") as text_file:
        text_file.write(line + "\n")

    # Reset the filename once this section is finished,
    # which is identified by a line that:
    #    starts with Brief: - ^Brief:
    #    contains some arbitrary text - .*
    #    ends with ) - \)$
    if re.match(r'^Brief:.*\)$', line) is not None:
        current_file = None

This will output the following files

Channel 9 (1 item).txt
Southern Cross Victoria Bendigo (1 item).txt

Actually, I suspect you want to break after a line starting with Demographics:, or before a line ending with (1 item) or (2 items) or similar.
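As a rough sketch of what those two alternative rules might look like as functions (the names ends_block and starts_block are made up for illustration, and the first rule assumes Demographics: sits at the start of its own line, as in the question's sample):

```python
import re

# Assumed rule 1: a block ends on the line that starts with "Demographics:".
def ends_block(line):
    return line.strip().startswith('Demographics:')

# Assumed rule 2: a block begins on a header line such as
# "Channel 9 (1 item)" or "Some Station (2 items)".
HEADER = re.compile(r'\(\d+ items?\)$')

def starts_block(line):
    return HEADER.search(line.strip()) is not None
```

Either function can be dropped into a grouping loop in place of whatever rule you settle on.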

But however you want to break things, there are two steps to this:

  1. Come up with a rule, which you can turn into a function that you call on each line.
  2. Write some code that groups things based on the result of that function.

Let's use your rule. A function for that could be:

def is_last_line(line):
    return line.strip().endswith('(female 16+)')

Now, here's a way you could group things using that function:

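i = 1
outfile = open(f'outfile{i}.txt', 'w')
with open('input.txt') as infile:  # 'input.txt' is a placeholder name
    for line in infile:
        # strip() removes the newline, so add one back to keep lines separate
        outfile.write(line.strip() + '\n')
        if is_last_line(line):
            # This block is finished: close it and start the next file
            outfile.close()
            i += 1
            outfile = open(f'outfile{i}.txt', 'w')
outfile.close()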

There are ways you can make this a lot more concise by using, e.g., itertools.groupby, itertools.takewhile, iter, or other functions. Or you can write a generator function that still does things manually, but yields groups of lines, which would make creating the new files a lot simpler (and let us use with blocks). But being explicit like this probably makes it easier for a novice to understand (and debug, and expand on later), at the cost of a bit of verbosity.
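For comparison, here is a minimal sketch of the generator idea, assuming the same "(female 16+)" rule. The names blocks and write_blocks, and the outfile prefix, are only illustrative:

```python
def is_last_line(line):
    return line.strip().endswith('(female 16+)')

def blocks(lines):
    """Yield lists of lines, one list per block."""
    block = []
    for line in lines:
        block.append(line)
        if is_last_line(line):
            yield block
            block = []
    # Yield any leftover lines that never hit a divider, unless all blank
    if any(leftover.strip() for leftover in block):
        yield block

def write_blocks(lines, prefix='outfile'):
    """Write each block to prefix1.txt, prefix2.txt, ...; return the names."""
    names = []
    for i, block in enumerate(blocks(lines), start=1):
        name = f'{prefix}{i}.txt'
        with open(name, 'w') as outfile:
            outfile.writelines(block)
        names.append(name)
    return names
```

Because the grouping lives in the generator, each file is opened and closed in one place, and no empty file is left behind after the last block. You could call it as write_blocks(infile) with infile being an open file object.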


For example, it's not very clear from the way you phrased your question whether you actually want that Demographics: line to appear in your output files. If you don't, it should be obvious how to change things:

    if not is_last_line(line):
        outfile.write(line.strip() + '\n')
    else:
        # Close the finished file before starting the next one
        outfile.close()
        i += 1
        outfile = open(f'outfile{i}.txt', 'w')

Here is a hardcoded approach that will get this done:

s = """Channel 9 (1 item)

A woman selling her caravan near Bendigo has been left $1,100 out hosted by Peter Hitchener A woman selling her caravan near Bendigo has been left $1,100 out of pocket after an elderly couple made the purchase with counterfeit money. The wildlife worker tried to use the notes to pay for a house deposit, but an agent noticed the notes were missing the Coat of Arms on one side.

Brief: Radio & TV Demographics: 153,000 (male 16+) • 177,000 (female 16+)

Southern Cross Victoria Bendigo (1 item)

Heathcote Police are warning the residents to be on the lookout a hosted by Jo Hall Heathcote Police are warning the residents to be on the lookout after a large dash of fake $50 note was discovered. Victim Marianne Thomas was given counterfeit notes from a caravan. The Heathcote resident tried to pay the house deposit and that's when the counterfeit notes were spotted. Thomas says the caravan is in town for the Spanish Festival.

Brief: Radio & TV Demographics: 4,000 (male 16+) • 3,000 (female 16+)"""

part_1 = s[s.index("Channel 9"):s.index("Southern Cross")]

part_2 = s[s.index("Southern Cross"):]

And then save them into files.
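The saving step might look something like this. It is only a sketch: save_parts and the file names in the usage comment are placeholders, not anything from the question.

```python
def save_parts(parts, names):
    """Write each string in parts to the matching file name in names."""
    for name, part in zip(names, parts):
        # utf-8 so that characters such as the bullet (•) are written safely
        with open(name, 'w', encoding='utf-8') as f:
            f.write(part.strip() + '\n')

# Example usage (file names are placeholders):
# save_parts([part_1, part_2], ['channel_9.txt', 'southern_cross.txt'])
```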

It looks like the lines that start with "Demographics: " act as the real dividers. I would use regular expressions in two ways: first, to split the text at those lines; second, to extract those lines themselves. The results can then be combined to reconstruct the blocks:

import re
# `text` is assumed to hold the whole contents of the input file as one string
DIVIDER = 'Demographics: .+' # Make it tunable, in case you change your mind
blocks_1 = re.split(DIVIDER, text)    # the text between the dividers
blocks_2 = re.findall(DIVIDER, text)  # the divider lines themselves
blocks = ['\n\n'.join(pair) for pair in zip(blocks_1, blocks_2)]
blocks[0]
#Channel 9 (1 item)\n\nA woman selling her caravan near ... 
#... Demographics: 153,000 (male 16+) • 177,000 (female 16+)
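A single-pass variant is also possible. This is a sketch rather than a drop-in replacement; BLOCK and split_blocks are illustrative names, and it assumes every block ends with a "Demographics: " line. A lazy match captures everything up to and including the rest of that line:

```python
import re

# .*? lazily matches any text (including newlines, because of DOTALL),
# then the divider and the remainder of its line.
BLOCK = re.compile(r'.*?Demographics: [^\n]*', re.DOTALL)

def split_blocks(text):
    """Return each block, divider line included, with outer whitespace removed."""
    return [match.strip() for match in BLOCK.findall(text)]
```

Any text after the final divider line is dropped, which is fine for the sample but worth knowing about.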


 