
Python - how to separate paragraphs from text?

I need to split a text into paragraphs and be able to work with each of them. How can I do that? Any two paragraphs are separated by at least one empty line, like this:

Hello world,
  this is an example.

Let´s program something.


Creating  new  program.

Thanks in advance.

This should work:

text.split('\n\n')

Try:

result = list(filter(lambda x : x != '', text.split('\n\n')))
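A quick check of how the two snippets above behave on the sample from the question may help. This is only a sketch: the names plain, filtered and cleaned are made up for illustration, and the sample string is retyped from the question. It shows that a run of three newlines leaves a stray "\n" at the start of the next paragraph, which filtering out '' does not remove, and that stripping each piece is one possible fix.

```python
# Sample text retyped from the question: one blank line, then two blank lines.
text = (
    "Hello world,\n"
    "  this is an example.\n"
    "\n"
    "Let's program something.\n"
    "\n"
    "\n"
    "Creating  new  program."
)

# Plain split: "\n\n\n" is consumed as "\n\n" plus a leftover "\n".
plain = text.split("\n\n")

# Filtering out '' only helps when the run of newlines has even length.
filtered = list(filter(lambda x: x != "", plain))

# Stripping newlines from each piece removes the leftover "\n".
cleaned = [p.strip("\n") for p in plain if p.strip()]

print(plain)
print(cleaned)
```

Here the third element of plain (and of filtered) still starts with a newline, while cleaned holds the three paragraphs without it.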

Not an entirely trivial problem, and the standard library doesn't seem to have any ready solutions.

Paragraphs in your example are separated by two or more newlines, which unfortunately means text.split("\n\n") does not work reliably: a run of three newlines leaves a stray "\n" at the start of the next paragraph. I think that instead, splitting with a regular expression is a workable strategy:

import fileinput
import re

NEWLINES_RE = re.compile(r"\n{2,}")  # two or more "\n" characters

def split_paragraphs(input_text=""):
    no_newlines = input_text.strip("\n")  # remove leading and trailing "\n"
    split_text = NEWLINES_RE.split(no_newlines)  # regex splitting

    paragraphs = [p + "\n" for p in split_text if p.strip()]
    # p + "\n" ensures that all lines in the paragraph end with a newline
    # p.strip() is truthy only if the paragraph contains non-whitespace characters

    return paragraphs

# sample code, to split all script input files into paragraphs
text = "".join(fileinput.input())
for paragraph in split_paragraphs(text):
    print(f"<<{paragraph}>>\n")

Edited to add:

It is probably cleaner to use a state machine approach. Here's a fairly simple example using a generator function, which has the added benefit of streaming through the input one line at a time, and not storing complete copies of the input in memory:

import fileinput

def split_paragraph2(input_lines):
    paragraph = []  # store current paragraph as a list
    for line in input_lines:
        if line.strip():  # True if line is non-empty (apart from whitespace)
            paragraph.append(line)
        elif paragraph:  # If we see an empty line, return paragraph (if any)
            yield "".join(paragraph)
            paragraph = []
    if paragraph:  # After end of input, return final paragraph (if any)
        yield "".join(paragraph)

# sample code, to split all script input files into paragraphs
for paragraph in split_paragraph2(fileinput.input()):
    print(f"<<{paragraph}>>\n")

I usually split then filter out the '' and strip. ;)

a =\
'''
Hello world,
  this is an example.

Let´s program something.


Creating  new  program.


'''

# split on blank lines (not on every line), drop empty pieces, strip the rest
data = [content.strip() for content in a.split("\n\n") if content.strip()]

print(data)

This worked for me:

text = "".join(text.splitlines())
# split on whatever almost always ends a sentence in your text (a period, question mark, etc.)
sentences = text.split(".")

Easier. I had the same problem.

Just replace the double newline ("\n\n") with a character that you seldom see in the text (here ¾):


a ='''
Hello world,
  this is an example.

Let´s program something.


Creating  new  program.'''
a = a.replace("\n\n" , "¾")

splitted_text = a.split('¾')

print(splitted_text)

The technical posts on this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 