
Detecting paragraph breaks with Regex - edge case problem

I have a long text string with poorly formatted newlines. I want to remove all newline characters except when the newline follows a . , ? , ! or : character (as that would indicate the end of a sentence). I also don't want to remove the newline if it's immediately followed by a number or another newline, since that would indicate a chapter ending. I'm using Python's re module for regex.

Here is my regex so far: (?<!\.|\:|\?|\n)\n(?![\d]|\n)

Regex example with 7 unit test cases: https://regex101.com/r/nG1gU7/1118

My test is failing in the following example:

First paragraph.  <-- note a trailing space(s) after the period
Second paragraph

How do I fix this?

One option is to use the regex PyPI module, which (unlike Python's built-in re) supports variable-width lookbehind. In the lookbehind, match optional whitespace characters other than a newline using [^\S\r\n]* after matching one of ? . : or a newline.

You can also shorten the pattern by replacing the alternation | with a character class listing all the characters.
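Python's built-in re module only accepts fixed-width lookbehind, so the optional trailing spaces cannot be placed inside it. For readers who want to stay with re, here is a minimal sketch of a different technique from the regex-module lookbehind described above: first strip horizontal whitespace that sits directly before a newline, then apply a fixed-width lookbehind. The function name join_broken_lines, the sample text, and the choice to replace each removed newline with a single space are assumptions for illustration. The ! is included in the character class because the question lists it.

```python
import re

# Spaces or tabs (any whitespace except CR/LF) sitting directly before a newline.
TRAILING_SPACE = re.compile(r"[^\S\r\n]+(?=\n)")

# A newline that is NOT preceded by . : ? ! or another newline,
# and NOT followed by a digit or another newline.
STRAY_NEWLINE = re.compile(r"(?<![.:?!\n])\n(?![\d\n])")


def join_broken_lines(text, replacement=" "):
    """Remove mid-sentence newlines; keep sentence and chapter breaks.

    Assumption: a removed newline is replaced with one space so that
    the words on either side do not run together.
    """
    text = TRAILING_SPACE.sub("", text)          # step 1: drop trailing spaces
    return STRAY_NEWLINE.sub(replacement, text)  # step 2: fixed-width lookbehind


sample = "First paragraph.  \nSecond paragraph\ncontinues here\n\n2 Chapter two"
print(repr(join_broken_lines(sample)))
```

Note that this sketch changes the text slightly more than a single pattern would, because the trailing spaces are deleted rather than merely skipped over.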


Regex demo (the JavaScript engine was selected for the example)

Looking at the text above, we may conclude that the bad formatting results from the presence of many extra spaces. It could therefore be treated by substituting each block of whitespace between letters, ([a-zA-Z])\s+([a-zA-Z]), with a single space, \1 \2, like this:

import re
Text = re.sub(r'([a-zA-Z])\s+([a-zA-Z])', r'\1 \2', Text)  # raw replacement string so \1 and \2 are group references

See the demo at the following link: https://regex101.com/r/feRwne/1
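One detail of the substitution above is worth checking before relying on it. Because the pattern consumes the letter on each side of the gap, a one-letter word cannot serve as the right edge of one match and the left edge of the next. The sketch below compares it with a lookaround variant that does not consume the letters. The function names and sample strings are hypothetical, and the replacement is written as a raw string so that \1 and \2 are read as group references.

```python
import re


def collapse_consuming(text):
    # The substitution as suggested: both neighbouring letters are part of the match.
    return re.sub(r"([a-zA-Z])\s+([a-zA-Z])", r"\1 \2", text)


def collapse_lookaround(text):
    # Variant: the letters are only asserted, so adjacent gaps can all be matched.
    return re.sub(r"(?<=[a-zA-Z])\s+(?=[a-zA-Z])", " ", text)


print(repr(collapse_consuming("a  b  c")))    # second gap is left untouched
print(repr(collapse_lookaround("a  b  c")))   # both gaps are collapsed

# \s+ also matches newlines, so a blank line between two letters is collapsed too.
print(repr(collapse_lookaround("title\n\nNext")))
```

Either form leaves a newline after . ? ! or : alone, since punctuation is not a letter, but both also join lines across a blank line when letters sit on either side, which conflicts with the chapter-break requirement in the question.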

Finally, copy and paste the result into a Word document to check whether it is acceptable.

