Detecting paragraph breaks with Regex - edge case problem

Question

I have a long text string with poorly formatted newlines. I want to remove all newline characters except when the newline follows a . , ? , ! or : character (as that would indicate the end of a sentence). I also don't want to remove the newline if it's immediately followed by a number or another newline, since that would indicate a chapter ending. I'm using Python's re module for regex.

Here is my regex so far: (?<!\.|\:|\?|\n)\n(?![\d]|\n)

Regex example with 7 unit test cases: https://regex101.com/r/nG1gU7/1118

My test is failing in the following example:

First paragraph.  <-- note a trailing space(s) after the period
Second paragraph

How do I fix this?

Answer 1

One option is to use the regex module from PyPI, which (unlike the built-in re module) supports variable-width lookbehinds. In the lookbehind, after matching one of ? . : or a newline, match optional whitespace characters other than newlines using [^\S\r\n]*.

You can also shorten the pattern by replacing the alternation with | by a character class listing all the characters.

(?<![?.:\n\r][^\S\r\n]*)\r?\n(?![\d\r\n])

Regex demo (the JavaScript engine is selected for the example)
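If adding a dependency is not an option, a similar result can probably be had with the built-in re module in two passes. This is only a sketch of a different technique, not the pattern above: it first strips spaces and tabs that sit right before a line break, so a fixed-width lookbehind is enough afterwards. The function name join_wrapped_lines and the joiner parameter are made up for illustration, and replacing the newline with a space (rather than deleting it) is an assumption.

```python
import re

def join_wrapped_lines(text, joiner=" "):
    """Join hard-wrapped lines while keeping sentence and chapter breaks."""
    # Pass 1: drop horizontal whitespace that directly precedes a line break,
    # so "period + trailing spaces + newline" becomes "period + newline".
    text = re.sub(r"[^\S\r\n]+(?=\r?\n)", "", text)
    # Pass 2: replace a newline unless it follows ? . : or a line break,
    # or is followed by a digit or another line break.
    # (Add ! to the first class if exclamation marks should end a sentence too.)
    return re.sub(r"(?<![?.:\n\r])\r?\n(?![\d\r\n])", joiner, text)

sample = (
    "First paragraph.  \n"
    "Second paragraph\n"
    "continues here.\n"
    "\n"
    "2 Chapter two\n"
    "more text"
)
print(join_wrapped_lines(sample))
```

Note that pass 1 also removes the trailing spaces themselves. With the regex module installed, regex.sub called with the single pattern from this answer should make the first pass unnecessary.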

Answer 2

Looking at the text above, the bad formatting appears to result from runs of extra whitespace. It could therefore be fixed by replacing each block of whitespace between letters, ([a-zA-Z])\s+([a-zA-Z]), with a single space, \1 \2, like this:

re.sub(r'([a-zA-Z])\s+([a-zA-Z])', r'\1 \2', Text)

See the demo at the following link: https://regex101.com/r/feRwne/1

Finally, paste the result into a Word document to check whether the formatting is acceptable.
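To see what this substitution does to line breaks, here is a small sketch applied to a sample modelled on the question's failing case. The function name collapse_between_letters and the sample string are made up for illustration. Since \s also matches newlines, any line break that sits directly between two letters is turned into a space, while a break after punctuation is left alone.

```python
import re

def collapse_between_letters(text):
    # The replacement must be a raw string: in a plain '\1 \2' literal,
    # \1 and \2 are control characters rather than group references.
    return re.sub(r"([a-zA-Z])\s+([a-zA-Z])", r"\1 \2", text)

sample = "First paragraph.  \nSecond   paragraph\ncontinues here."
result = collapse_between_letters(sample)
print(repr(result))
```

One thing to watch for: a blank line between two lines that end and start with letters is collapsed as well, so a chapter break of that shape would be lost.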

2 answers

solution1
1 ACCEPTED 2020-12-06 11:52:53

solution2
0 2020-12-06 10:23:54
