
Detecting paragraph breaks with Regex - edge case problem

I have a long text string with poorly formatted newlines. I want to remove all of the newline characters except when the newline follows a ., ?, ! or : character (as that would indicate the end of a sentence). I also don't want to remove the newline if it's immediately followed by a digit or another newline, as that would indicate a chapter ending. I'm using Python's re module for regex.

Here is my regex so far: (?<!\.|\:|\?|\n)\n(?![\d]|\n)

Regex example with 7 unit test cases: https://regex101.com/r/nG1gU7/1118

My test is failing in the following example:

First paragraph.  <-- note a trailing space(s) after the period
Second paragraph

How do I fix this?

One option is to use the regex module from PyPI, which supports variable-length lookbehind. In the lookbehind, match optional whitespace characters other than newlines using [^\S\r\n]* after matching one of ? . : or a newline.

You can shorten the alternation with | to a character class listing all the characters.

(?<![?.:\n\r][^\S\r\n]*)\r?\n(?![\d\r\n])

Regex demo (the JavaScript engine is selected for the example)
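Python's built-in re module rejects the pattern above because its lookbehind is not fixed-width. If installing the regex module is not an option, a rough standard-library sketch of the same idea is to match only the newline with its lookahead and do the "what comes before" check in a replacement callback. The function name join_wrapped_lines and the joiner parameter are hypothetical, introduced here for illustration; joining with a space rather than deleting the newline outright is an assumption.

```python
import re

# A line break (optionally \r\n) that is NOT followed by a digit or another line break.
_CANDIDATE = re.compile(r"\r?\n(?![\d\r\n])")


def join_wrapped_lines(text, joiner=" "):
    """Replace 'soft' line breaks with `joiner`, keeping sentence and chapter breaks.

    Hypothetical helper: emulates the variable-length lookbehind
    (?<![?.:\n\r][^\S\r\n]*) by inspecting the original text in a callback.
    """

    def _replace(match):
        i = match.start()
        # Walk back over trailing horizontal whitespace (spaces, tabs, ...).
        while i > 0 and text[i - 1].isspace() and text[i - 1] not in "\r\n":
            i -= 1
        # Keep the break if the previous significant character ends a sentence
        # or is itself a line break.
        if i > 0 and text[i - 1] in "?.:\n\r":
            return match.group(0)
        return joiner

    return _CANDIDATE.sub(_replace, text)


sample = "First paragraph.  \nSecond paragraph\ncontinues here\n\n2 Chapter"
print(repr(join_wrapped_lines(sample)))
```

The callback reads from the original string, so each decision is based on the unmodified text, just as a lookbehind would be. Add ! to the character set if exclamation marks should also end a sentence.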

Looking at the text above, we may conclude that the bad formatting results from runs of extra whitespace. It could therefore be treated by substituting each block of whitespace between letters, ([a-zA-Z])\s+([a-zA-Z]), with a single space, \1 \2, like this:

re.sub(r'([a-zA-Z])\s+([a-zA-Z])', r'\1 \2', Text)

See the following link: https://regex101.com/r/feRwne/1
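As a small illustration of this substitution, the sketch below wraps it in a hypothetical helper, collapse_letter_gaps, and runs it on made-up sample text. The replacement is written as a raw string, since a plain '\1 \2' would be read by Python as control characters rather than group references.

```python
import re


def collapse_letter_gaps(text):
    """Collapse any whitespace run between two ASCII letters into one space.

    Hypothetical helper wrapping the substitution above. The replacement must
    be a raw string so that \\1 and \\2 reach re.sub as backreferences.
    """
    return re.sub(r"([a-zA-Z])\s+([a-zA-Z])", r"\1 \2", text)


sample = "Second   paragraph\ncontinues here.  \nNext paragraph"
print(repr(collapse_letter_gaps(sample)))
# 'Second paragraph continues here.  \nNext paragraph'
```

The break after "here." survives only because a period is not a letter. Two caveats with this pattern: a blank line between two letters is collapsed as well, and because each match consumes the letters on both sides, a single-letter word between two gaps ("a  b  c") leaves the second gap untouched.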

Finally, copy and paste the result into a Word document to check whether it is acceptable.
