
Python - Removing all occurrences of a character only when it appears before non-capitalized words

I have the following string:

mystring= "Foo some \n information \n Bar some \n more \n information \n Baz more \n information"

I would like to keep "\n" only when it precedes a word that starts with a capital letter, and remove all other instances of "\n" in my sentence.

Desired output:

"Foo some information \n Bar some more information \n Baz more information"

Is there a way to do this with re.sub? I can think of splitting the string into words and checking word[0].isupper() for each one. However, I believe there may be a way to identify capitalized words with a regex.
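For comparison, the split-based idea mentioned above could look roughly like the sketch below. The helper name `join_wrapped_lines` is made up for illustration, and the sketch assumes that "capitalized" means the first non-space character after a line break is an uppercase letter.

```python
def join_wrapped_lines(text):
    """Keep a line break only before a piece that starts with an uppercase letter."""
    # Split on line breaks and trim the spaces around each piece.
    pieces = [piece.strip() for piece in text.split("\n")]
    result = pieces[0]
    for piece in pieces[1:]:
        # piece[:1] is "" for an empty piece, and "".isupper() is False.
        separator = " \n " if piece[:1].isupper() else " "
        result += separator + piece
    return result


mystring = "Foo some \n information \n Bar some \n more \n information \n Baz more \n information"
print(repr(join_wrapped_lines(mystring)))
```

This normalizes the spacing around every line break, which happens to match the desired output here but may not suit input whose spacing has to be preserved.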

You may use this negative lookahead regex:

>>> import re
>>> mystring = "Foo some \n information \n Bar some \n more \n information \n Baz more \n information"
>>> print(re.sub(r'\n(?! *[A-Z]) *', '', mystring))
Foo some information
 Bar some more information
 Baz more information

RegEx Details:

  • \n : Match a line break
  • (?! *[A-Z]) * : A negative lookahead asserting that no uppercase letter follows after optional spaces, then a match of 0 or more spaces (so the spaces after a removed line break are removed too).
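The same substitution can be checked as a standalone script. This is only a sketch built around the question's sample string; the variable names `pattern` and `result` are chosen here for illustration.

```python
import re

mystring = "Foo some \n information \n Bar some \n more \n information \n Baz more \n information"

# \n          : a line break
# (?! *[A-Z]) : not followed by optional spaces and then an uppercase letter
#  *          : also consume the spaces that follow the removed line break
pattern = re.compile(r'\n(?! *[A-Z]) *')

result = pattern.sub('', mystring)
print(repr(result))
```

Printing with `repr` makes the surviving "\n" characters visible instead of rendering them as actual line breaks.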

If the text may span paragraphs (notwithstanding the reference to "sentence" in the question), you could use the regex

 *\n *(?!\n*[A-Z])

(with a space preceding the first * ).

Matches are replaced with a single space.

Demo

This performs the following operations:

 *            # match 0+ spaces
\n            # match a newline char
 *            # match 0+ spaces
(?!\n*[A-Z])  # negative lookahead: the match must not be followed
              # by 0+ newlines and then an uppercase letter

As shown at the link, the text

Now is the time for all good regexers
to social distance themselves.
Here's to negative lookbehinds!

And also to positive lookbehinds!

becomes

Now is the time for all good regexers to social distance themselves.
Here's to negative lookbehinds!

And also to positive lookbehinds!

even though the newline character following "negative lookbehinds!" is followed not directly by an uppercase letter, but by another newline and then an uppercase letter.
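The paragraph example can be reproduced with a short script. The sketch below also applies the same pattern to the question's original string, where each line break is followed by a space: there, backtracking lets the second ` *` give up that space, the lookahead then sees a space rather than the uppercase letter, and the line break is replaced anyway. The `variant` pattern is a hypothetical adjustment, not part of the answer, that moves the lookahead directly after `\n` and lets it skip spaces as well as newlines.

```python
import re

# The pattern from the answer: spaces, a newline, spaces, then the lookahead.
pattern = re.compile(r' *\n *(?!\n*[A-Z])')

text = (
    "Now is the time for all good regexers\n"
    "to social distance themselves.\n"
    "Here's to negative lookbehinds!\n"
    "\n"
    "And also to positive lookbehinds!"
)
joined = pattern.sub(' ', text)
print(joined)

# Caveat: in the question's string every line break is followed by a space.
mystring = "Foo some \n information \n Bar some \n more \n information \n Baz more \n information"
flattened = pattern.sub(' ', mystring)
print('\n' in flattened)  # False: the line breaks before "Bar" and "Baz" are lost

# Hypothetical variant (an assumption, not from the answer): the lookahead
# sits right after \n and skips both spaces and newlines.
variant = re.compile(r' *\n(?![ \n]*[A-Z]) *')
print(repr(variant.sub(' ', mystring)))
```

With the variant, the question's string yields the desired output and the paragraph text is joined in the same way as before.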

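If the string ends with a newline, that newline is matched as well and replaced by a space. That's because I'm using a negative lookahead rather than a positive one: a negative lookahead also succeeds at the end of the string, where nothing follows.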
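A small sketch of the end-of-string behaviour. The `positive` pattern is a hypothetical counterpart written only for contrast; it is not from the answer, and the sample string is made up.

```python
import re

# Pattern from the answer (negative lookahead).
negative = re.compile(r' *\n *(?!\n*[A-Z])')
# Hypothetical positive-lookahead counterpart: it requires a following
# character that is neither an uppercase letter nor a newline.
positive = re.compile(r' *\n *(?=[^A-Z\n])')

sample = "some text\nmore text\n"

with_negative = negative.sub(' ', sample)
with_positive = positive.sub(' ', sample)

print(repr(with_negative))  # trailing newline replaced by a space
print(repr(with_positive))  # trailing newline kept
```

The positive lookahead needs an actual character to inspect, so it cannot match at the very end of the string, whereas the negative lookahead succeeds there because nothing that follows is an uppercase letter.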
