
Python regex to identify two consecutive capitalized words at the beginning of the line

I have this piece of text from which I want to remove both occurrences of each of the names, "Remggrehte Sertrro" and "Perrhhfson Forrtdd". I tried applying this regex: ([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+) but it matches "Remggrehte Sertrro We", "Perrhhfson Forrtdd If" and also "Mash Mush", which is inside the text. Basically I want it to match only the first two capitalized words at the beginning of the line without touching the rest. I am no regex expert and I am not sure how to adapt it.

This is the text:

Remggrehte Sertrro

Remggrehte Sertrro We did want a 4-day work week for years.

Perrhhfson Forrtdd

Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear, the economy Mash Mush will continue to.

Thanks in advance.

You can use the pattern /^([A-Z]+.*? ){2}/m if you are certain that the names always consist of exactly two capitalized terms and are always the first two terms on the line. Example working on regex101.com
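As a rough Python sketch of that pattern (the sample strings and variable names are my own, and the group is made non-capturing here, which does not change what is matched): note that each of the two terms must be followed by a space, so a line holding only the name with no trailing space is not matched.

```python
import re

# Python spelling of /^([A-Z]+.*? ){2}/m, with a non-capturing group
pattern = re.compile(r"^(?:[A-Z]+.*? ){2}", re.MULTILINE)

line_with_text = "Remggrehte Sertrro We did want a 4-day work week for years."
name_only = "Remggrehte Sertrro"  # no trailing space after the second word

# The two leading terms and the space after each of them are removed
cleaned = pattern.sub("", line_with_text)
print(cleaned)

# No space follows "Sertrro", so the pattern does not match this line
name_only_match = pattern.search(name_only)
print(name_only_match)
```

If the name-only lines should go too, they need separate handling (or a pattern that does not require the trailing space).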

You can remove the lines that contain only the names by using the re.MULTILINE flag and the following regex: r"^(?:[A-Z]\w+\s+[A-Z]\w+\s+)$". This regex matches a name only when nothing else follows it on the line.

Here is a demo:

import re

text = """\
Remggrehte Sertrro

Remggrehte Sertrro We did want a 4-day work week for years.

Perrhhfson Forrtdd

Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""

print(re.sub(r"^(?:[A-Z]\w+\s+[A-Z]\w+\s+)$", "", text, flags=re.MULTILINE))

You get:


Remggrehte Sertrro We did want a 4-day work week for years.


Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.

You don't need the positive lookahead to match the first 2 capitalized words.

In your pattern, the part (?=\s[A-Z]) can be omitted, because you first assert it and then directly match the same thing.
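A quick way to convince yourself of this is to run the pattern with and without the lookahead on the sample lines and compare the results. This is only a sketch; the variable names are mine.

```python
import re

text = (
    "Remggrehte Sertrro We did want a 4-day work week for years.\n"
    "Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear, "
    "the economy Mash Mush will continue to.\n"
)

# Pattern from the question, and the same pattern with the lookahead dropped
with_lookahead = r"([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)"
without_lookahead = r"([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)"

found_with = re.findall(with_lookahead, text)
found_without = re.findall(without_lookahead, text)

print(found_with)
print(found_with == found_without)
```

Both versions return the same list, including the unwanted "Mash Mush", so dropping the lookahead simplifies the pattern but does not by itself solve the problem; the anchor at the start of the line does.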


You could match the first 2 words without a capturing group and assert a whitespace boundary (?!\S) at the right:

^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)

Explanation

  • ^ Start of the string (or of each line when the multiline flag is set)
  • [A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
  • [^\S\r\n] Match a whitespace char except a newline, as \s could also match a newline and you want to match two consecutive capitalized words at the beginning of the line
  • [A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
  • (?!\S) Assert a whitespace boundary at the right
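For reference, here is how the pattern could be applied in Python with re.MULTILINE so that ^ anchors at every line. This is a minimal sketch; the variable names are my own, and the text is the sample from the question.

```python
import re

text = """\
Remggrehte Sertrro

Remggrehte Sertrro We did want a 4-day work week for years.

Perrhhfson Forrtdd

Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear, the economy Mash Mush will continue to.
"""

# re.MULTILINE makes ^ match at the start of every line, not only the string
pattern = re.compile(r"^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)", re.MULTILINE)

matches = pattern.findall(text)   # no groups, so full matches are returned
stripped = pattern.sub("", text)  # remove the leading names from each line

print(matches)
print(stripped)
```

Only the two words at the start of a line are matched, so "Mash Mush" stays in place. The space that followed the name on the longer lines is left behind; if that matters, you could append [^\S\r\n]* to the pattern or strip each line afterwards.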

Regex demo

Note that in [A-Z][a-z]+ the part [a-z]+ matches only the lowercase chars a-z. To match any word characters, you could use \w instead of [a-z].


 