
Regex for finding chains of >=1 words starting with capital letters and connected with "-" or " "

I want to extract every letter-only "chain" of one or more words, where each word starts with an uppercase letter followed by lowercase letters, and the words are joined by either a space (" ") or a hyphen ("-"). A single chain cannot mix the two separators.

For example, for the following text:

For the First Stage, you should press Start and you should follow Step-One and Step-Three. For the Final Stage, you must follow the sequence of One-Two-Five-Seven Steps

My output should be

["For", "First Stage", "Start", "Step-One", "Step-Three", "Final Stage", "One-Two-Five-Seven", "Steps"]

So far, I have tried writing 2 different regexes to solve my problem; the first regex should return "chains" connected with "-" and the second should return "chains" connected with " ":

import re
list(set(re.findall('([A-Z][a-z]+-)*[A-Z][a-z]+', mystring) + re.findall('([A-Z][a-z]+ )*[A-Z][a-z]+', mystring)))

However, I guess something is wrong with them, as neither of them works properly.

You can use

\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])

See the regex demo. Details:

  • \b - word boundary
  • [A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
  • (?=([-\s]?)) - a positive lookahead that captures an optional - or whitespace char (or an empty string if neither is present) immediately to the right of the current location into Group 1, without consuming it
  • (?:\1[A-Z][a-z]+)* - zero or more repetitions of
    • \1 - same text as captured in Group 1
    • [A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
  • \b(?!-[A-Z]) - a word boundary not followed by - and an uppercase ASCII letter.
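To see how the captured separator constrains the rest of a chain, here is a minimal sketch. The helper name `chains` and the two sample strings are made up for illustration and do not come from the question.

```python
import re

pattern = re.compile(r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])")

def chains(text):
    # Hypothetical helper: return the full matches in order of appearance.
    # m.group() is the whole match, not the separator held in Group 1.
    return [m.group() for m in pattern.finditer(text)]

# A chain that starts with "-" cannot continue with a space.
print(chains("One-Two Five"))   # ['One-Two', 'Five']

# A chain that starts with a space cannot continue with "-";
# the final (?!-[A-Z]) forces "Steps" to stand alone here.
print(chains("Steps One-Two"))  # ['Steps', 'One-Two']
```

In the second call the engine first tries "Steps One", rejects it because "-T" follows, and then falls back to the shorter match "Steps".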

See the Python demo:

import re
pattern = r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])"
text = "For the First Stage, you should press Start and you should follow Step-One and Step-Three. For the Final Stage, you must follow the sequence of steps One-Two-Five Seven // Steps One-Two-Five-Seven"
print(list(set(x.group() for x in re.finditer(pattern, text))))
# => (set order is arbitrary, one possible result) ['Step-Three', 'For', 'First Stage', 'Seven', 'One-Two-Five-Seven', 'Start', 'One-Two-Five', 'Steps', 'Step-One', 'Final Stage']
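If the order of first appearance matters, one possible variation is to deduplicate with dict.fromkeys instead of set. This is a sketch, not part of the original answer; it uses the sample text from the question, and the names `matches` and `unique_in_order` are made up for illustration.

```python
import re

pattern = r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])"
text = ("For the First Stage, you should press Start and you should follow "
        "Step-One and Step-Three. For the Final Stage, you must follow the "
        "sequence of One-Two-Five-Seven Steps")

# re.findall would return only Group 1 (the separator) because the pattern
# contains a capturing group, so finditer plus group() is used instead.
matches = [m.group() for m in re.finditer(pattern, text)]

# dict.fromkeys keeps the first occurrence of each key in insertion order.
unique_in_order = list(dict.fromkeys(matches))
print(unique_in_order)
# ['For', 'First Stage', 'Start', 'Step-One', 'Step-Three', 'Final Stage', 'One-Two-Five-Seven', 'Steps']
```

The result matches the expected output listed in the question, including its order.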
