I want to extract all letter-only "chains" of one or more words, where each word starts with an uppercase letter followed by lowercase letters, and the words are joined either by a space (" ") or by a hyphen ("-"). A single "chain" cannot mix the two separators.
For example, for the following text:
For the First Stage, you should press Start and you should follow Step-One and Step-Three. For the Final Stage, you must follow the sequence of One-Two-Five-Seven Steps
My output should be
["For", "First Stage", "Start", "Step-One", "Step-Three", "Final Stage", "One-Two-Five-Seven", "Steps"]
So far, I have tried writing two different regexes to solve my problem: the first regex should return "chains" connected with "-" and the second should return "chains" connected with " ":
import re
list(set(re.findall('([A-Z][a-z]+-)*[A-Z][a-z]+', mystring) + re.findall('([A-Z][a-z]+ )*[A-Z][a-z]+', mystring)))
However, I guess something is wrong with them, as neither of them works properly.
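One likely reason the attempt above misbehaves is that re.findall returns the text of the capturing group, not the whole match, whenever the pattern contains a capturing group. The short sketch below shows this on a made-up sample string (the variable names are only for illustration) and compares it with a non-capturing group:

```python
import re

sample = "Step-One and Start"  # hypothetical sample text

# With a capturing group, re.findall returns the group's text (the last
# repetition, or '' if the group did not take part in the match).
with_capture = re.findall(r"([A-Z][a-z]+-)*[A-Z][a-z]+", sample)

# A non-capturing group (?:...) makes re.findall return the full match.
without_capture = re.findall(r"(?:[A-Z][a-z]+-)*[A-Z][a-z]+", sample)

print(with_capture)     # ['Step-', '']
print(without_capture)  # ['Step-One', 'Start']
```

Even with non-capturing groups, merging the results of two separate regexes would still include fragments: the hyphen regex matches "First" and "Stage" on their own inside "First Stage", so a single pattern is the safer route.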
You can use
\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])
See the regex demo. Details:
\b - word boundary
[A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
(?=([-\s]?)) - a positive lookahead that requires either a - or whitespace char (1 or 0 times, optionally) immediately to the right of the current location, capturing the char into Group 1
(?:\1[A-Z][a-z]+)* - zero or more repetitions of:
    \1 - same text as captured in Group 1
    [A-Z][a-z]+ - an uppercase ASCII letter followed by one or more lowercase ASCII letters
\b(?!-[A-Z]) - a word boundary not followed by - and an uppercase ASCII letter.
See the Python demo:
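To see what the \1 backreference and the trailing (?!-[A-Z]) lookahead each contribute, here is a small sketch on two made-up inputs that mix separators. The helper chains and the NO_LOOKAHEAD variant are hypothetical names introduced only for this comparison:

```python
import re

# The pattern from above, plus a variant without the final lookahead.
PATTERN = re.compile(r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])")
NO_LOOKAHEAD = re.compile(r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b")

def chains(regex, text):
    """Return the full text of every match (finditer avoids findall's group output)."""
    return [m.group() for m in regex.finditer(text)]

# The backreference \1 keeps one separator per chain, so a chain stops
# as soon as the separator changes.
hyphen_first = chains(PATTERN, "One-Two Three")

# The lookahead stops a space chain from swallowing the first word of a
# hyphen chain that follows it.
space_first = chains(PATTERN, "Step One-Two")
space_first_no_lookahead = chains(NO_LOOKAHEAD, "Step One-Two")

print(hyphen_first)              # ['One-Two', 'Three']
print(space_first)               # ['Step', 'One-Two']
print(space_first_no_lookahead)  # ['Step One', 'Two']
```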
import re
pattern = r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])"
text = "For the First Stage, you should press Start and you should follow Step-One and Step-Three. For the Final Stage, you must follow the sequence of steps One-Two-Five Seven // Steps One-Two-Five-Seven"
print( list(set([x.group() for x in re.finditer(pattern, text)])) )
# => ['Step-Three', 'For', 'First Stage', 'Seven', 'One-Two-Five-Seven', 'Start', 'One-Two-Five', 'Steps', 'Step-One', 'Final Stage']
# (the order may differ between runs because set() is unordered)
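The set() call in the demo removes duplicates but loses the order of appearance, whereas the expected output in the question lists the chains in the order they occur. A minimal sketch of one way to keep that order, assuming Python 3.7+ where dict preserves insertion order, is to deduplicate with dict.fromkeys. It uses the original text from the question:

```python
import re

pattern = r"\b[A-Z][a-z]+(?=([-\s]?))(?:\1[A-Z][a-z]+)*\b(?!-[A-Z])"
text = (
    "For the First Stage, you should press Start and you should follow "
    "Step-One and Step-Three. For the Final Stage, you must follow the "
    "sequence of One-Two-Five-Seven Steps"
)

# finditer + group() yields whole matches; dict.fromkeys drops duplicates
# (the second "For") while keeping first-seen order.
ordered = list(dict.fromkeys(m.group() for m in re.finditer(pattern, text)))
print(ordered)
# ['For', 'First Stage', 'Start', 'Step-One', 'Step-Three', 'Final Stage', 'One-Two-Five-Seven', 'Steps']
```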