I want to split a string into sentences.
But there are some exceptions that I did not expect:
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
Desired split:
split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
How can I do this with a regex in Python?
My attempt so far:
import re
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
split = re.split(r'(?<=[.|?|!|...])\s', str)
print(split)
I got:
['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']
Expected:
['UPPERCASE.UPPERCASE. Name.']
The \s in [A-Z]+\. Name should not split.
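A side note on the character class in the attempt above: inside [...] the | is a literal pipe rather than alternation, and repeated dots add nothing, so [.|?|!|...] behaves like [.?!|]. The short sketch below illustrates this; the sample string "a| b. c" is made up for the purpose.

```python
import re

# Made-up sample: contains a literal pipe followed by whitespace.
sample = "a| b. c"

# Character class as written in the attempt: "|" is matched literally,
# so the whitespace after the pipe is treated as a split point too.
with_pipe = re.split(r'(?<=[.|?|!|...])\s', sample)

# Character class with only the intended punctuation.
without_pipe = re.split(r'(?<=[.?!])\s', sample)

print(with_pipe)     # ['a|', 'b.', 'c']
print(without_pipe)  # ['a| b.', 'c']
```

This does not affect the sample string in the question, which contains no pipe, but it can matter on other input.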
You can use
(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+
See the regex demo. Details:
- (?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
- (?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if there is an uppercase letter and a . immediately to the left of the current location, followed by 1+ whitespaces + Name (note that the + is used in the lookahead, which is why it works with Python re, and the \s+ in the lookahead is necessary to check for the presence of Name after the whitespace that will be matched and consumed by the \s+ pattern below)
- \s+ - one or more whitespace chars.
See the Python demo:
import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
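If the word that must stay attached is not always Name, the same pattern can be built from a variable. The following is only a sketch: the function name split_sentences, the parameter protected_word and the second sample string are hypothetical and not part of the original question.

```python
import re

def split_sentences(text, protected_word="Name"):
    # protected_word is a hypothetical parameter: the word that must not be
    # separated from a preceding "<uppercase letter>." token.
    # re.escape guards against regex metacharacters in the word.
    pattern = (
        r'(?<=[.?!])'                      # split point must follow ., ? or !
        r'(?<![A-Z]\.(?=\s+' + re.escape(protected_word) + r'))'
        r'\s+'                             # the whitespace that is consumed
    )
    return re.split(pattern, text)

text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(split_sentences(text))
# ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']

print(split_sentences("A. Smith went. Home.", protected_word="Smith"))
# ['A. Smith went.', 'Home.']
```

The lookbehind itself stays fixed-width (one uppercase letter plus a dot), because the variable-length part lives inside the zero-width lookahead.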