How can I perform conditional splitting with exceptions in Python?

Question

I want to split a string into sentences.

But there are some exceptions that I did not expect:

text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."

Desired split:

split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']

How can I do this using regex in Python?

My effort so far:

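import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
# Inside [...] the | is a literal pipe and the extra dots are redundant, so this class equals [.?!|]
split = re.split(r'(?<=[.|?|!|...])\s', text)
print(split)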

I got:

['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']

Expected:

['UPPERCASE.UPPERCASE. Name.']

The \s between [A-Z]+\. and Name should not trigger a split.

Answer 1

You can use

(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+

See the regex demo. Details:

(?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
(?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if an uppercase letter and a . are immediately to the left of the current location and 1+ whitespace chars followed by Name are immediately to the right. Note that the + sits inside the lookahead, not directly in the lookbehind, so the lookbehind stays fixed-width, which is why it works with Python re. The \s+ in the lookahead is necessary to check for Name after the whitespace that will be matched and consumed by the \s+ pattern below.
\s+ - one or more whitespace chars.
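
To illustrate the fixed-width point above, here is a small sketch. The first pattern is a hypothetical variant (not part of the answer) that puts \s+ directly inside the lookbehind, which the built-in re module is expected to reject. The second is the answer's pattern, tried on an input with several spaces before Name.

```python
import re

# Hypothetical variant: \s+ directly inside the lookbehind makes it variable-width,
# which Python's built-in re module refuses to compile.
try:
    re.compile(r'(?<![A-Z]\.\s+)Name')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# The answer's pattern: the variable-width part lives in a lookahead,
# so the lookbehind itself only spans two characters.
pattern = re.compile(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+')
parts = pattern.split("UPPERCASE.UPPERCASE.   Name. Text.")

print(variable_width_ok)
print(parts)
```

Because the lookahead is zero-width, the number of spaces before Name does not matter for the lookbehind's width.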

See the Python demo:

import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
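
If more than one word should be protected from splitting, the same idea can be wrapped in a helper. This is only a sketch under assumptions: the function name split_sentences and the protected_words parameter are made up for illustration, and the added \b (so that only the whole word is protected) is a small deviation from the pattern above.

```python
import re

def split_sentences(text, protected_words=("Name",)):
    """Split after ., ? or !, except between 'X.' (X uppercase) and a protected word."""
    if not protected_words:
        # Nothing to protect: plain split after sentence-ending punctuation.
        return re.split(r'(?<=[.?!])\s+', text)
    # Escape each word so it is matched literally, then join into one alternation.
    alternation = "|".join(re.escape(word) for word in protected_words)
    pattern = rf'(?<=[.?!])(?<![A-Z]\.(?=\s+(?:{alternation})\b))\s+'
    return re.split(pattern, text)

sample = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(split_sentences(sample))
print(split_sentences("See MR. Smith. Ask DR. Jones! Done.", ("Smith", "Jones")))
```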

solution1
0 ACCEPTED 2020-11-06 08:51:49
