
How can I perform conditional splitting with exceptions in Python?

I want to split a string into sentences.

But there are some exceptions that I did not expect:

str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."

Desired split:

split = ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']

How can I do this with a regex in Python?

My attempt so far:

import re
str = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
split = re.split(r'(?<=[.|?|!|...])\s', str)
print(split)

I got:

['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']

Expected, for that part:

['UPPERCASE.UPPERCASE. Name.']

The \s in [A-Z]+\. Name should not trigger a split.

You can use

(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+

See the regex demo. Details:

  • (?<=[.?!]) - a positive lookbehind that requires ., ? or ! immediately to the left of the current location
  • (?<![A-Z]\.(?=\s+Name)) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is an uppercase letter and a . that are followed by one or more whitespace chars and Name. The + sits inside the lookahead, not directly in the lookbehind, so the lookbehind itself stays fixed-width, which is why this works with Python re. The \s+ in the lookahead is needed to check for Name after the whitespace, which is then matched and consumed by the \s+ pattern below.
  • \s+ - one or more whitespace chars.
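
To see what the negative lookbehind contributes, it may help to compare the pattern with and without it. This is only a small sketch on the sample string from the question; the names basic and guarded are made up for the illustration.

```python
import re

text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."

# Positive lookbehind only: split on whitespace after every ., ? or !
basic = re.compile(r'(?<=[.?!])\s+')

# Same pattern plus the negative lookbehind: no split when the whitespace
# follows an uppercase letter + "." and is itself followed by "Name"
guarded = re.compile(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+')

print(basic.split(text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE.', 'Name.', 'Text.']
print(guarded.split(text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
```

The only difference between the two results is the whitespace between UPPERCASE. and Name, which is exactly the spot the negative lookbehind protects.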

See the Python demo:

import re
text = "Text... Text. Text! Text? UPPERCASE.UPPERCASE. Name. Text."
print(re.split(r'(?<=[.?!])(?<![A-Z]\.(?=\s+Name))\s+', text))
# => ['Text...', 'Text.', 'Text!', 'Text?', 'UPPERCASE.UPPERCASE. Name.', 'Text.']
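
If more than one word should be protected, the same idea can probably be generalized by building the lookahead from a list. The helper below is a hypothetical sketch, not part of the original answer: split_sentences and protected_words are invented names, and the trailing \b is an extra assumption so that a protected word only matches as a whole word.

```python
import re

def split_sentences(text, protected_words=("Name",)):
    """Split after ., ? or !, except between an uppercase letter + "."
    and one of the protected words (hypothetical helper)."""
    if not protected_words:
        # Nothing to protect: fall back to the plain lookbehind split
        return re.split(r'(?<=[.?!])\s+', text)
    # Escape each word so regex metacharacters in it are taken literally
    words = "|".join(re.escape(w) for w in protected_words)
    pattern = r'(?<=[.?!])(?<![A-Z]\.(?=\s+(?:' + words + r')\b))\s+'
    return re.split(pattern, text)

print(split_sentences("A.B. Name. Next. C.D. Title. End.", ("Name", "Title")))
# => ['A.B. Name.', 'Next.', 'C.D. Title.', 'End.']
```

With the default argument it behaves like the pattern above on the sample string from the question.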
