How to split a sentence into multiple sentences based on multiple regex conditions?

Question

I have the sentences below. I need to split each one into multiple sentences wherever it contains a dot or a matched word.

Sentence 1: There was an error while trying to serialize parameter http://uri.org/:Message . The InnerException message with data contract name 'enumStatus:' is not expected.

Expected result:

sentences =     1. There was an error while trying to serialize parameter http://uri.org/:Message.
                2. The InnerException message with data contract name 'enumStatus:' is not expected.

Sentence 2: ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1

Expected result:

sentences =  1. ORA-01756: quoted string not properly terminated
             2. ORA-06512: at module1, line 48
             3. ORA-06512: at line 1

I am using the regex below to split the sentences.

 sentences = re.split(r'(?<=\w\.)\s|ORA-[0-9]{1,8}', input)

The first case works fine: the sentence is split after any word followed by a dot. The second case also splits, but I have 2 issues.

The split removes the entire matched word (the 'ORA-' code), but I need to keep it.
I get 4 sentences instead of 3:
1. (empty, because the input starts with ORA-)
2. quoted string not properly terminated
3. at module1, line 48
4. at line 1

I need 3 sentences in this case.
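
The behaviour described above can be reproduced in a few lines. This is only a sketch: the variable names `text`, `consuming`, and `preserving` are illustrative, and the second pattern is one possible variation (moving the ORA code into a lookahead), not the only fix.

```python
import re

text = ("ORA-01756: quoted string not properly terminated "
        "ORA-06512: at module1, line 48 ORA-06512: at line 1")

# Pattern from the question: the ORA code is part of the match,
# so re.split consumes it and leaves an empty first element.
consuming = re.split(r'(?<=\w\.)\s|ORA-[0-9]{1,8}', text)

# Variation (an assumption, not the asker's code): the ORA code sits
# inside a lookahead, so only the whitespace before it is consumed.
preserving = re.split(r'(?<=\w\.)\s|\s+(?=ORA-[0-9]{1,8})', text)

print(len(consuming), repr(consuming[0]))
print(preserving)
```

Because `re.split` discards whatever the pattern matches, anything that must survive the split has to stay outside the consumed part of the match, for example in a lookahead or lookbehind.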

Any help would be really appreciated.

Answer 1

You may use this regex for splitting:

\s+(?=ORA-\d+)|(?<=\.)\s+(?=[A-Z])

RegEx Details:

\s+(?=ORA-\d+) : Match 1+ whitespace if that is followed by ORA- and 1+ digits
| : OR
(?<=\.)\s+(?=[A-Z]) : Match 1+ whitespace if that is preceded by a dot and followed by an uppercase letter
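
The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the regex. A minimal sketch, assuming the name `SPLIT_RX` and the sample strings, which are illustrative only:

```python
import re

# Same two alternatives as above, written in verbose mode so each
# part carries its own explanation.
SPLIT_RX = re.compile(r"""
    \s+ (?= ORA-\d+ )            # whitespace followed by ORA- and digits
  |                              # OR
    (?<= \. ) \s+ (?= [A-Z] )    # whitespace after a dot, before an uppercase letter
""", re.VERBOSE)

parts = SPLIT_RX.split("First part. Second part ORA-00001: x")
unsplit = SPLIT_RX.split("see e.g. the docs")

print(parts)
print(unsplit)
```

The second call shows the effect of the `(?=[A-Z])` lookahead: a dot followed by a lowercase word is not treated as a sentence boundary.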

Code:

import re
arr = ["There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.", "ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1"]

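# Split on whitespace that precedes an ORA code, or that follows a dot
# and precedes an uppercase letter.
rx = re.compile(r'\s+(?=\bORA-\d+)|(?<=\.)\s+(?=[A-Z])')
for s in arr:
    print(rx.split(s))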

Output:

['There was an error while trying to serialize parameter http://uri.org/:Message.', "The InnerException message with data contract name 'enumStatus:' is not expected."]
['ORA-01756: quoted string not properly terminated', 'ORA-06512: at module1, line 48', 'ORA-06512: at line 1']

Answer 2

(?<=\w\.)\s|(ORA-[0-9]{1,8})

You can try this pattern and replace each match with \n\1.

See demo.

https://regex101.com/r/8yvUuZ/1/

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(?<=\w\.)\s|(ORA-[0-9]{1,8})"

test_str = ("ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1\n"
    "There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.")

subst = "\\n\\1"

# You can limit the number of replacements with the count argument (0 replaces all matches)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
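
The substitution above yields a single string with newlines, while the question asks for a list of sentences. One way to bridge the gap, sketched under the assumption of Python 3.5 or later (where an unmatched group in the replacement becomes an empty string), is to split the result on line breaks and drop empty pieces. The helper name `split_via_sub` is hypothetical.

```python
import re

REGEX = r"(?<=\w\.)\s|(ORA-[0-9]{1,8})"

def split_via_sub(text):
    """Hypothetical helper: mark boundaries with newlines, then split on them."""
    # \1 is empty when the first alternative matched (Python 3.5+).
    marked = re.sub(REGEX, r"\n\1", text)
    # strip() removes the space left before each ORA code;
    # the filter drops the empty piece created by a leading ORA code.
    return [line.strip() for line in marked.splitlines() if line.strip()]

ora = ("ORA-01756: quoted string not properly terminated "
       "ORA-06512: at module1, line 48 ORA-06512: at line 1")
msg = ("There was an error while trying to serialize parameter "
       "http://uri.org/:Message. The InnerException message with data "
       "contract name 'enumStatus:' is not expected.")

print(split_via_sub(ora))
print(split_via_sub(msg))
```

This approach assumes the input contains no newlines of its own that should be preserved inside a sentence.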

2 answers

solution1
1 ACCEPTED 2021-07-12 10:06:42

solution2
0 2021-07-12 10:07:03
