How can I split a sentence into multiple sentences using a regex with multiple conditions?

Question

I have the sentences below. I need to split each one into multiple sentences wherever it contains a period or a matched word.

Sentence 1: There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.

Expected result:

sentences = 1. There was an error while trying to serialize parameter http://uri.org/:Message.
            2. The InnerException message with data contract name 'enumStatus:' is not expected.

Sentence 2: ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1

Expected result:

sentences = 1. ORA-01756: quoted string not properly terminated
            2. ORA-06512: at module1, line 48
            3. ORA-06512: at line 1

I am using the regex below to split the sentences.

 sentences = re.split(r'(?<=\w\.)\s|ORA-[0-9]{1,8}', input)

The first case works fine: the text is split after any word followed by a period. In the second case the sentence is split, but there are 2 issues.

The split removes the entire matched word (ORA- plus its digits), but I need to keep it.
I get 4 sentences instead of 3:
1. (empty, because the input starts with the matched word ORA-)
2. quoted string not properly terminated
3. at module1, line 48
4. at line 1

I need 3 sentences in this case.

Any help would be really appreciated.
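The two issues described above can be reproduced with a short script. This is only a sketch: the variable names `text` and `parts` are chosen here for illustration. It shows that `re.split` discards whatever the pattern matches, so an alternative that matches the ORA code itself both removes the code and leaves an empty first element when the input starts with it.

```python
import re

text = ("ORA-01756: quoted string not properly terminated "
        "ORA-06512: at module1, line 48 ORA-06512: at line 1")

# The second alternative consumes "ORA-" plus its digits,
# so re.split throws that text away as the separator.
parts = re.split(r'(?<=\w\.)\s|ORA-[0-9]{1,8}', text)

for index, part in enumerate(parts, start=1):
    print(index, repr(part))
```

Because the separator is consumed, a pattern that matches only the whitespace in front of the code (with the code itself inside a lookahead) avoids both problems.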

Answer 1

You may use this regex for splitting:

\s+(?=ORA-\d+)|(?<=\.)\s+(?=[A-Z])

RegEx Details:

\s+(?=ORA-\d+) : Match 1+ whitespace characters if they are followed by ORA- and 1+ digits
| : OR
(?<=\.)\s+(?=[A-Z]) : Match 1+ whitespace characters if they are preceded by a dot and followed by an uppercase letter
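To see what each alternative contributes, the two halves of the pattern can be compiled and tried separately. This is a minimal sketch: the names `ora_boundary` and `sentence_boundary` and the short prose sample are assumptions made for illustration, not part of the original answer.

```python
import re

# Whitespace that is immediately followed by an ORA-<digits> code.
ora_boundary = re.compile(r'\s+(?=ORA-\d+)')
# Whitespace that sits between a period and an uppercase letter.
sentence_boundary = re.compile(r'(?<=\.)\s+(?=[A-Z])')

ora_text = ("ORA-01756: quoted string not properly terminated "
            "ORA-06512: at module1, line 48 ORA-06512: at line 1")
prose_text = "First sentence. Second sentence."

# Only whitespace is consumed, so the ORA codes and the periods stay in place.
ora_parts = ora_boundary.split(ora_text)
prose_parts = sentence_boundary.split(prose_text)

print(ora_parts)
print(prose_parts)
```

Since both alternatives match only whitespace and keep everything else in lookarounds, nothing but the separating spaces is dropped. One caveat of the second alternative is that it also splits after abbreviations such as "e.g." when an uppercase letter follows.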

Code:

import re
arr = ["There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.", "ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1"]

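# Split on whitespace before an ORA-<digits> code, or on whitespace between a period and an uppercase letter
rx = re.compile(r'\s+(?=\bORA-\d+)|(?<=\.)\s+(?=[A-Z])')
for i in arr:
    print(rx.split(i))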

Output:

['There was an error while trying to serialize parameter http://uri.org/:Message.', "The InnerException message with data contract name 'enumStatus:' is not expected."]
['ORA-01756: quoted string not properly terminated', 'ORA-06512: at module1, line 48', 'ORA-06512: at line 1']

Answer 2

(?<=\w\.)\s|(ORA-[0-9]{1,8})

You can try this pattern and replace each match with \n\1, which puts every sentence on its own line.

See demo.

https://regex101.com/r/8yvUuZ/1/

import re

regex = r"(?<=\w\.)\s|(ORA-[0-9]{1,8})"

test_str = ("ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1\n"
    "There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.")

subst = "\\n\\1"

# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
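The substitution above yields one string with newlines rather than a list. A possible way to turn it into the list of sentences the question asks for is sketched below. The helper name `split_via_sub` is hypothetical, and the sketch assumes Python 3.5 or later, where an unmatched group in the replacement is substituted with an empty string.

```python
import re

# Pattern from this answer: group 1 captures the ORA code so it can be re-inserted.
PATTERN = re.compile(r"(?<=\w\.)\s|(ORA-[0-9]{1,8})")


def split_via_sub(text):
    """Hypothetical helper: mark each boundary with a newline, then split on newlines."""
    marked = PATTERN.sub(r"\n\1", text)
    # The substitution leaves a leading newline and trailing spaces,
    # so strip every line and drop the empty ones.
    return [line.strip() for line in marked.splitlines() if line.strip()]


ora_sentences = split_via_sub(
    "ORA-01756: quoted string not properly terminated "
    "ORA-06512: at module1, line 48 ORA-06512: at line 1"
)
prose_sentences = split_via_sub(
    "There was an error while trying to serialize parameter http://uri.org/:Message. "
    "The InnerException message with data contract name 'enumStatus:' is not expected."
)

print(ora_sentences)
print(prose_sentences)
```

The extra strip-and-filter step is needed because this pattern inserts the newline in front of the ORA code without removing the space before it.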

Answer 1: score 1, accepted, 2021-07-12 10:06:42
Answer 2: score 0, 2021-07-12 10:07:03
