Sentence segmentation with Regex in Python
I am writing a script to split text into sentences with Python. However, I am not very good at writing more complex regular expressions.
There are 5 rules according to which I wish to split the sentences. I want to split sentences if they:
* end with "!" or
* end with "?" or
* end with "..." or
* end with "." and the full stop is not followed by a number or
* end with "." and the full stop is followed by a whitespace
What would be the regular expression for this in Python?
You can literally translate your five bullet points to a regular expression:
!|\?|\.{3}|\.\D|\.\s
Note that I'm simply creating an alternation consisting of five alternatives, each of which represents one of your bullet points:
!
\?
\.{3}
\.\D
\.\s
Since the dot (.) and the question mark (?) are special characters within a regular expression pattern, they need to be escaped by a backslash (\) to be treated as literals. The pipe (|) is the delimiting character between two alternatives.
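To illustrate the escaping rules, here is a minimal sketch using Python's standard re module. The sample string and variable names are made up for the demonstration and do not come from the original answer.

```python
import re

# Hypothetical sample string containing the three characters discussed.
sample = "a.b?c|d"

# Unescaped, "." matches any single character (except a newline by default).
any_char = re.findall(r".", sample)

# Escaped with a backslash, "." and "?" match only themselves.
literal_dot = re.findall(r"\.", sample)
literal_question = re.findall(r"\?", sample)

# The unescaped pipe separates two alternatives: a literal "." or a literal "?".
either = re.findall(r"\.|\?", sample)

print(any_char)          # ['a', '.', 'b', '?', 'c', '|', 'd']
print(literal_dot)       # ['.']
print(literal_question)  # ['?']
print(either)            # ['.', '?']
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the regex engine sees them.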
Using the above regular expression, you can then split your text into sentences using re.split.
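As a rough sketch of how this might look in practice, the snippet below passes the pattern from the answer to re.split. The helper split_sentences, the sample text, and the second pattern (which uses a negative lookahead) are assumptions added for illustration, not part of the original answer.

```python
import re

# Pattern quoted from the answer, written as a raw string.
ANSWER_PATTERN = re.compile(r"!|\?|\.{3}|\.\D|\.\s")

# Assumed variant (not from the answer): the negative lookahead (?!\d) checks
# that the full stop is not followed by a digit without consuming any character.
LOOKAHEAD_PATTERN = re.compile(r"!|\?|\.{3}|\.(?!\d)")


def split_sentences(text, pattern):
    """Split text on the pattern and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in pattern.split(text) if piece.strip()]


text = "Is it ready? Yes! It costs 3.50 euros. Wait... Done."

print(split_sentences(text, ANSWER_PATTERN))
# ['Is it ready', 'Yes', 'It costs 3.50 euros', 'Wait', 'Done.']

print(split_sentences(text, LOOKAHEAD_PATTERN))
# ['Is it ready', 'Yes', 'It costs 3.50 euros', 'Wait', 'Done']
```

Two things are worth noting. re.split discards the matched delimiters, so the punctuation is removed from the results. Also, the alternatives \.\D and \.\s both require a character after the full stop, so with the answer's pattern a full stop at the very end of the text is not matched and stays attached to the last piece; the lookahead variant is one possible way around that.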