
Sentence segmentation with Regex in Python

I am writing a Python script to split text into sentences. However, I am not very good at writing more complex regular expressions.

There are 5 rules according to which I wish to split the sentences. I want to split sentences if they:

* end with "!"  or
* end with "?"  or
* end with "..."  or
* end with "." and the full stop is not followed by a number  or
* end with "." and the full stop is followed by a whitespace

What would be the regular expression for this in Python?

You can literally translate your five bullet points to a regular expression:

!|\?|\.{3}|\.\D|\.\s

Note that I'm simply creating an alternation consisting of five alternatives, each of which represents one of your bullet points:

  • !
  • \?
  • \.{3}
  • \.\D
  • \.\s

Since the dot ( . ) and the question mark ( ? ) are special characters within a regular expression pattern, they need to be escaped by a backslash ( \ ) to be treated as literals. The pipe ( | ) is the delimiting character between two alternatives.
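To see the effect of escaping, here is a small sketch using re.findall. The sample string and the variable names are made up for illustration only:

```python
import re

text = "a.b?c"  # hypothetical sample string

# Unescaped, the dot is a wildcard: it matches any character except a newline.
any_char = re.findall(r".", text)

# Escaped, each pattern matches only the literal character.
literal_dot = re.findall(r"\.", text)
literal_qmark = re.findall(r"\?", text)

# The pipe joins the two literals into one alternation.
either = re.findall(r"\.|\?", text)

print(any_char)       # ['a', '.', 'b', '?', 'c']
print(literal_dot)    # ['.']
print(literal_qmark)  # ['?']
print(either)         # ['.', '?']
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the regex engine sees them.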

Using the above regular expression, you can then split your text into sentences using re.split.
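A minimal sketch of that re.split call might look like the following. The helper name split_sentences, the sample text, and the step that strips whitespace and drops empty pieces are my own additions, not part of the pattern itself:

```python
import re

# The five-alternative pattern from above, compiled once for reuse.
SENTENCE_END = re.compile(r"!|\?|\.{3}|\.\D|\.\s")

def split_sentences(text):
    """Split text at every match and drop empty or whitespace-only pieces.

    Hypothetical helper for illustration; the stripping is an added convenience.
    """
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]

sample = "Hello! How are you? It costs 3.50 dollars. Bye"
sentences = split_sentences(sample)
print(sentences)
# ['Hello', 'How are you', 'It costs 3.50 dollars', 'Bye']
```

Note that re.split discards whatever the pattern matched, so the terminating punctuation does not appear in the result. The \.\D alternative also consumes the character after the full stop, so in text like "end.Next" the "N" would be lost. A full stop at the very end of the text is not matched at all, because both full-stop alternatives require a following character. If any of that matters for your input, the pattern would need adjusting, for example with lookaheads.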
