
Split sentence by period followed by a capital letter

I'm trying to find a regex that will split a piece of text into sentences at a . / ? / ! that is followed by a space and then a capital letter.

"Hello there, my friend. In other words, i.e. what's up, man."

should split to:

Hello there, my friend| In other words, i.e. what's up, man|

I can get it to split on . / ? / !, but I have had no luck with the space and capital letter criteria.

What I came up with:

.split("/. \s[A-Z]/")

Split a piece of text into sentences at each . / ? / ! that is followed by a space and then a capital letter.

You may use a regex based on a lookahead:

s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/[!?.](?=\s+\p{Lu})/)

See the Ruby demo. In case you also need to split at the punctuation at the end of the string, use /[!?.](?=(?:\s+\p{Lu})|\s*\z)/.
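As a quick check of the end-of-string variant, here is a minimal sketch run against the sample string from the question. The variable names (end_aware, parts) and the extra strip call that tidies leading spaces are illustrative assumptions, not part of the original answer.

```ruby
s = "Hello there, my friend. In other words, i.e. what's up, man."

# Split on ., ? or ! when it is followed by whitespace + an uppercase letter,
# or by optional whitespace and the end of the string.
end_aware = /[!?.](?=(?:\s+\p{Lu})|\s*\z)/

# String#split drops the trailing empty string left after the final period.
parts = s.split(end_aware).map(&:strip)
p parts
# => ["Hello there, my friend", "In other words, i.e. what's up, man"]
```

The period inside "i.e." is left alone because it is followed by a lowercase letter.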

Details:

  • [!?.] - matches a !, ? or . that is...
  • (?=\s+\p{Lu}) - (a positive lookahead) followed by 1+ whitespace characters and then 1 uppercase letter immediately to the right of the current location.
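Two consequences of the pattern described above can be seen in a small sketch: the lookahead consumes nothing, so the whitespace stays at the start of the next piece, and \p{Lu} matches any Unicode uppercase letter rather than only A-Z. The French sample text and the variable names are assumptions made for illustration.

```ruby
text = "Ça va? Être ou ne pas être."

# \p{Lu} recognises the accented uppercase letter after the question mark.
unicode_parts = text.split(/[!?.](?=\s+\p{Lu})/)

# [A-Z] does not, so no split point is found.
ascii_parts = text.split(/[!?.](?=\s+[A-Z])/)

p unicode_parts # => ["Ça va", " Être ou ne pas être."]
p ascii_parts   # => ["Ça va? Être ou ne pas être."]
```

Note the leading space kept in the second element of unicode_parts: only the punctuation mark itself is removed by the split.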

See the Rubular demo.

NOTE: If you need to split regular English text into sentences, you should consider using existing NLP solutions/libraries. See:

The latter is based on regex, and can easily be extended with more regular expressions.
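To illustrate why a plain regex can fall short on ordinary English text, here is a small sketch using the lookahead pattern from this answer on a sentence containing an abbreviation. The sample sentence and variable names are made up for the example.

```ruby
text = "I spoke to Mr. Smith. He agreed."

# "Mr." is followed by a space and an uppercase letter,
# so it is treated as a sentence boundary as well.
pieces = text.split(/[!?.](?=\s+\p{Lu})/)
p pieces
# => ["I spoke to Mr", " Smith", " He agreed."]
```

The text has two sentences, but the split yields three pieces, which is the kind of case a dedicated sentence segmenter is meant to handle.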

Apart from Wiktor's answer, you can also use lookarounds to find a zero-width position and split on it.

Regex: (?<=[.?!]\s)(?=[A-Z]) finds a zero-width position that is preceded by one of [.?!] plus a space and followed by an uppercase letter.

s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/(?<=[.?!]\s)(?=[A-Z])/)

Output

Hello there, my friend. 
In other words, i.e. what's up, man.

Ruby Demo


Update: Based on Cary Swoveland's comment.

If the OP wanted to break the string into sentences I'd suggest (?<=[.?!])\s+(?=[A-Z]), as it removes the spaces between sentences and permits the number of such spaces to be greater than one.
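A minimal sketch of the suggested pattern on a string with varying numbers of spaces between sentences; the sample text and variable names are assumptions chosen to exercise the multiple-space case.

```ruby
spaced = "One.  Two!   Three? four."

# The lookbehind keeps the punctuation with its sentence,
# while \s+ consumes every space between sentences.
sentences = spaced.split(/(?<=[.?!])\s+(?=[A-Z])/)
p sentences
# => ["One.", "Two!", "Three? four."]
```

No split happens after "Three?" because the next letter is lowercase.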
