How can I split the input on this capture group?

Question

For this regex:

(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]

I want to split the input string on the captured match, the \s character (the green match, as seen here).

However, when I run this:

import re

p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')

test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"

re.split(p, test_str)

It seems to split the string at the regions matched by [.?!]+ and [A-Z0-9] (thereby incorrectly dropping them) and leaves \s in the result.

To clarify:

Input: he paid a lot for it. Did he mind

Received output: ['he paid a lot for it','\s','id he mind']

Expected output: ['he paid a lot for it.','Did he mind']

Answer 1

You need to remove the capturing group from around (\s) and put the last character class into a lookahead to exclude it from the match:

import re
p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')
#                                         ^^^^^        ^
test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.split(test_str))

See the IDEONE demo and the regex demo.

Any capturing group in a regex pattern adds an extra element to the resulting list during re.split.
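
As a small illustration of that rule, the sketch below runs three variants of the pattern on the short input from the question. The variable names are only for this example, and the middle pattern (group removed, no lookahead) is added here just for comparison.

```python
import re

text = "he paid a lot for it. Did he mind"

# Pattern from the question: \s sits inside a capturing group.
with_group = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')
# Same pattern with the group removed: the capital letter is still consumed.
without_group = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s[A-Z0-9]')
# Pattern from this answer: the capital letter is only looked at, not consumed.
with_lookahead = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')

print(with_group.split(text))      # ['he paid a lot for it', ' ', 'id he mind']
print(without_group.split(text))   # ['he paid a lot for it', 'id he mind']
print(with_lookahead.split(text))  # ['he paid a lot for it', 'Did he mind']
```

In all three cases the punctuation itself is part of the match, so re.split drops it.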

To force the punctuation to stay inside the "sentences", you can use this matching regex with re.findall:

import re
p = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.findall(test_str))

See the IDEONE demo

Result:

['Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it.', 'Did he mind?', "Adam Jones Jr. thinks he didn't.", "In any case, this isn't true...", "Well, with a probability of .9 it isn't.23 is the ish.", 'My name is!', "Why wouldn't you... this is.", 'Andrew']
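
For the short input given in the question's clarification, the same pattern should produce exactly the expected output. A minimal sketch, where split_sentences is a hypothetical helper name used only for this example:

```python
import re

# The findall pattern from this answer, unchanged.
p = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')

def split_sentences(text):
    """Hypothetical helper: return the "sentences" captured by the pattern."""
    return p.findall(text)

print(split_sentences("he paid a lot for it. Did he mind"))
# ['he paid a lot for it.', 'Did he mind']
```

The second item has no closing punctuation, so it is picked up by the [^.!?]+ alternative.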

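Regex demo

The regex follows the rules in your original pattern:

\s* - matches 0 or more whitespace characters so that they are omitted from the result
((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+) - 2 alternatives that are captured and returned by re.findall:
- (?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])* - 0 or more sequences of...
  - (?:Mr|Dr|Ms|Jr|Sr)\. - abbreviated titles
  - \.(?!\s+[A-Z0-9]) - a dot that is not followed by 1 or more whitespace characters and then an uppercase letter or a digit
  - [^.!?] - any character other than ., ! and ?
- [.?!] - then a single closing ., ? or !
or...
- [^.!?]+ - any one or more characters other than ., ! and ?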
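
The breakdown above can also be written directly into the pattern with re.VERBOSE. This is only a sketch of an equivalent spelling, not part of the original answer; the names SENTENCE and COMPACT are chosen here for illustration.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE so that each
# piece carries its own comment. Whitespace outside character classes is ignored.
SENTENCE = re.compile(r"""
    \s*                           # leading whitespace, kept out of the capture
    (                             # group 1: what re.findall returns
        (?:                       # zero or more of:
            (?:Mr|Dr|Ms|Jr|Sr)\.  #   an abbreviated title with its dot
          | \.(?!\s+[A-Z0-9])     #   a dot that does not end a sentence
          | [^.!?]                #   any character other than . ! ?
        )*
        [.?!]                     # the punctuation that closes the sentence
      |                           # or...
        [^.!?]+                   # a trailing run without closing punctuation
    )
""", re.VERBOSE)

# The compact one-line form from the answer, for comparison.
COMPACT = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')

test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"

sentences = SENTENCE.findall(test_str)
print(sentences == COMPACT.findall(test_str))  # True
```

The verbose form is easier to extend, for example when more abbreviations need to be added to the title list.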

Answer 1
1 accepted 2015-11-22 18:33:34
