How do I delimit my input by this capture group?
For this regular expression:
(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]
I want the input string to be split at the captured \s character (the match shown in green in the linked example).
However, when I run this:
import re
# Python 3 does not accept the ur'' prefix; a plain raw string behaves the same here
p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')
test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
re.split(p, test_str)
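it seems to split the string at the whole region matched by [.?!]+ and [A-Z0-9] (so those characters are wrongly dropped), and it leaves the \s in the result.
To clarify: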
Input: he paid a lot for it. Did he mind
Received output: ['he paid a lot for it','\s','id he mind']
Expected output: ['he paid a lot for it.','Did he mind']
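The behaviour described in the question can be reproduced on the short input alone. This is a minimal sketch assuming Python 3, so the pattern is written as r'' rather than ur''; the names pattern, sample and parts are illustrative. Note that the element written as '\s' above is really the captured space character.

```python
import re

# Original pattern from the question: the whitespace is captured,
# while the punctuation and the following character are consumed by the match.
pattern = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')

sample = "he paid a lot for it. Did he mind"
parts = pattern.split(sample)

print(parts)  # ['he paid a lot for it', ' ', 'id he mind']
```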
You need to remove the capturing group around \s and move the last character class into a lookahead so that it is excluded from the match:
p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')
#                                         ^^^^^        ^
test_str = u"Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.split(test_str))
Any capturing group in the regex pattern produces an additional element in the list returned by re.split.
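As a small illustration of that rule, the sketch below compares the same delimiter with and without a capturing group, then applies the lookahead version to the short input from the question. The comma-separated string and the variable names are made up for the example.

```python
import re

# Capturing group: each matched delimiter is returned as its own element.
with_group = re.split(r'(,)', 'a,b,c')

# No capturing group: the delimiter is simply dropped.
without_group = re.split(r',', 'a,b,c')

# The lookahead keeps the uppercase letter or digit out of the match,
# so it stays at the start of the next piece.
fixed = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')
pieces = fixed.split("he paid a lot for it. Did he mind")

print(with_group)     # ['a', ',', 'b', ',', 'c']
print(without_group)  # ['a', 'b', 'c']
print(pieces)         # ['he paid a lot for it', 'Did he mind']
```

The punctuation is still part of the delimiter in this version, so it is dropped from the pieces.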
To force the punctuation to stay inside the "sentences", you can use this matching regex with re.findall instead:
import re
p = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.findall(test_str))
Result:
['Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it.', 'Did he mind?', "Adam Jones Jr. thinks he didn't.", "In any case, this isn't true...", "Well, with a probability of .9 it isn't.23 is the ish.", 'My name is!', "Why wouldn't you... this is.", 'Andrew']
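The regex follows the rules from your original pattern:
\s* - matches 0 or more whitespace characters, which are left out of the result
((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+) - a capturing group with 2 alternatives; its content is what re.findall returns:
  (?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])* - 0 or more occurrences of any of the following:
    (?:Mr|Dr|Ms|Jr|Sr)\. - an abbreviated title followed by its dot
    \.(?!\s+[A-Z0-9]) - a dot that is not followed by 1 or more whitespace characters and then an uppercase letter or a digit
    [^.!?] - any character other than ., ! and ?
  [.?!] - followed by a single ., ? or ! that closes the sentence
  or...
  [^.!?]+ - 1 or more characters other than ., ! and ?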
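The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace and allows comments inside the regex. The following is only a sketch of the same pattern in that layout; the names sentence_re, compact_re and text, and the shortened test sentence, are assumptions made for the example.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each part carries its comment.
sentence_re = re.compile(r"""
    \s*                           # leading whitespace, left out of the result
    (                             # captured sentence: one of two alternatives
        (?:
            (?:Mr|Dr|Ms|Jr|Sr)\.  # an abbreviated title with its dot
          | \.(?!\s+[A-Z0-9])     # a dot not followed by whitespace + uppercase letter/digit
          | [^.!?]                # any character other than . ! ?
        )*
        [.?!]                     # the punctuation mark that closes the sentence
      |
        [^.!?]+                   # or a trailing run without . ! ?
    )
""", re.VERBOSE)

# Compact form from the answer, for comparison.
compact_re = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')

text = "Mr. Smith paid 1.5 million. Did he mind? Yes"
sentences = sentence_re.findall(text)

print(sentences)  # ['Mr. Smith paid 1.5 million.', 'Did he mind?', 'Yes']
```

Both forms should return the same list for the same input; the verbose layout is just easier to maintain if more abbreviations need to be added later.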