Print words between two particular words in a given string
If one particular word is not terminated by another particular word, leave it alone. Here is my string:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
I want to print and count all the words between john and dead, death, or died. If a john is not followed by any of died, dead, or death, leave it alone and start again from the next john.
My code:
import re

x = re.sub(r'[^\w]', ' ', x)  # replace dots, commas, and other special symbols with spaces
for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print(i)
    print(len([word for word in i.split()]))
My output:
got shot
2
with his john got killed or
6
with his wife
3
Desired output:
got shot
2
got killed or
3
with his wife
3
I don't know where I am going wrong. This is just a sample input; I have to check 20,000 inputs at a time.
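I guess you want to start over when another john appears in the string before dead|died|death occurs.
In that case, you can split the string on the word john and then match against each of the resulting parts: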
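import re

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
# Strip punctuation, then collapse runs of whitespace into single spaces.
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    # Match from the start of each piece up to the first dead/died/death.
    m = re.match(r'(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))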
This yields:
got shot
2
got killed or
3
with his wife
3
Also, note that after the replacements I propose (and before splitting and matching), the string looks like this:
john got shot dead john with his john got killed or died in 1990 john with his wife dead or died
That is, there are no runs of multiple spaces. You could handle that later by splitting on whitespace, but I find this a bit cleaner.
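For running over many inputs, the split-and-match idea could be wrapped in a small helper. This is only a sketch and not part of the original answer: the function name `words_after_john` is hypothetical, Python 3 is assumed, and, unlike the snippet above, it skips the text before the first john, since that piece does not start at a john.

```python
import re

def words_after_john(text):
    """Return (phrase, word_count) for each john-segment that reaches dead/died/death."""
    # Strip punctuation, then collapse runs of whitespace into single spaces.
    cleaned = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', text)).strip()
    results = []
    # [1:] skips whatever precedes the first 'john' (an assumption of this sketch).
    for part in cleaned.split('john')[1:]:
        m = re.match(r'(.+?)(dead|died|death)', part)
        if m:
            phrase = m.group(1).strip()
            results.append((phrase, len(phrase.split())))
    return results

x = ('john got shot dead. john with his .... ? , john got killed or died in 1990. '
     'john with his wife dead or died')
for phrase, count in words_after_john(x):
    print(phrase, count)
```

For the sample string this returns the pairs ('got shot', 2), ('got killed or', 3), and ('with his wife', 3).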
You can use this negative lookahead regex:
>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print(i.strip())
...     print(len([word for word in i.split()]))
...
got shot
2
got killed or
3
with his wife
3
Instead of your .*?, this regex uses (?:(?!john).)*?, which lazily matches zero or more of any character, but only as long as john does not occur within the match.
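To make the difference concrete, here is a small comparison of the two patterns on a shortened input. It is an illustrative sketch, assuming Python 3; the variable names `plain` and `tempered` are made up for this example.

```python
import re

x = 'john with his john got killed or died'

# Plain lazy match: starts after the first john and runs across the second one.
plain = re.findall(r'(?<=john)(.*?)(?=dead|died|death)', x)

# Guarded match: a character is consumed only if 'john' does not start at it,
# so the attempt after the first john fails and the second john is used instead.
tempered = re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x)

print(plain)     # [' with his john got killed or ']
print(tempered)  # [' got killed or ']
```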
I would also suggest using word boundaries so that it matches whole words only:
re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
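Since the question mentions checking many inputs, the word-boundary pattern could be compiled once and reused. The following is a sketch under assumptions: Python 3, a hypothetical helper name `count_words`, and made-up sample inputs. The second input is chosen to show that, with word boundaries, johnny does not start a match and deadline does not end one.

```python
import re

# Compile once so the same pattern is reused for every input string.
PATTERN = re.compile(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)')

def count_words(text):
    """Return (phrase, word_count) pairs for one input string."""
    text = re.sub(r'[^\w]', ' ', text)  # same punctuation cleanup as in the question
    return [(m.strip(), len(m.split())) for m in PATTERN.findall(text)]

inputs = [
    'john got shot dead. john with his .... ? , john got killed or died in 1990.',
    'johnny is dead. john missed a deadline and died.',
]
for s in inputs:
    print(count_words(s))
```

The first input gives ('got shot', 2) and ('got killed or', 3); the second gives only ('missed a deadline and', 4).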