
Print words between two particular words in a given string

If one particular word is not eventually followed by the other particular word, leave it alone. Here is my string:

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'

I want to print and count all the words between john and dead, death, or died. If a john is not ended by any of died, dead, or death, leave it alone and start again from the next john.

My code:

import re

x = re.sub(r'[^\w]', ' ', x)  # replace dots, commas, and other special symbols with spaces

for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print(i)
    print(len(i.split()))

My output:

 got shot 
2
 with his          john got killed or 
6
 with his wife 
3

The output I want:

got shot
2
got killed or
3
with his wife
3

I don't know where I am going wrong. This is just a sample input. I have to check 20,000 inputs at a time.

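I assume you want to start over whenever another john appears in the string before dead|died|death occurs.

In that case, you can split the string on the word john and then match against each of the resulting parts: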

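import re

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
# Drop punctuation, then collapse runs of whitespace into single spaces.
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    # Lazily capture everything up to the first dead/died/death in this part.
    m = re.match(r'(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))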

This yields:

 got shot 
2
 got killed or 
3
 with his wife 
3

Also note that after the substitutions I proposed (before splitting and matching), the string looks like this:

john got shot dead john with his john got killed or died in 1990 john with his wife dead or died

That is, there are no runs of multiple spaces. You could handle those later by splitting on whitespace, but I find this a bit cleaner.
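Since the question mentions checking 20,000 inputs, the cleanup and split steps could be wrapped in a reusable helper with precompiled patterns. This is only a sketch: the function name `words_after_john` and the module-level pattern names are made up for illustration and are not part of the answer above.

```python
import re

# Hypothetical names; the logic mirrors the split-and-match steps above.
_PUNCTUATION = re.compile(r'[^\w ]')
_SPACE_RUNS = re.compile(r'\W+')
_UNTIL_END_WORD = re.compile(r'(.+?)(?:dead|died|death)')

def words_after_john(text):
    """Return one list of words per 'john' part that reaches dead/died/death."""
    cleaned = _SPACE_RUNS.sub(' ', _PUNCTUATION.sub('', text)).strip()
    results = []
    for part in cleaned.split('john'):
        m = _UNTIL_END_WORD.match(part)
        if m:
            # split() discards the leading and trailing spaces of the capture.
            results.append(m.group(1).split())
    return results
```

Calling `words_after_john(x)` on each input returns the word lists, and `len()` of each list gives the count.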

You can use this regex with a negative lookahead:

>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print(i.strip())
...     print(len(i.split()))
...

got shot
2
got killed or
3
with his wife
3

Instead of your .*?, this regex uses (?:(?!john).)*?, which lazily matches zero or more of any character, but only as long as john does not appear inside the match.
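To make the difference concrete, here is a small comparison of the two patterns on a shortened, already cleaned version of the sample string. The variable names `naive` and `tempered` are only illustrative.

```python
import re

x = 'john got shot dead john with his john got killed or died'

# Plain lazy match: the second john's match runs across the third john.
naive = [s.strip() for s in
         re.findall(r'(?<=john)(.*?)(?=dead|died|death)', x)]

# Each consumed character must not be the start of another "john".
tempered = [s.strip() for s in
            re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x)]

print(naive)     # ['got shot', 'with his john got killed or']
print(tempered)  # ['got shot', 'got killed or']
```

With the negative lookahead, the attempt that starts after the second john fails as soon as it reaches the third john, so the search restarts from there.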

I would also suggest using word boundaries so that it matches whole words only:

re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
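As a rough illustration of why the word boundaries matter, consider an input containing words such as johnny and deadline. The input string here is invented for the example and does not come from the question.

```python
import re

x = 'johnny met a deadline. john got shot dead'

# Without boundaries, "john" inside "johnny" and "dead" inside "deadline" both count.
loose = [s.strip() for s in
         re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x)]

# With boundaries, only the standalone words john and dead are accepted.
strict = [s.strip() for s in
          re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)]

print(loose)   # ['ny met a', 'got shot']
print(strict)  # ['got shot']
```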
