
How to match all words with regex, except URLs or similar?

I'm trying to match all words in a string, except for tokens that have punctuation inside them, such as URLs.

I've tried many variations, but when a pattern works for the second string, it is wrong for the first.

import re

s1 = "My dog is nice! My cat not. www.test.org ?"
s2 = "I am."
regex = r"\b\w+\W* \b"
m1 = re.findall(regex, s1)
m2 = re.findall(regex, s2)

The output for m1 is correct:

['My ', 'dog ', 'is ', 'nice! ', 'My ', 'cat ', 'not. ']

The output for m2 is not what I want:

['I ']

... but I want:

['I ', 'am.']

You need an additional alternative in the pattern:

regex = r"\b\w+\W* \b|\b\w+\W$"

The second alternative matches the case at the end of the string, where the final word and its punctuation mark are not followed by a space.
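To make the two alternatives easier to read, here is a sketch of the same pattern written with re.VERBOSE. The name WORD_PATTERN is only illustrative, and the literal space is written as [ ] because verbose mode ignores unescaped whitespace.

```python
import re

# Same pattern as in the answer, split over several lines with comments.
# WORD_PATTERN is an illustrative name, not part of the original answer.
WORD_PATTERN = re.compile(
    r"""
    \b\w+\W*[ ]\b   # a word, optional punctuation, one space, then the start of the next word
    |               # or
    \b\w+\W$        # a word plus exactly one non-word character at the very end of the string
    """,
    re.VERBOSE,
)

print(WORD_PATTERN.findall("I am."))  # ['I ', 'am.']
```

The first branch is what rejects "www.test.org": after "www." the pattern needs a space, but the next character is "t".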

Working code:

import re

s1 = "My dog is nice! My cat not. www.test.org ?"
s2 = "I am."

regex = r"\b\w+\W* \b|\b\w+\W$"

m1 = re.findall(regex, s1)
m2 = re.findall(regex, s2)

print(m1) # ['My ', 'dog ', 'is ', 'nice! ', 'My ', 'cat ', 'not. ']
print(m2) # ['I ', 'am.']
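Note that this pattern depends on what surrounds each word. A final word with no trailing punctuation (as in "I am") is not matched by either alternative, and a URL at the very end of the string followed by a single punctuation mark (as in "Visit www.test.org.") still yields 'org.'. If the trailing spaces in the output are not required, one possible alternative is to split on whitespace first and keep only tokens that consist of a word followed by optional punctuation. This is a sketch under that assumption, not the method from the answer above; plain_words and TOKEN are hypothetical names.

```python
import re

# Hypothetical helper, not from the original answer.
# A token qualifies if it is word characters followed only by punctuation,
# so anything with punctuation *inside* it (URLs, "don't") is dropped.
TOKEN = re.compile(r"\w+[^\w\s]*")

def plain_words(text):
    """Return whitespace-separated tokens that are a word plus optional trailing punctuation."""
    return [tok for tok in text.split() if TOKEN.fullmatch(tok)]

print(plain_words("My dog is nice! My cat not. www.test.org ?"))
# ['My', 'dog', 'is', 'nice!', 'My', 'cat', 'not.']
print(plain_words("I am."))                # ['I', 'am.']
print(plain_words("Visit www.test.org."))  # ['Visit']
```

Unlike the regex in the answer, the returned tokens do not carry the trailing space, so this only fits if that formatting is not needed.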


 