How to match all words with regex, except URLs or similar?

Question

I'm trying to match all words in a string, except for tokens that have punctuation inside them, such as URLs.

I've tried many variations, but whenever a pattern works for the second string, it gives the wrong result for the first.

import re

s1 = "My dog is nice! My cat not. www.test.org ?"
s2 = "I am."
regex = r"\b\w+\W* \b"
m1 = re.findall(regex, s1)
m2 = re.findall(regex, s2)

Output for m1 is right:

['My ', 'dog ', 'is ', 'nice! ', 'My ', 'cat ', 'not. ']

Output for m2 is not what I want:

['I ']

... but I want

['I ', 'am.']

Answer 1

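Your pattern requires a space after every word, so a word at the very end of the string never matches. Add a second alternative for that case:

regex = r"\b\w+\W* \b|\b\w+\W$"

The second alternative, \b\w+\W$, matches a word followed by a single non-word character at the end of the string, which covers the case where no space follows the dot.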

Working code:

import re

s1 = "My dog is nice! My cat not. www.test.org ?"
s2 = "I am."

regex = r"\b\w+\W* \b|\b\w+\W$"

m1 = re.findall(regex, s1)
m2 = re.findall(regex, s2)

print(m1) # ['My ', 'dog ', 'is ', 'nice! ', 'My ', 'cat ', 'not. ']
print(m2) # ['I ', 'am.']
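One caveat: the pattern above can still pick up the last segment of a URL when more words follow it (for example, 'org ' in "see www.test.org now"). A possible alternative, sketched below, treats each whitespace-delimited token as a unit and keeps it only if it is a run of word characters with optional trailing punctuation. The names PLAIN_WORD and plain_words are made up for this illustration, and note that this version does not include the trailing space in each match, which is an assumption about what you actually need.

```python
import re

# Hypothetical pattern (not from the original answer):
#   (?<!\S)    the token starts at the string start or right after whitespace
#   \w+        one or more word characters
#   [^\w\s]*   optional trailing punctuation such as "!" or "."
#   (?!\S)     the token ends at the string end or right before whitespace
PLAIN_WORD = re.compile(r"(?<!\S)\w+[^\w\s]*(?!\S)")

def plain_words(text):
    """Return whitespace-delimited tokens with no punctuation inside them."""
    return PLAIN_WORD.findall(text)

print(plain_words("My dog is nice! My cat not. www.test.org ?"))
# ['My', 'dog', 'is', 'nice!', 'My', 'cat', 'not.']
print(plain_words("I am."))
# ['I', 'am.']
print(plain_words("see www.test.org now"))
# ['see', 'now']
```

Because the lookarounds anchor each match to whitespace on both sides, a token like www.test.org is rejected as a whole instead of being split at its dots.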
