Regex to split words in Python

I am designing a regex to extract all the actual words from a given text:

Input Example:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"

Expected Output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]

I thought of a regex like this:


After splitting in Python, the result contains None items and empty spaces.

How do I get rid of the None items? And why didn't the spaces match?

Splitting on spaces gives items like: ["there."]
Splitting on non-letters gives items like: ["John", "s"]
Splitting on non-letters except ' gives items like: ["'Where", "you'"]
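
These three outcomes are easy to reproduce. The following is a minimal sketch using the question's input; the variable names are only illustrative.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# Splitting on whitespace leaves punctuation glued to the words.
by_spaces = s.split()

# Splitting on runs of non-letters breaks contractions and possessives apart.
by_non_letters = re.split(r"[^a-zA-Z]+", s)

# Keeping the apostrophe out of the delimiter set preserves John's,
# but the quotation marks around the quoted sentence survive too.
by_non_letters_except_apostrophe = re.split(r"[^a-zA-Z']+", s)

print(by_spaces)
print(by_non_letters)
print(by_non_letters_except_apostrophe)
```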

Instead of a regex, you can use string functions:

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')

BUT, in your example you do not want to remove the apostrophe in John's, yet you do want to remove it in you!!' . String operations fail at that point, and you need a finely tuned regex.
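
A quick check of the loop above shows the limitation. This sketch simply repeats it and splits the result on whitespace (the name words is illustrative):

```python
to_be_removed = ".,:!"  # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, "")

# The apostrophe in John's is kept, but so are the quotes around the sentence.
words = s.split()
print(words)
```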

EDIT: a simple regex can probably solve your problem:


It starts capturing at a letter and keeps capturing while the next character is an apostrophe or a letter.
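
The pattern itself is not shown here, so as an assumption, a regex matching that description could look like \w[\w']* . A minimal sketch on the question's input:

```python
import re

# Assumed pattern: one word character, then any run of word characters
# or apostrophes. Note that \w also matches digits and underscores.
rgx = re.compile(r"\w[\w']*")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

words = rgx.findall(s)
print(words)
```

With this assumed pattern the opening quote before Where is skipped, but the closing quote after you is kept, so the last item is you' .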


This second regex is for a very specific situation. The first regex can capture words like you' . This one avoids that and captures an apostrophe only if it is inside the word (not at the beginning or the end). The cost is that the second regex cannot capture the apostrophe in Moss' mom. You must decide whether to capture a trailing apostrophe in names that end with s and indicate possession.


import re

rgx = re.compile(r"([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
print(rgx.findall(s))

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']
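
The trade-off with possessives such as Moss' can be checked with the same pattern. A small sketch (the test sentence is made up for illustration):

```python
import re

# Apostrophes are kept only when they sit between two word characters.
rgx = re.compile(r"\w[\w']*\w")

words = rgx.findall("Moss' mom said 'you'")
print(words)  # ['Moss', 'mom', 'said', 'you']
```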

UPDATE 2: I found a bug in my regex! It cannot capture single letters followed by an apostrophe, like A' . Here is the fixed regex:


import re

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
print(rgx.findall(s))

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']

You have too many capturing groups in your regular expression; make them non-capturing:



>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']

That leaves only one empty element, at the end, because the text ends with a delimiter.
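
Some background on why the None items appear: re.split also returns the text of every capturing group, and a group that did not take part in a match shows up as None. The sketch below uses a deliberately small pattern and input (both made up for illustration), then drops empty strings with a list comprehension.

```python
import re

text = "a,b;c!"

# Capturing groups: delimiters and None placeholders end up in the result.
with_groups = re.split(r"(,)|(;)|(!)", text)

# Non-capturing groups: only the pieces between delimiters are returned.
without_groups = re.split(r"(?:,)|(?:;)|(?:!)", text)

# Drop empty strings, such as the one left by a delimiter at the very end.
words = [w for w in without_groups if w]

print(with_groups)
print(without_groups)
print(words)
```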

This regex allows only one trailing apostrophe, which may be followed by one more character:



>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile(r"([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]

I am new to Python, but I think I have figured it out:

import re
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
result = re.findall(r"(.+?)[\s'\",!]{1,}", s)

print(result)
# ['John', 's', 'mom', 'went', 'there', 'but', 'he', 'wasn', 't', 'there.', 'So', 'she', 'said:', 'Where', 'are', 'you']

