Python regex splitting on multiple whitespaces

Question

I am expecting a user input string which I need to split into separate words. The user may input text delimited by commas or spaces.

So for instance the text may be:

hello world this is John . or

hello world this is John or even

hello world, this, is John

How can I efficiently parse that text into the following list?

['hello', 'world', 'this', 'is', 'John']

Thanks in advance.

Answer 1

Use the regular expression r'[\s,]+' to split on one or more whitespace characters (\s) or commas (,).

import re

s = 'hello world,    this, is       John'
print(re.split(r'[\s,]+', s))

['hello', 'world', 'this', 'is', 'John']
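
One edge case, sketched here under the assumption that the input may begin or end with a delimiter: re.split then returns empty strings at the edges. Matching the words with re.findall, instead of splitting on the gaps, avoids that. The name split_words is hypothetical and used only for illustration.

```python
import re

def split_words(text):
    # Collect runs of characters that are neither whitespace nor commas.
    return re.findall(r'[^\s,]+', text)

padded = '  hello world,    this, is       John, '
print(re.split(r'[\s,]+', padded))  # empty strings appear at both ends
print(split_words(padded))          # only the words themselves
```

If empty strings are acceptable or the input is already trimmed, the plain re.split call is enough.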

Answer 2

Since you need to split based on spaces and other special characters, the best RegEx would be \W+. Quoting from the Python re documentation:

\W

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

For example:

import re

data = "hello world,    this, is       John"
print(re.split(r"\W+", data))
# ['hello', 'world', 'this', 'is', 'John']

Or, if you have the list of special characters by which the string has to be split, you can do

print(re.split(r"[\s,]+", data))

This splits based on any whitespace character (\s) and comma (,).
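
The two patterns behave differently once the words themselves contain punctuation. A small sketch, assuming input with an apostrophe and a hyphen (the sample text is made up for illustration):

```python
import re

text = "John's e-mail, please"

# \W+ treats the apostrophe and the hyphen as delimiters too.
by_non_word = re.split(r"\W+", text)

# [\s,]+ splits only on whitespace and commas.
by_space_comma = re.split(r"[\s,]+", text)

print(by_non_word)     # ['John', 's', 'e', 'mail', 'please']
print(by_space_comma)  # ["John's", 'e-mail', 'please']
```

So \W+ is convenient when every non-word character should separate tokens, while the explicit character class is safer when only spaces and commas are meant as delimiters.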

Answer 3

>>> s = "hello      world this     is            John"
>>> s.split()
['hello', 'world', 'this', 'is', 'John']
>>> s = "hello world, this, is John"
>>> s.split()
['hello', 'world,', 'this,', 'is', 'John']

The first one is correctly parsed by split with no arguments ;)

For the second one, you can strip the trailing comma from each piece:

>>> s = "hello world, this, is John"
>>> def notcoma(ss):
...     # Drop a single trailing comma, if present.
...     if ss[-1] == ',':
...         return ss[:-1]
...     else:
...         return ss
... 
>>> list(map(notcoma, s.split()))
['hello', 'world', 'this', 'is', 'John']
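
A shorter variant of the same no-regex idea, as a sketch: replace every comma with a space first, then let split() collapse the whitespace. This also copes with commas that have no space after them, which the trailing-comma approach does not. The name split_without_regex is hypothetical.

```python
def split_without_regex(text):
    # Turn every comma into a space, then split on runs of whitespace.
    return text.replace(',', ' ').split()

print(split_without_regex("hello world, this, is John"))
print(split_without_regex("hello,world ,this"))
```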
