简体   繁体   中英

Python regex splitting on multiple whitespaces

I am expecting a user input string which I need to split into separate words. The user may input text delimited by commas or spaces.

So for instance the text may be:

hello world this is John . or

hello world this is John or even

hello world, this, is John

How can I efficiently parse that text into the following list?

['hello', 'world', 'this', 'is', 'John']

Thanks in advance.

Use the regular expression: r'[\\s,]+' to split on 1 or more white-space characters ( \\s ) or commas ( , ).

import re

s = 'hello world,    this, is       John'
print re.split(r'[\s,]+', s)

['hello', 'world', 'this', 'is', 'John']

Since you need to split based on spaces and other special characters, the best RegEx would be \\W+ . Quoting from Python re documentation

\\W

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_] . With LOCALE , it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

For Example,

data = "hello world,    this, is       John"
import re
print re.split("\W+", data)
# ['hello', 'world', 'this', 'is', 'John']

Or, if you have the list of special characters by which the string has to be split, you can do

print re.split("[\s,]+", data)

This splits based on any whitespace character ( \\s ) and comma ( , ).

>>> s = "hello      world this     is            John"
>>> s.split()
['hello', 'world', 'this', 'is', 'John']
>>> s = "hello world, this, is John"
>>> s.split()
['hello', 'world,', 'this,', 'is', 'John']

The first one is correctly parsed by split with no arguments ;)

Then you can :

>>> s = "hello world, this, is John"
>>> def notcoma(ss) :
...     if ss[-1] == ',' :
...             return ss[:-1]
...     else :
...             return ss
... 
>>> map(notcoma, s.split())
['hello', 'world', 'this', 'is', 'John']

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM