
Splitting using regex is giving unwanted empty strings

In Python, I am executing this:

>>> re.split("(hello|world|-)", 'hello-world')


I am expecting this:
['hello', '-', 'world']

However, I am getting this:
['', 'hello', '', '-', '', 'world', '']

Where are these '' entries coming from?

I am using Python 3, in case it matters.


Edit

Many of you are saying I could split on - instead. However, I want to extract tokens. For example, given "hellohello---worldhello", I want it to return:

['hello', 'hello', '-', '-', '-', 'world', 'hello']

According to the documentation:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
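As a minimal sketch of the rule quoted above, the following reuses the pattern from the question and compares it with a separator that matches only in the middle of the string. The variable names are purely illustrative.

```python
import re

pattern = "(hello|world|-)"

# The separator matches at position 0 and again at the very end,
# so the result both starts and ends with an empty string.
parts = re.split(pattern, "hello-world")
print(parts)  # ['', 'hello', '', '-', '', 'world', '']

# When the separator matches only in the middle, no empty strings appear.
middle_only = re.split("(-)", "hello-world")
print(middle_only)  # ['hello', '-', 'world']
```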

If the empty strings are a problem, you can remove them with filter. In Python 3, filter returns an iterator, so wrap it in list to get the result shown:

>>> list(filter(None, re.split('(hello|world|-)', 'hellohello---worldhello')))
['hello', 'hello', '-', '-', '-', 'world', 'hello']

Or use findall to extract the matches directly:

>>> re.findall('(hello|world|-)', 'hellohello---worldhello')
['hello', 'hello', '-', '-', '-', 'world', 'hello']
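One caveat with findall is that it silently skips any text the pattern does not match. If the goal is strict tokenizing, a possible sketch is to walk the string with Pattern.match and fail on anything unexpected. The names TOKEN and tokenize are hypothetical helpers for illustration, not part of the re module.

```python
import re

# Hypothetical token pattern, taken from the question.
TOKEN = re.compile("hello|world|-")

def tokenize(text):
    """Return the tokens in text, raising ValueError on anything else."""
    tokens = []
    pos = 0
    while pos < len(text):
        # match() anchors at pos, so unmatched text cannot be skipped.
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected text at position {pos}: {text[pos:]!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens

print(tokenize("hellohello---worldhello"))
# ['hello', 'hello', '-', '-', '-', 'world', 'hello']
```

Whether raising an error is the right reaction depends on the input; findall remains the simpler choice when stray characters can safely be ignored.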

The extra elements appear because you are asking re to split the string on, for example, hello, so it reports what comes before hello, what lies between hello and '-', and so on. In 'hello-world' every one of those pieces is an empty string.
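To make that alternation visible, here is a small sketch that separates the split result into the text between separators and the captured separators themselves. It assumes a pattern with exactly one capturing group, as in the question.

```python
import re

parts = re.split("(hello|world|-)", "hello-world")

# With one capturing group, the result alternates:
# text, separator, text, separator, ..., text
between = parts[0::2]     # text found between separators
separators = parts[1::2]  # the captured separators themselves

print(between)     # ['', '', '', '']
print(separators)  # ['hello', '-', 'world']
```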

If you split only on the hyphen instead:

re.split("(-)", 'hello-world')

you will get the desired result:

['hello', '-', 'world']


 