
the output of re.split in python doesn't make sense to me

import re

print(re.split(r'[a-fA-F]', 'finqwenlaskdjriewasFSDFddsafsafasa', re.I|re.M))
print(re.split(r'[a-fA-Z]', 'finqwenlaskdjriewasFSDFddsafsafasa', re.I|re.M))
print(re.split(r'\d*', 'sdfsfdsfds123212fdsf2'))
print(re.split(r'\d', 'sdfsfdsfds123212fdsf2'))
print(re.split(r'\w+', 'dsfsf sdfdsf sdfsdf sfsfd'))

['', 'inqw', 'nl', 'sk', 'jri', 'w', 's', 'S', '', '', 'dsafsafasa']
['', 'inqw', 'nl', 'sk', 'jri', 'w', 's', '', '', '', 'ddsafsafasa']
['sdfsfdsfds', 'fdsf', '']
['sdfsfdsfds', '', '', '', '', '', 'fdsf', '']
['', ' ', ' ', ' ', '']

I think the output here is really weird. The pattern that splits the string is sometimes turned into '' in the output list, but disappears at other times.

The pattern that splits the string is sometimes turned into '' in the output list, but disappears at other times.

No, the pattern (or what it matched) is never included in your outputs there. Those '' are what lies between the matches, because that is what re.split returns. Take your example:

>>> re.split(r'\d','sdfsfdsfds123212fdsf2')
['sdfsfdsfds', '', '', '', '', '', 'fdsf', '']

You're splitting on single digits, and the substring '123212' has six of them, so there are five empty strings between adjacent digits. The final '' comes from the trailing '2': nothing follows it, so the piece after it is empty.
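As a minimal sketch of that counting argument (the variable names are only illustrative), one can compare the separators found by re.findall with the pieces returned by re.split. For a pattern like this, n matches give n + 1 pieces:

```python
import re

text = 'sdfsfdsfds123212fdsf2'

matches = re.findall(r'\d', text)  # the separators that were matched
pieces = re.split(r'\d', text)     # what lies before, between and after them

print(matches)  # ['1', '2', '3', '2', '1', '2', '2']
print(pieces)   # ['sdfsfdsfds', '', '', '', '', '', 'fdsf', '']
print(len(pieces) == len(matches) + 1)  # True
```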

First of all, the third positional argument of re.split is maxsplit, not flags, so you're essentially providing maxsplit=10 (the integer value of re.I|re.M) instead of flags=re.I|re.M.
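A short sketch of the difference, using the first pattern and string from the question. The names limited and unlimited are only illustrative, and maxsplit=10 is written by keyword here as a stand-in for the positional call in the question:

```python
import re

text = 'finqwenlaskdjriewasFSDFddsafsafasa'

# The flag constants are integers, which is why they are accepted as maxsplit.
print(int(re.I | re.M))  # 10

# What the question's call amounts to: stop after 10 splits.
limited = re.split(r'[a-fA-F]', text, maxsplit=10)
print(limited)
# ['', 'inqw', 'nl', 'sk', 'jri', 'w', 's', 'S', '', '', 'dsafsafasa']

# Passing the flags by keyword applies them as flags and splits everywhere.
unlimited = re.split(r'[a-fA-F]', text, flags=re.I | re.M)
print(len(limited), len(unlimited))
```

The unsplit tail 'dsafsafasa' in the first result is simply the remainder of the string after the tenth split.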

Secondly, the separators are not turned into ''; instead, that is the string between the separators:

>>> re.split(r':', 'foo:bar::baz:')
['foo', 'bar', '', 'baz', '']

Notice the '' between 2 separators, and also at the end.

The separators themselves are not in the result, unless your regular expression contains capturing groups ( (...) ):

>>> re.split(r'(:)', 'foo:bar::baz:')
['foo', ':', 'bar', ':', '', ':', 'baz', ':', '']

Third: even though r'\d*' would ordinarily match at the beginning of the string, at the end of the string, and between each character, re.split considers only non-zero-length matches up to Python 3.6, so there the pattern behaves like r'\d+'. Python 3.5 and 3.6 emit the warning FutureWarning: split() requires a non-empty pattern match. for such patterns, and since Python 3.7 re.split also splits on empty matches, so the two patterns no longer give the same result.
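Because the handling of patterns that can match the empty string depends on the Python version, one option is to write the non-empty pattern explicitly. A minimal sketch, assuming the intent was to split on runs of digits (runs and non_empty are illustrative names):

```python
import re

text = 'sdfsfdsfds123212fdsf2'

# \d+ treats each run of digits as a single separator.
runs = re.split(r'\d+', text)
print(runs)  # ['sdfsfdsfds', 'fdsf', '']

# Dropping the empty pieces afterwards, if they are not wanted.
non_empty = [piece for piece in runs if piece]
print(non_empty)  # ['sdfsfdsfds', 'fdsf']
```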

The output isn't weird, it's intentional. From the docs:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

That way, separator components are always found at the same relative indices within the result list.

Emphasis added to point out why this is done. The same applies to "empty" sequences inside the string and to non-capturing separators. Basically, there is always content before and after a separator, even if the separator is not captured and that content is empty. The similar method str.split does the same.
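A small sketch of that parallel, using a made-up input string alongside the \w+ example from the question (the variable names are illustrative):

```python
import re

# A separator at either end leaves an empty string on that side,
# and two adjacent separators leave an empty string between them.
str_pieces = ':a::b:'.split(':')
re_pieces = re.split(r':', ':a::b:')
print(str_pieces)  # ['', 'a', '', 'b', '']
print(re_pieces)   # ['', 'a', '', 'b', '']

# In the question's \w+ example the words are the separators,
# so what remains is the text around them.
word_pieces = re.split(r'\w+', 'dsfsf sdfdsf sdfsdf sfsfd')
print(word_pieces)  # ['', ' ', ' ', ' ', '']
```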

This allows you to always reconstruct the initial string if you know the separator. Capturing the separator and joining is equivalent to inserting the separator when joining:

''.join(re.split('(%s)' % sep, ':::words::words:::')) == sep.join(re.split('%s' % sep, ':::words::words:::'))
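A runnable sketch of that round trip. The value sep = ':' is an assumption, since the expression leaves sep undefined, and re.escape is added here in case the separator contains regex metacharacters:

```python
import re

sep = ':'                 # assumed separator, not given in the original
text = ':::words::words:::'
pattern = re.escape(sep)  # guard against separators with regex metacharacters

with_sep = re.split('(%s)' % pattern, text)  # separators kept by the capturing group
without_sep = re.split(pattern, text)        # separators dropped

print(''.join(with_sep) == text)      # True
print(sep.join(without_sep) == text)  # True
```

Both joins give back the original string only because re.split keeps the empty pieces at the ends and between adjacent separators.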
