
Python Regex Split Keeps Split Pattern Characters

The easiest way to explain this is with an example. I have this string: 'Docs/src/Scripts/temp', which I know how to split in two different ways:

re.split('/', 'Docs/src/Scripts/temp') -> ['Docs', 'src', 'Scripts', 'temp']

re.split('(/)', 'Docs/src/Scripts/temp') -> ['Docs', '/', 'src', '/', 'Scripts', '/', 'temp']

Is there a way to split on the forward slash but keep the slashes as part of the words? For example, I want the above string to look like this:

['Docs/', '/src/', '/Scripts/', '/temp']

Any help would be appreciated!

Interesting question. I would suggest doing something like this:

>>> 'Docs/src/Scripts/temp'.replace('/', '/\x00/').split('\x00')
['Docs/', '/src/', '/Scripts/', '/temp']

The idea here is to first replace every / character with two / characters separated by a special character that cannot occur in the original string, and then split on that special character. I used a null byte ('\x00'), but you could use something else.
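The trick only works if the special character really is absent from the input, so it may be worth checking for that. A minimal sketch, assuming a helper named split_keep_slashes with a sentinel parameter (both names are illustrative, not part of the answer above):

```python
def split_keep_slashes(s, sentinel='\x00'):
    # Hypothetical helper: the sentinel must not already occur in s,
    # otherwise the final split would cut in the wrong places.
    if sentinel in s:
        raise ValueError('sentinel character occurs in the input string')
    # Turn every '/' into '/<sentinel>/' and then split on the sentinel.
    return s.replace('/', '/' + sentinel + '/').split(sentinel)


print(split_keep_slashes('Docs/src/Scripts/temp'))
# ['Docs/', '/src/', '/Scripts/', '/temp']
```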

Regex isn't actually great here: before Python 3.7, re.split() could not split on zero-length matches, and re.findall() does not find overlapping matches, so you would potentially need to do several passes over the string.

Also, re.split('/', s) does the same thing as s.split('/'), but the second is more efficient.

A solution without split() but with lookaheads:

>>> import re
>>> s = 'Docs/src/Scripts/temp'
>>> r = re.compile(r"(?=((?:^|/)[^/]*/?))")
>>> r.findall(s)
['Docs/', '/src/', '/Scripts/', '/temp']

Explanation:

(?=        # Assert that it's possible to match...
 (         # and capture...
  (?:^|/)  #  the start of the string or a slash
  [^/]*    #  any number of non-slash characters
  /?       #  and (optionally) an ending slash.
 )         # End of capturing group
)          # End of lookahead

Since a lookahead assertion is tried at every position in the string and doesn't consume any characters, it doesn't have a problem with overlapping matches.
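To see why the lookahead matters, it may help to run the same inner pattern with and without it. This is only a small comparison sketch; the variable names consuming and overlapping are made up for illustration:

```python
import re

s = 'Docs/src/Scripts/temp'

# Without the lookahead, each match consumes its trailing slash, so the
# next component can no longer start with that slash and is skipped.
consuming = re.findall(r"(?:^|/)[^/]*/?", s)

# With the lookahead, nothing is consumed; group 1 is captured at every
# position where a component starts.
overlapping = re.findall(r"(?=((?:^|/)[^/]*/?))", s)

print(consuming)    # ['Docs/', '/Scripts/']
print(overlapping)  # ['Docs/', '/src/', '/Scripts/', '/temp']
```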

1) You do not need regular expressions to split on a single fixed character:

>>> 'Docs/src/Scripts/temp'.split('/')

['Docs', 'src', 'Scripts', 'temp']

2) Consider using this method:

import os.path

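def components(path):
    # Yield each path component together with its adjacent separators.
    start = 0
    for end, c in enumerate(path):
        if c == os.path.sep:
            # Include the separator at index `end` in this piece...
            yield path[start:end+1]
            # ...and start the next piece at that same separator.
            start = end
    # Whatever remains after the last separator (with its leading separator).
    yield path[start:]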

It doesn't rely on clever tricks like split-join-splitting, which makes it much more readable, in my opinion.
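One caveat: os.path.sep is '\\' on Windows, so the generator above would not split a '/'-separated string there. A possible variant, assuming you would rather pass the separator explicitly (components_sep and its sep parameter are hypothetical names, not from the answer):

```python
def components_sep(path, sep='/'):
    # Hypothetical variant of components() with an explicit separator,
    # so the result does not depend on the platform's os.path.sep.
    start = 0
    for end, c in enumerate(path):
        if c == sep:
            # Keep the separator at the end of this piece...
            yield path[start:end + 1]
            # ...and reuse it as the start of the next piece.
            start = end
    yield path[start:]


print(list(components_sep('Docs/src/Scripts/temp')))
# ['Docs/', '/src/', '/Scripts/', '/temp']
```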

Try this:

re.split(r'(/)', 'Docs/src/Scripts/temp')

From Python's documentation:

re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)
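As a quick illustration of the two behaviours described in that excerpt (capturing groups and maxsplit), here is a short sketch using the string from the question:

```python
import re

s = 'Docs/src/Scripts/temp'

# The capturing group makes re.split() return the separators as well.
with_separators = re.split(r'(/)', s)
print(with_separators)
# ['Docs', '/', 'src', '/', 'Scripts', '/', 'temp']

# maxsplit=1 stops after the first split; the rest stays in one piece.
first_only = re.split(r'(/)', s, maxsplit=1)
print(first_only)
# ['Docs', '/', 'src/Scripts/temp']
```

Note that this returns the slashes as separate items, so it still needs a recombination step to get the output asked for in the question.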

If you don't insist on having slashes on both sides, it's actually quite simple:

>>> re.findall(r"([^/]*/)", 'Docs/src/Scripts/temp')
['Docs/', 'src/', 'Scripts/']

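Neither re nor split is really cut out for overlapping strings, so if that's what you really want, I'd just add a slash to the start of every result except the first. Note that the pattern above also drops the last component ('temp'), because it has no trailing slash.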
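A rough sketch of that suggestion follows. It assumes a slightly different pattern than the one above (the trailing slash is made optional so the last component is kept) and assumes the path has no empty components such as '//':

```python
import re

s = 'Docs/src/Scripts/temp'

# Assumed pattern: one or more non-slash characters plus an optional
# trailing slash, so the last component ('temp') is kept as well.
parts = re.findall(r"[^/]+/?", s)

# Prepend a slash to every piece except the first.
result = parts[:1] + ['/' + p for p in parts[1:]]

print(result)
# ['Docs/', '/src/', '/Scripts/', '/temp']
```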

I'm not sure there is an easy way to do this. This is the best I could come up with...

import re

lSplit = re.split('/', 'Docs/src/Scripts/temp')
# Python 3: print is a function; lSplit[-1] is the last component.
print([lSplit[0]+'/'] + ['/'+x+'/' for x in lSplit][1:-1] + ['/'+lSplit[-1]])

Kind of a mess, but it does do what you wanted.
