
Python partition string with regular expressions

I am trying to clean text strings using Python's partition and regular expressions. For example:

testString = 'Tre Bröders Väg 6 2tr'
sep = '[0-9]tr'
head,sep,tail = testString.partition(sep)
head
>>>'Tre Br\xc3\xb6ders V\xc3\xa4g 6 2tr'

The head still contains the 2tr that I want to remove. I'm not that good with regex, but shouldn't [0-9] do the trick?

The output I would expect from this example is:

head
>>> 'Tre Br\xc3\xb6ders V\xc3\xa4g 6'

str.partition does not support regular expressions. When you pass it a string such as '[0-9]tr', it searches testString for that exact literal text and partitions on it; no pattern matching takes place.

According to the documentation of str.partition:

Split the string at the first occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing the string itself, followed by two empty strings.
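A small sketch of what that means for the string in the question: the pattern text is searched for character by character, so it is never found, while a separator that occurs literally is split on as documented. The name test_string is just an illustrative variable.

```python
test_string = 'Tre Bröders Väg 6 2tr'

# The text '[0-9]tr' does not occur literally, so partition() returns
# the whole string followed by two empty strings.
assert test_string.partition('[0-9]tr') == (test_string, '', '')

# A separator that does occur literally is found and split on.
head, sep, tail = test_string.partition('2tr')
print(repr(head), repr(sep), repr(tail))
```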

Since you only want the head, you can use re.split() from the re module with maxsplit set to 1 and take the first element of the result, which is equivalent to what you were attempting with str.partition. Example:

import re
testString = 'Tre Bröders Väg 6 2tr'
sep = r'[0-9]tr'
# maxsplit is passed by keyword; the positional form is deprecated since Python 3.13
head = re.split(sep, testString, maxsplit=1)[0]

Demo:

>>> import re
>>> testString = 'Tre Bröders Väg 6 2tr'
>>> sep = r'[0-9]tr'
>>> head = re.split(sep, testString, maxsplit=1)[0]
>>> head
'Tre Bröders Väg 6 '
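Note the trailing space in the result. One possible refinement, sketched below, is to let the pattern absorb the whitespace in front of the separator. The helper name head_before and the address 'Storgatan 12' are made up for illustration. When nothing matches, re.split() returns a one-element list, so taking element 0 still yields the whole string, just like the head from str.partition.

```python
import re

def head_before(pattern, text):
    """Return the part of text before the first match of pattern (hypothetical helper)."""
    return re.split(pattern, text, maxsplit=1)[0]

# \s* swallows the whitespace before the separator, so no strip() is needed.
print(repr(head_before(r'\s*[0-9]tr', 'Tre Bröders Väg 6 2tr')))

# No match: the input comes back unchanged.
print(repr(head_before(r'\s*[0-9]tr', 'Storgatan 12')))
```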

Plain re.split() method

You can extract the head by using re.split().

import re

testString = 'Tre Bröders Väg 6 2tr'
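sep = r'[0-9]tr'  # raw string: a good habit for regex patterns, though not strictly required for this one
head, tail = re.split(sep, testString, maxsplit=1)  # maxsplit=1 guarantees at most two parts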
head.strip()
>>>'Tre Bröders Väg 6'

Chocolate sprinkled re.split() method

If you wrap sep in a capturing group (), re.split() also returns the matched separator, so it behaves like a pseudo re.partition() (no such function actually exists in Python).

import re

testString = 'Tre Bröders Väg 6 2tr'
sep = r'([0-9]tr)'  # "()" added.
head, sep, tail = re.split(sep, testString, maxsplit=1)
head, sep, tail
>>>('Tre Bröders Väg 6 ', '2tr', '')
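Unpacking into exactly three names relies on the separator matching exactly once. The sketch below uses a made-up input with two matches to show how the result grows without a split limit, and how maxsplit=1 brings it back to three parts. The helper partition_by_split is hypothetical and assumes the pattern is a single capturing group around the whole separator.

```python
import re

sep = r'([0-9]tr)'
text = 'Väg 6 2tr 3tr'  # illustrative input containing two separators

all_parts = re.split(sep, text)               # every match splits the string
first_only = re.split(sep, text, maxsplit=1)  # only the first match splits it
print(all_parts)
print(first_only)

def partition_by_split(pattern, text):
    """Mimic str.partition with re.split (assumes one group wrapping the pattern)."""
    parts = re.split(pattern, text, maxsplit=1)
    # Without a match, re.split returns [text]; pad to the 3-tuple shape.
    return tuple(parts) if len(parts) == 3 else (text, '', '')

print(partition_by_split(sep, 'Storgatan 12'))
```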

For those still looking for an answer for how to do a regex partition, try the following function:

import regex  # third-party module; the standard-library re also works

def regex_partition(content, separator):
    separator_match = regex.search(separator, content)
    if not separator_match:
        return content, '', ''

    matched_separator = separator_match.group(0)
    parts = regex.split(matched_separator, content, 1)

    return parts[0], matched_separator, parts[1]

I arrived here searching for a way to use a regex-based partition().

As yelichi's answer shows, re.split() can return the separator if the pattern contains a capturing group, so the most basic way of building a regex-based partition would be:

re.split("(%s)" % sep, testString, maxsplit=1)

However, this only works for simple patterns. If the pattern contains capturing groups of its own, re.split() returns their text as well, so the result no longer has exactly three elements.
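A short sketch of that limitation, using the question's string and a variant of its pattern with one extra capturing group (the variable names are illustrative):

```python
import re

test_string = 'Tre Bröders Väg 6 2tr'

# Pattern without groups of its own: head, separator, tail.
simple = re.split('(%s)' % r'[0-9]tr', test_string, maxsplit=1)

# Pattern with an inner capturing group: its text is inserted as an
# additional element, so three-way unpacking would raise ValueError.
grouped = re.split('(%s)' % r'([0-9])tr', test_string, maxsplit=1)

print(len(simple), simple)
print(len(grouped), grouped)
```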

I first looked at the function in skia.heliou's answer, but it needlessly runs the regex a second time and, more importantly, fails when the matched text, reused as a pattern, does not match itself (it should call str.split on matched_separator, not re.split).
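To make the failure concrete, here is a sketch that copies that function (switched to the standard-library re module) and feeds it a made-up input whose matched text contains regex metacharacters:

```python
import re

def regex_partition(content, separator):
    """Copy of the function above, using the standard-library re module."""
    separator_match = re.search(separator, content)
    if not separator_match:
        return content, '', ''
    matched_separator = separator_match.group(0)
    # The matched text is compiled as a new pattern here.
    parts = re.split(matched_separator, content, maxsplit=1)
    return parts[0], matched_separator, parts[1]

text = 'price (USD) 5'
pattern = r'\(\w+\)'  # matches the literal text '(USD)'

# '(USD)' read as a regex is a capturing group around 'USD', so the
# second split happens in the wrong place.
wrong = regex_partition(text, pattern)

# Partitioning on the matched text as a plain string gives the intended result.
matched = re.search(pattern, text).group(0)
right = text.partition(matched)

print(wrong)
print(right)
```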

Thus I implemented my own version of a regex-supporting partition():

import re

def re_partition(pattern, string, return_match=False):
    '''Function akin to str.partition() but supporting a regex
    :param pattern: regex used to partition the string
    :param string: string being partitioned
    :param return_match: if True, return the match object instead of the matched text
    '''

    match = re.search(pattern, string)

    if not match:
        return string, '', ''

    return string[:match.start()], match if return_match else match.group(0), string[match.end():]

As an additional feature this can return the match object itself rather than only the matched string. This allows you to directly interact with the groups of the separator.
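A usage sketch with return_match=True. The function is repeated in condensed form so the example runs on its own, and the pattern with two groups and the variable floor are made up for illustration. When nothing matches, the middle element is still an empty string, so a truthiness check distinguishes the two cases.

```python
import re

def re_partition(pattern, string, return_match=False):
    """Condensed copy of the function above."""
    match = re.search(pattern, string)
    if not match:
        return string, '', ''
    sep = match if return_match else match.group(0)
    return string[:match.start()], sep, string[match.end():]

head, sep, tail = re_partition(r'(\d+)(tr)', 'Tre Bröders Väg 6 2tr', return_match=True)
if sep:  # a match object is always truthy; '' signals "no match"
    floor = int(sep.group(1))   # first group of the separator
    print(repr(head), sep.group(0), floor, repr(tail))

# Without a match the input comes back as the head.
print(re_partition(r'(\d+)(tr)', 'Storgatan 12', return_match=True))
```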

And in iterator form:

def re_partition_iter(pattern, string, return_match=False):
    '''Returns an iterator of re_partition() output'''

    pos = 0
    pattern = re.compile(pattern)
    while True:
        match = pattern.search(string, pos)
        if not match:
            if pos < len(string):  # remove this line if you prefer to receive an empty string
                yield string[pos:]
            break

        yield string[pos:match.start()]
        yield match if return_match else match.group(0)
        pos = match.end()
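A usage sketch for the iterator form, with the generator repeated so the example is self-contained. The comma-separated input is made up. One caveat worth stating as an assumption of this sketch: the pattern must not be able to match the empty string, because pos would then never advance and the generator would not terminate.

```python
import re

def re_partition_iter(pattern, string, return_match=False):
    """Copy of the generator above."""
    pos = 0
    pattern = re.compile(pattern)
    while True:
        match = pattern.search(string, pos)
        if not match:
            if pos < len(string):
                yield string[pos:]
            break
        yield string[pos:match.start()]
        yield match if return_match else match.group(0)
        pos = match.end()

# Text pieces and separators alternate in the output.
pieces = list(re_partition_iter(r'\s*,\s*', 'a , b,c'))
print(pieces)

# Every second element, starting at 0, is text between separators.
fields = pieces[0::2]
print(fields)
```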
