Python partition string with regular expressions

I am trying to clean text strings using Python's partition and regular expressions. For example:

testString = 'Tre Bröders Väg 6 2tr'
sep = '[0-9]tr'
head,sep,tail = testString.partition(sep)
head
>>>'Tre Br\xc3\xb6ders V\xc3\xa4g 6 2tr'

The head still contains the 2tr that I want to remove. I'm not that good with regex, but shouldn't [0-9] do the trick?

The output I would expect from this example would be:

head
>>> 'Tre Br\xc3\xb6ders V\xc3\xa4g 6'

str.partition does not support regex, so when you give it a string like '[0-9]tr', it looks for that exact literal string in testString to partition on; it does not use any regex.

According to the documentation of str.partition:

Split the string at the first occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing the string itself, followed by two empty strings.
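To make the quoted behaviour concrete, here is a minimal sketch of str.partition with a literal separator. The variable names (test_string, literal_result, found_result) are only illustrative and not part of the question.

```python
test_string = 'Tre Bröders Väg 6 2tr'

# '[0-9]tr' is searched for as literal text, so it is not found and the
# whole string comes back followed by two empty strings.
literal_result = test_string.partition('[0-9]tr')

# A separator that literally occurs in the string splits as documented.
found_result = test_string.partition(' 2tr')
```

The first call reproduces what the question observed: the head is the unchanged input.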

Since you say you just want the head, you can use the re.split() method from the re module with maxsplit set to 1 and take the first element of the result, which should be equivalent to what you were trying with str.partition. Example:

import re
testString = 'Tre Bröders Väg 6 2tr'
sep = '[0-9]tr'
head = re.split(sep,testString,1)[0]

Demo:

>>> import re
>>> testString = 'Tre Bröders Väg 6 2tr'
>>> sep = '[0-9]tr'
>>> head = re.split(sep,testString,1)[0]
>>> head
'Tre Bröders Väg 6 '
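A small variant of the same call, offered as a sketch rather than a replacement: recent Python releases (3.13 onward, to my knowledge) deprecate passing maxsplit positionally, so the keyword form may be the safer spelling. The names test_string and clean_head are illustrative only.

```python
import re

test_string = 'Tre Bröders Väg 6 2tr'
sep = r'[0-9]tr'

# maxsplit is passed by keyword; element 0 is the text before the first match.
head = re.split(sep, test_string, maxsplit=1)[0]

# Optionally drop the space that was left in front of the removed "2tr".
clean_head = head.rstrip()
```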

Plain re.split() method

You can extract the head by using re.split().

import re

testString = 'Tre Bröders Väg 6 2tr'
sep = r'[0-9]tr'  # raw string: not strictly required for this pattern, but a good habit for regexes
head, tail = re.split(sep, testString)
head.strip()
>>>'Tre Bröders Väg 6'

Chocolate sprinkled re.split() method

If you capture sep with (), re.split() behaves like a pseudo re.partition() (there is no such method in Python, actually).

import re

testString = 'Tre Bröders Väg 6 2tr'
sep = r'([0-9]tr)'  # "()" added.
head, sep, tail = re.split(sep, testString)
head, sep, tail
>>>('Tre Bröders Väg 6 ', '2tr', '')

For those still looking for an answer for how to do a regex partition, try the following function:

import regex # re also works

def regex_partition(content, separator):
    separator_match = regex.search(separator, content)
    if not separator_match:
        return content, '', ''

    matched_separator = separator_match.group(0)
    parts = regex.split(matched_separator, content, 1)

    return parts[0], matched_separator, parts[1]

I arrived here searching for a way to use a regex-based partition().

As noted in yelichi's answer, re.split() can return the separator if the pattern contains a capturing group, so the most basic way of creating a partition function based on regex would be:

re.split( "(%s)" % sep, testString, 1)

However, this only works for simple regex. If you are splitting by a regex which uses groups (even if non-capturing), it won't provide the expected results.
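One way this can go wrong, sketched for the capturing-group case only: re.split() returns the text of every capturing group, so a separator that brings its own group produces more than three items. The separator below is a hypothetical example, not one from the question.

```python
import re

test_string = 'Tre Bröders Väg 6 2tr'
sep = r'(\d)tr'  # hypothetical separator with its own capturing group

# The wrapper group and the inner group are both returned by re.split().
parts = re.split("(%s)" % sep, test_string, maxsplit=1)
```

Here parts has four elements, so unpacking it into head, sep, tail would raise a ValueError.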

I first looked at the function provided in skia.heliou's answer, but it needlessly runs the regex a second time and, more importantly, fails if the matched text, reused as a pattern, doesn't match itself (it should use str.split on matched_separator, not re.split).
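A minimal sketch of that failure mode, assuming a separator whose matched text contains regex metacharacters. The input string and pattern are made up for illustration.

```python
import re

content = 'Väg 6 (2tr) left'
pattern = r'\(\dtr\)'  # hypothetical pattern matching a literal "(2tr)"

matched_separator = re.search(pattern, content).group(0)

# Reusing the matched text as a regex: the parentheses now act as a group,
# so the split happens around "2tr" and the brackets stay in the pieces.
as_regex = re.split(matched_separator, content, maxsplit=1)

# Splitting on the matched text literally gives the intended pieces.
as_literal = content.split(matched_separator, 1)
```

Wrapping the matched text in re.escape() before calling re.split() would be another way to get the literal behaviour.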

Thus I implemented my own version of a regex-supporting partition():

import re

def re_partition(pattern, string, return_match=False):
    '''Function akin to partition() but supporting a regex
    :param pattern: regex used to partition the string
    :param string: string being partitioned
    :param return_match: if True, return the match object instead of the matched text
    '''

    match = re.search(pattern, string)

    if not match:
        return string, '', ''

    return string[:match.start()], match if return_match else match.group(0), string[match.end():]

As an additional feature this can return the match object itself rather than only the matched string. This allows you to directly interact with the groups of the separator.
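A short usage sketch of the return_match option. The function is repeated in compact form only so the snippet runs on its own; the pattern with a capturing group and the name floor_number are assumptions made for illustration.

```python
import re

def re_partition(pattern, string, return_match=False):
    # Compact restatement of the function above, so this example is runnable.
    match = re.search(pattern, string)
    if not match:
        return string, '', ''
    separator = match if return_match else match.group(0)
    return string[:match.start()], separator, string[match.end():]

# With return_match=True the middle element is a match object,
# so the separator's groups can be read directly.
head, sep_match, tail = re_partition(r'(\d)tr', 'Tre Bröders Väg 6 2tr',
                                     return_match=True)
floor_number = sep_match.group(1)

# Without a match, the result mirrors str.partition.
no_match = re_partition(r'(\d)tr', 'Tre Bröders Väg 6')
```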

And in iterator form:

def re_partition_iter(pattern, string, return_match=False):
    '''Returns an iterator of re_partition() output'''

    pos = 0
    pattern = re.compile(pattern)
    while True:
        match = pattern.search(string, pos)
        if not match:
            if pos < len(string):  # remove this line if you prefer to receive an empty string
                yield string[pos:]
            break

        yield string[pos:match.start()]
        yield match if return_match else match.group(0)
        pos = match.end()
