
Python regex: splitting on pattern match that is an empty string

With the re module, it seems that I am unable to split on pattern matches that are empty strings:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
['foobarbarbazbar']

In other words, even when a match is found, re.split will not split the string if that match is the empty string.

The docs for re.split seem to support my results.

A "workaround" was easy enough to find for this particular case:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarbazbar').split('qux')
['foobar', 'barbaz', 'bar']

But this is an error-prone way of doing it, because I then have to beware of strings that already contain the placeholder substring I'm splitting on:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarquxbar').split('qux')
['foobar', 'bar', '', 'bar']

Is there any better way to split on an empty pattern match with the re module? Additionally, why does re.split not allow me to do this in the first place? I know it's possible with other split algorithms that work with regex; for example, I am able to do this with JavaScript's built-in String.prototype.split().

It is unfortunate that split requires a non-zero-width match, but it hasn't been fixed yet, since quite a lot of incorrect code depends on the current behaviour by using, for example, [something]* as the regex. From Python 3.5 onwards, patterns that could match an empty string generate a FutureWarning, and patterns that can never split anything raise a ValueError:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/re.py", line 212, in split
    return _compile(pattern, flags).split(string, maxsplit)
ValueError: split() requires a non-empty pattern match.

The idea is that after a certain period of warnings, the behaviour can be changed so that your regular expression would work again.
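For readers on a newer interpreter: according to the re documentation, that change landed in Python 3.7, where re.split gained support for patterns that can match an empty string. A minimal check, assuming Python 3.7 or later (on 3.5 and 3.6 the same call raises the ValueError shown above):

```python
import re

# Assumption: Python 3.7+, where re.split accepts patterns that can
# match the empty string.
parts = re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
print(parts)  # ['foobar', 'barbaz', 'bar']
```

The split points are the two positions where "bar" starts and is not preceded by "foo".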


If you can't use the regex module, you can write your own split function using re.finditer():

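import re

def megasplit(pattern, string):
    # (start, end) span of every match, including zero-width ones
    splits = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
    # each piece starts where the previous match ended...
    starts = [0] + [end for _, end in splits]
    # ...and ends where the next match begins
    ends = [start for start, _ in splits] + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(megasplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
print(megasplit(r'o', 'foobarbarbazbar'))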
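As a quick sanity check of the finditer approach, the sketch below restates the function and compares it with re.split for a pattern that never matches empty. It assumes the pattern has no capture groups; re.split would also return the captured text in that case, which this function does not.

```python
import re

def megasplit(pattern, string):
    # Same idea as above: cut the string at the span of every match.
    spans = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
    starts = [0] + [end for _, end in spans]
    ends = [start for start, _ in spans] + [len(string)]
    return [string[s:e] for s, e in zip(starts, ends)]

zero_width = megasplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
non_empty = megasplit(r'o', 'foobarbarbazbar')

print(zero_width)  # ['foobar', 'barbaz', 'bar']
print(non_empty)   # ['f', '', 'barbarbazbar']

# For a group-free pattern with non-empty matches, this agrees with re.split.
assert non_empty == re.split(r'o', 'foobarbarbazbar')
```

The empty string in the second result comes from the two adjacent "o" matches, which is the same thing re.split reports.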

If you are sure that the matches are zero-width only, you can use just the start positions of the matches for simpler code:

import re

def zerowidthsplit(pattern, string):
    splits = list(m.start() for m in re.finditer(pattern, string))
    starts = [0] + splits
    ends = splits + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))

You can use the third-party regex module for this:

import regex

x = "bazbarbarfoobar"
print(regex.split(r"(?<!baz)(?=bar)", x, flags=regex.VERSION1))

Or use this pattern with re.findall:

(.+?(?<!foo))(?=bar|$)|(.+?foo)$

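Since the pattern has two capture groups, re.findall returns one tuple per match with exactly one group filled in. A minimal sketch of collecting the pieces, where the helper name findall_split is hypothetical and not part of the answer:

```python
import re

PATTERN = r'(.+?(?<!foo))(?=bar|$)|(.+?foo)$'

def findall_split(string):
    # Hypothetical helper: each findall result is a (group1, group2) tuple
    # with one empty member, so keep whichever group actually matched.
    return [first or second for first, second in re.findall(PATTERN, string)]

pieces = findall_split('foobarbarbazbar')
print(pieces)  # ['foobar', 'barbaz', 'bar']
```

The second alternative only comes into play when the remaining text ends in "foo", where the lookbehind in the first alternative can never succeed at the end of the string.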
