Why is regex search in Python "not completely equivalent to slicing the string"?

Question

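As the documentation states, using regex.search(string, pos, endpos) is not completely equivalent to slicing the string, i.e. regex.search(string[pos:endpos]). The regex is not matched as if the string started at pos, so ^ does not match the beginning of the substring, only the real beginning of the whole string. However, $ matches at the end of the substring (at endpos) as well as at the end of the whole string.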

    >>> re.compile('^am').findall('I am falling in code', 2, 12)
    []        # am is not at the beginning
    >>> re.compile('^am').findall('I am falling in code'[2:12])
    ['am']    # am is the beginning
    >>> re.compile('ing$').findall('I am falling in code', 2, 12)
    ['ing']   # ing is the ending
    >>> re.compile('ing$').findall('I am falling in code'[2:12])
    ['ing']   # ing is the ending

    >>> re.compile('(?<= )am').findall('I am falling in code', 2, 12)
    ['am']    # before am there is a space
    >>> re.compile('(?<= )am').findall('I am falling in code'[2:12])
    []        # before am there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code', 2, 12)
    []        # after ing there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code'[2:12])
    []        # after ing there is no space

    >>> re.compile(r'\bm.....').findall('I am falling in code', 3, 11)
    []
    >>> re.compile(r'\bm.....').findall('I am falling in code'[3:11])
    ['m fall']
    >>> re.compile(r'.....n\b').findall('I am falling in code', 3, 11)
    ['fallin']
    >>> re.compile(r'.....n\b').findall('I am falling in code'[3:11])
    ['fallin']

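My question is: why are matching at the beginning and matching at the end inconsistent? Why do pos and endpos treat the end as the real end, while the start is not treated as the real start?

Is there any way to make pos and endpos imitate slicing? Python copies the string when slicing instead of just referencing the old one, so when working with a big string many times it would be more efficient to use pos and endpos than to slice.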

Answer 1

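The starting position argument pos is especially useful for writing lexical analysers, for example. The performance difference between slicing a string with [pos:] and using the pos parameter might seem insignificant, but it certainly is not; see, for example, this bug report in the JsLex lexer.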
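To get a feel for the cost of slicing, here is a rough timing sketch. The input string, the pattern and the repetition count are made up for illustration, and the absolute numbers will vary by machine and Python version, so no particular result is claimed here.

```python
import re
import timeit

# Hypothetical input: a large string of short words (about 600 KB).
text = "ab " * 200_000
word = re.compile(r"\w+")
start = len(text) // 2  # a multiple of 3, so it lands on the 'a' of a word


def with_slice():
    # text[start:] builds a new ~300 KB string on every call
    return word.match(text[start:])


def with_pos():
    # same match attempt at the same place, but no copy is made
    return word.match(text, start)


t_slice = timeit.timeit(with_slice, number=200)
t_pos = timeit.timeit(with_pos, number=200)
print(f"slice: {t_slice:.4f}s  pos: {t_pos:.4f}s")
```

Both functions find the same word; only the reported span differs, because with pos the offsets refer to the original string.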

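Indeed, ^ matches at the real beginning of the string or, if MULTILINE is specified, also at the beginning of a line. This is by design, so that a regex-based scanner can easily distinguish the real beginning of a line or of the input from some other point on a line or within the input.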
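A small sketch of that distinction, using a made-up two-line string: with pos pointing at the start of the second line, plain ^ does not match, while ^ under re.MULTILINE does, because it checks whether the preceding character is a newline.

```python
import re

text = "first line\nsecond line"
pos = text.index("second")  # index just after the newline

plain = re.compile(r"^\w+")
multi = re.compile(r"^\w+", re.MULTILINE)

# Plain ^ only matches at the real start of the string, so nothing is found.
at_line_start_plain = plain.search(text, pos)

# With MULTILINE, ^ also matches right after a newline, even when the
# search starts at pos.
at_line_start_multi = multi.search(text, pos)

# One character further in, we are in the middle of a line: no match.
mid_line_multi = multi.search(text, pos + 1)

print(at_line_start_plain, at_line_start_multi, mid_line_multi)
```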

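Do note that you can also use the regex.match(string[, pos[, endpos]]) function to anchor the match at the beginning of the string or at the position specified by pos; so instead of doing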

>>> re.compile('^am').findall('I am falling in code', 2, 12)
[]

you would generally implement a scanner as

>>> match = re.compile('am').match('I am falling in code', 2, 12)
>>> match
<_sre.SRE_Match object; span=(2, 4), match='am'>

and then set pos to match.end() (which in this case returns 4) for the subsequent matching operations.

The match must be found starting exactly at pos:
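The scanner loop described here can be sketched as follows. The token pattern and the tokenize helper are hypothetical, made up for illustration; a real lexer such as JsLex is considerably more involved.

```python
import re

# Hypothetical token grammar: optional whitespace, then exactly one token.
TOKEN = re.compile(
    r"\s*(?:(?P<number>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[-+*/=()]))"
)


def tokenize(text):
    """Return (kind, value) pairs, advancing pos instead of slicing."""
    tokens = []
    pos = 0
    end = len(text.rstrip())  # ignore trailing whitespace
    while pos < end:
        # match() anchors at pos, so no leading ^ is needed in the pattern
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at or after position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()  # continue where this token stopped
    return tokens


print(tokenize("x = 42 + y"))
```

The string is never copied: each call to match just starts at a later offset in the same object.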

>>> re.compile('am').match('I am falling in code', 1, 12)
>>>

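(Notice how .match is anchored at the beginning of the input as if by an implicit ^, but not at the end of the input. This is often a source of errors, because people assume that match has both an implicit ^ and an implicit $. Python 3.4 added regex.fullmatch, which does this.)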
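A short illustration of that difference on the string from the question; the pattern is chosen only for the example. fullmatch accepts the same pos and endpos arguments as match.

```python
import re

text = 'I am falling in code'
word = re.compile(r'[a-z]+')

# text[5:15] is 'falling in'. match() is happy with a prefix of the range.
prefix = word.match(text, 5, 15)

# fullmatch() requires the whole range to match; the space prevents that.
whole = word.fullmatch(text, 5, 15)

# text[5:12] is 'falling', which the pattern covers completely.
exact = word.fullmatch(text, 5, 12)

print(prefix, whole, exact)
```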

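As for why the endpos parameter is not consistent with pos, I do not know exactly, but it also makes some sense to me: in Python 2 there is no fullmatch, so anchoring with $ there is the only way to ensure that the entire span must be matched.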
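As a sketch of that idiom: with endpos treated as the end of the string, match plus a trailing $ forces the whole range to be consumed. One caveat worth keeping in mind is that $ also matches just before a trailing newline, so it is not exactly the same as fullmatch; \Z is the stricter anchor. The patterns below are chosen only for illustration.

```python
import re

text = 'I am falling in code'
anchored = re.compile(r'[a-z]+$')

# endpos=12 acts as the end of the string, so $ matches there.
whole_range = anchored.match(text, 5, 12)

# With endpos=15 the range is 'falling in'; $ cannot match after 'falling'.
partial_range = anchored.match(text, 5, 15)

# Caveat: $ tolerates one trailing newline, \Z and fullmatch() do not.
dollar_newline = re.compile(r'[a-z]+$').match('falling\n')
strict_newline = re.compile(r'[a-z]+\Z').match('falling\n')
full_newline = re.compile(r'[a-z]+').fullmatch('falling\n')

print(whole_range, partial_range)
print(dollar_newline, strict_newline, full_newline)
```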

Answer 2

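This sounds like a bug in Python, but if you want to slice by reference instead of copying the string, you can use the Python 2 builtin buffer.

For example:

s = "long string" * 100
# Python 2 only: offset 5, size 10, i.e. the same characters as s[5:15]
substr = buffer(s, 5, 10)

This creates a view of the substring without copying the data, so it should allow large strings to be split efficiently. Note that writing buffer(s)[5:15] would not help, because slicing a buffer returns a new str.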
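buffer does not exist in Python 3. A rough analogue, offered here as an assumption rather than as part of the original answer, is memoryview, which only works on bytes-like objects, so the text has to be encoded first and all offsets become byte offsets. A minimal sketch:

```python
import re

# The question's string, encoded so that memoryview can wrap it.
data = 'I am falling in code'.encode('ascii')

# Slicing a memoryview gives another view on the same bytes, not a copy.
view = memoryview(data)[2:12]

# A bytes pattern can be applied to the view; ^ now matches at the start
# of the view, just as it would for a sliced copy.
hits = re.compile(rb'^am').findall(view)

print(bytes(view), [bytes(h) for h in hits])
```

For str input this does not apply directly, since str does not expose the buffer protocol; there, pos and endpos remain the copy-free option.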

Answer 1
1 2015-07-22 12:21:30

Answer 2
0 2015-06-27 02:51:02
