
Why is regex search in substring “not completely equivalent to slicing the string” in Python?

As the documentation states, using regex.search(string, pos, endpos) is not completely equivalent to slicing the string, i.e. regex.search(string[pos:endpos]). The regex is not matched as if the string started at pos, so ^ does not match the beginning of the substring; it only matches the real beginning of the whole string. However, $ matches at either the end of the substring or the end of the whole string.

    >>> re.compile('^am').findall('I am falling in code', 2, 12)
    []        # am is not at the beginning
    >>> re.compile('^am').findall('I am falling in code'[2:12])
    ['am']    # am is the beginning
    >>> re.compile('ing$').findall('I am falling in code', 2, 12)
    ['ing']   # ing is the ending
    >>> re.compile('ing$').findall('I am falling in code'[2:12])
    ['ing']   # ing is the ending

    >>> re.compile('(?<= )am').findall('I am falling in code', 2, 12)
    ['am']    # before am there is a space
    >>> re.compile('(?<= )am').findall('I am falling in code'[2:12])
    []        # before am there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code', 2, 12)
    []        # after ing there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code'[2:12])
    []        # after ing there is no space

    >>> re.compile(r'\bm.....').findall('I am falling in code', 3, 11)
    []
    >>> re.compile(r'\bm.....').findall('I am falling in code'[3:11])
    ['m fall']
    >>> re.compile(r'.....n\b').findall('I am falling in code', 3, 11)
    ['fallin']
    >>> re.compile(r'.....n\b').findall('I am falling in code'[3:11])
    ['fallin']

My questions are: why are beginning and ending matches not consistent? Why does using pos and endpos treat the end as the real end, while the start is not treated as the real start?

Is there any way to make pos and endpos imitate slicing? Because Python copies the string when slicing instead of just referencing the old one, it would be more efficient to use pos and endpos instead of slicing when working with a big string multiple times.

The starting position argument pos is especially useful for writing lexical analysers, for example. The performance difference between slicing a string with [pos:] and using the pos parameter might seem insignificant, but it certainly is not; see for example this bug report in the JsLex lexer.

Indeed, ^ matches at the real beginning of the string or, if MULTILINE is specified, also at the beginning of a line. This is by design, so that a scanner based on regular expressions can easily distinguish between the real beginning of a line or of the input and some other point on a line or within the input.
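To make the MULTILINE remark concrete, here is a minimal sketch. The sample text and variable names are made up for illustration; only the `re` calls come from the standard library.

```python
import re

# Illustrative input: "am" appears mid-line at index 2 and at a line start at index 5.
text = "I am\nam falling"

plain = re.compile(r"^am")
multi = re.compile(r"^am", re.MULTILINE)

# pos=2 points at the first "am", but that is not the real start of the string,
# so the plain pattern finds nothing anywhere from pos onwards.
plain_hits = [m.start() for m in plain.finditer(text, 2)]

# With MULTILINE, "^" also matches right after the newline at index 4.
multi_hits = [m.start() for m in multi.finditer(text, 2)]

print(plain_hits, multi_hits)
```

In both cases pos only moves the point where the search starts; it does not change what ^ considers a beginning.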

Do note that you can also use the regex.match(string[, pos[, endpos]]) function to anchor the match to the beginning of the string or to the position specified by pos; thus instead of doing

    >>> re.compile('^am').findall('I am falling in code', 2, 12)
    []

you'd generally implement a scanner as

    >>> match = re.compile('am').match('I am falling in code', 2, 12)
    >>> match
    <_sre.SRE_Match object; span=(2, 4), match='am'>

and then set pos to match.end() (which in this case returns 4) for the successive matching operations.
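The scanning loop described here could look roughly like the sketch below. The token table, the `tokenize` name, and the choice to skip whitespace are assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical token patterns, chosen only for this example.
TOKEN_SPEC = [
    ("WORD", re.compile(r"[A-Za-z]+")),
    ("SPACE", re.compile(r"\s+")),
]


def tokenize(text):
    """Yield (kind, lexeme) pairs by matching at an advancing pos."""
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_SPEC:
            # match() is anchored at pos, so no slice of text is ever created.
            match = pattern.match(text, pos)
            if match:
                break
        else:
            raise ValueError(f"unexpected character at index {pos}: {text[pos]!r}")
        if kind != "SPACE":
            yield kind, match.group()
        # Continue scanning right after the token that was just consumed.
        pos = match.end()


tokens = list(tokenize("I am falling in code"))
print(tokens)
```

Every pattern here consumes at least one character, so pos always advances and the loop terminates.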

The match must be found starting exactly at pos:

    >>> re.compile('am').match('I am falling in code', 1, 12)
    >>>

(Notice how .match is anchored at the beginning of the input as if by an implicit ^, but not at the end of the input. This is often a source of errors, as people believe match has both an implicit ^ and $. Python 3.4 added regex.fullmatch, which does this.)
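A small sketch of the difference, reusing the sentence from the question; the pattern and variable names are illustrative only.

```python
import re

text = "I am falling in code"
word = re.compile(r"[a-z]+")

# match() anchors only the start: it accepts "am" at pos=2 even though
# the window text[2:12] ("am falling") continues after it.
prefix = word.match(text, 2, 12)

# fullmatch() (Python 3.4+) must consume the whole window up to endpos.
whole_fails = word.fullmatch(text, 2, 12)  # the space at index 4 blocks it
whole_ok = word.fullmatch(text, 5, 12)     # "falling" fills text[5:12]

print(prefix.span(), whole_fails, whole_ok.group())
```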


As for why the endpos parameter is not consistent with pos, I do not know exactly, but it also makes some sense to me: in Python 2 there is no fullmatch, and anchoring with $ is the only way there to ensure that the entire span is matched.
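As a rough illustration of that workaround, match combined with a trailing $ and an endpos behaves much like a full match over the window. The pattern below is an assumption for the example. Note also that $ matches just before a trailing newline, so `\Z` is the stricter anchor if that matters.

```python
import re

text = "I am falling in code"
word_to_end = re.compile(r"[a-z]+$")

# match() pins the start at pos, and "$" pins the end at endpos.
covers_window = word_to_end.match(text, 5, 12)  # window is "falling"
stops_early = word_to_end.match(text, 2, 12)    # window is "am falling"

print(covers_window.group(), stops_early)
```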

This sounds like a bug in Python, but if you want to slice by reference instead of copying the string, you can use the Python 2 built-in buffer.

For example:

    s = "long string" * 100
    # buffer() exists only in Python 2. Slicing a buffer (buf[5:15]) would
    # return a new str, so pass offset and size to get a view without copying.
    substr = buffer(s, 5, 10)  # refers to the same data as s[5:15]

This creates a view of the substring without copying the data, so it should allow efficient splitting of large strings.
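On Python 3, where buffer no longer exists, a similar effect can be sketched with memoryview. It only works for bytes-like data, not for str, so this assumes the text is available as bytes. The sample data and names are illustrative.

```python
import re

data = b"I am falling in code"
view = memoryview(data)  # wraps the bytes without copying them
window = view[2:12]      # slicing a memoryview does not copy either

# A bytes pattern is required to search a bytes-like object.
# Because window is its own object, "^" matches at the start of the window.
match = re.compile(rb"^am").match(window)

print(match.span(), bytes(window))
```

The reported span is relative to the window, so add the slice offset (2 here) to map it back to the original data.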
